This post contains the latest paper listing fetched from Arxiv.org on 2026-03-02. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: the paper data is fetched from Arxiv.org daily, with an automatic update around 12:30 each day.
Tip: if a given day is not updated on time, either Arxiv published no new papers that day or the update script failed. Fixes are made the same day whenever possible.
Table of Contents
Overview (2026-03-02)
497 papers were updated today, including:
- Natural Language Processing: 59 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 136 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 123 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 139 papers (Machine Learning, cs.LG)
- Multi-Agent Systems: 8 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 33 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 28 papers (Human-Computer Interaction, cs.HC)
Multi-Agent Systems
[MA-0] Sharing is caring: data sharing in multi-agent supply chains
[Quick Read]: This paper tackles the difficulty of optimising multi-agent supply chains under limited observability, in particular how to improve system-wide performance when companies are reluctant to share data. Conventional approaches assume every agent has full observability of the system, which is unrealistic because companies will not share sensitive data. The authors instead propose a multi-agent framework in which the factory agent can choose whether to pass information downstream: it can share nothing, lie, tell the truth, or mix these strategies, yielding a Hidden-Markov process. The key is this controllable information-sharing mechanism combined with cooperative reward shaping to strengthen system-wide coordination. Experiments show that in the low-demand scenario truthful disclosure significantly benefits all actors, while in the high-demand scenario even lying yields a marginal overall improvement, confirming that dynamically matching the information strategy to the state of the environment matters for system performance.
Link: https://arxiv.org/abs/2602.24074
Authors: Wan Wang, Haiyan Wang, Adam Sobey
Affiliations: University of Southampton; Wuhan University of Technology
Categories: Multiagent Systems (cs.MA)
Comments:
Abstract:Modern supply networks are complex interconnected systems. Multi-agent models are increasingly explored to optimise their performance. Most research assumes agents will have full observability of the system by having a single policy represent the agents, which seems unrealistic as this requires companies to share their data. The alternative is to develop a Hidden-Markov Process with separate policies, making the problem challenging to solve. In this paper, we propose a multi-agent system where the factory agent can share information downstream, increasing the observability of the environment. It can choose to share no information, lie, tell the truth or combine these in a mixed strategy. The results show that data sharing can boost the performance, especially when combined with a cooperative reward shaping. In the high demand scenario there is limited ability to change the strategy and therefore no data sharing approach benefits both agents. However, lying benefits the factory enough for an overall system improvement, although only by a relatively small amount compared to the overall reward. In the low demand scenario, the most successful data sharing is telling the truth which benefits all actors significantly.
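The strategy space the abstract describes (share nothing, lie, tell the truth, or mix) can be sketched in a few lines. This is a purely illustrative toy, not the paper's environment; the `signal` function, inventory values, lie model, and mixing weights are all invented for the example.

```python
import random

# Toy sketch of the factory agent's signalling strategies; everything here
# (values, lie model, mixing weights) is an invented illustration.
random.seed(0)

def signal(true_inventory, strategy, mix=(0.5, 0.3, 0.2)):
    """Message the factory sends downstream about its inventory."""
    if strategy == "mixed":
        strategy = random.choices(["truth", "lie", "none"], weights=mix)[0]
    if strategy == "truth":
        return true_inventory
    if strategy == "lie":
        return true_inventory + random.randint(5, 20)  # overstate stock
    return None  # share nothing: downstream stays partially observable

print(signal(40, "truth"), signal(40, "none"))  # 40 None
```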
[MA-1] A Novel Hierarchical Multi-Agent System for Payments Using LLMs PAKDD2026
[Quick Read]: This paper addresses the inability of Large Language Model (LLM) agents to automate payment tasks end to end. Although agentic solutions have attracted wide attention, implementing complete payment workflows remains challenging. The key contribution is the Hierarchical Multi-Agent System for Payments (HMASP), a modular four-level architecture comprising a Conversational Payment Agent (CPA), supervisor agents, routing agents, and a process-summary agent. Shared state variables, decoupled message states, and structured handoff protocols coordinate agents and workflows efficiently, making HMASP, to the authors' knowledge, the first LLM-based system to automate payment workflows end to end.
Link: https://arxiv.org/abs/2602.24068
Authors: Joon Kiat Chua, Donghao Huang, Zhaoxia Wang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 12 pages, 1 figure, 3 tables. Accepted at PAKDD 2026
Abstract:Large language model (LLM) agents, such as OpenAI’s Operator and Claude’s Computer Use, can automate workflows but are unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end-to-end agentic payment workflows. To address this gap, this research proposes the Hierarchical Multi-Agent System for Payments (HMASP), which provides an end-to-end agentic method for completing payment workflows. The proposed HMASP leverages either open-weight or proprietary LLMs and employs a modular architecture consisting of the Conversational Payment Agent (CPA - first agent level), Supervisor agents (second agent level), Routing agents (third agent level), and the Process summary agent (fourth agent level). The CPA serves as the central entry point, handling all external requests and coordinating subsequent tasks across hierarchical levels. HMASP incorporates architectural patterns that enable modular task execution across agents and levels for payment operations, including shared state variables, decoupled message states, and structured handoff protocols that facilitate coordination across agents and workflows. Experimental results demonstrate the feasibility of the proposed HMASP. To our knowledge, HMASP is the first LLM-based multi-agent system to implement end-to-end agentic payment workflows. This work lays a foundation for extending agentic capabilities into the payment domain.
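The shared-state-plus-handoff pattern the abstract describes can be sketched as follows. This is a hypothetical illustration of the four-level flow (CPA → supervisor → router → summary); the function names, state fields, and the amount-based routing rule are assumptions, not HMASP's actual protocol.

```python
# Hypothetical sketch of HMASP-style shared state and structured handoffs.
def make_state(request):
    """Shared state variable passed across agent levels."""
    return {"request": request, "history": [], "level": "CPA"}

def handoff(state, next_level, payload):
    """Structured handoff: record the transition, then route onward."""
    state["history"].append((state["level"], next_level, payload))
    state["level"] = next_level
    return state

def run_payment(request):
    state = make_state(request)
    # CPA (level 1) parses the request and hands off to a supervisor (level 2).
    state = handoff(state, "supervisor", {"intent": "pay"})
    # A routing agent (level 3) picks a payment rail (invented rule).
    rail = "card" if request["amount"] < 100 else "bank"
    state = handoff(state, "router", {"rail": rail})
    # The process-summary agent (level 4) closes the workflow.
    state = handoff(state, "summary", {"status": "completed"})
    return state

state = run_payment({"amount": 250})
print(state["level"], len(state["history"]))  # summary 3
```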
[MA-2] Mixed Choice in Asynchronous Multiparty Session Types
[Quick Read]: This paper addresses protocol consistency for multiparty session types (MSTs) under asynchronous communication, in particular how to guarantee that participants whose states temporarily diverge eventually converge to a globally consistent state. The key is a core construct for asynchronous mixed choice (MC) that permits transient inconsistencies in protocol state between distributed participants while formally guaranteeing that all participants can always eventually reach a mutually consistent state. The authors establish an operational correspondence between global types and their distributed local type projections and, on top of this theory, build a toolchain for specifying and validating asynchronous MST protocols and implementing compliant gen_statem processes in Erlang/OTP, securing correctness and practical usability.
Link: https://arxiv.org/abs/2602.23927
Authors: Laura Bocchi, Raymond Hu, Adriana Laura Voinea, Simon Thompson
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Formal Languages and Automata Theory (cs.FL); Multiagent Systems (cs.MA); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:
Abstract:We present a multiparty session type (MST) framework with asynchronous mixed choice (MC). We propose a core construct for MC that allows transient inconsistencies in protocol state between distributed participants, but ensures all participants can always eventually reach a mutually consistent state. We prove the correctness of our system by establishing a progress property and an operational correspondence between global types and distributed local type projections. Based on our theory, we implement a practical toolchain for specifying and validating asynchronous MST protocols featuring MC, and programming compliant gen_statem processes in Erlang/OTP. We test our framework by using our toolchain to specify and reimplement part of the amqp_client of the RabbitMQ broker for Erlang.
[MA-3] Dynamics of Learning under User Choice: Overspecialization and Peer-Model Probing
[Quick Read]: This paper addresses the degradation of global performance that arises when multiple platforms learn from a shared user pool while each optimises only the local loss on the data it observes. Such learners can fall into a feedback loop the authors call the overspecialization trap: as a platform optimises for the users who already prefer it, it becomes less attractive to everyone else, which further restricts the data distribution it sees, so it can converge to models with arbitrarily poor global risk even when models with low full-population loss exist. The key remedy, inspired by knowledge distillation, is a probing mechanism that lets a learner query the predictions of peer models, indirectly learning about users who do not select it and breaking the biased data loop. The analysis shows that when probing sources are sufficiently informative, e.g. a known market leader or a majority of peers with good global performance, the procedure converges almost surely to a stationary point with bounded full-population risk.
Link: https://arxiv.org/abs/2602.23565
Authors: Adhyyan Narang, Sarah Dean, Lillian J Ratliff, Maryam Fazel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:In many economically relevant contexts where machine learning is deployed, multiple platforms obtain data from the same pool of users, each of whom selects the platform that best serves them. Prior work in this setting focuses exclusively on the “local” losses of learners on the distribution of data that they observe. We find that there exist instances where learners who use existing algorithms almost surely converge to models with arbitrarily poor global performance, even when models with low full-population loss exist. This happens through a feedback-induced mechanism, which we call the overspecialization trap: as learners optimize for users who already prefer them, they become less attractive to users outside this base, which further restricts the data they observe. Inspired by the recent use of knowledge distillation in modern ML, we propose an algorithm that allows learners to “probe” the predictions of peer models, enabling them to learn about users who do not select them. Our analysis characterizes when probing succeeds: this procedure converges almost surely to a stationary point with bounded full-population risk when probing sources are sufficiently informative, e.g., a known market leader or a majority of peers with good global performance. We verify our findings with semi-synthetic experiments on the MovieLens, Census, and Amazon Sentiment datasets.
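A toy simulation makes the overspecialization trap and the probing remedy concrete. Everything here (scalar learners, mean refitting, blending the peer's prediction as a single distillation target) is a stand-in for the paper's setup, not its algorithm.

```python
# Toy sketch: each learner predicts one scalar, each user selects the learner
# with smaller error, and learners refit on the users they attract.
users = [0.0, 0.1, 0.2, 0.9, 1.0]  # true user values: two natural clusters

def assign(users, a, b):
    """Each user selects the learner whose prediction is closer."""
    ua = [u for u in users if abs(u - a) <= abs(u - b)]
    ub = [u for u in users if abs(u - a) > abs(u - b)]
    return ua, ub

def refit(selected, probe=None):
    """Fit the mean of attracted users, optionally blending a probed peer
    prediction as a proxy for unobserved users (the distillation idea)."""
    data = list(selected) + ([probe] if probe is not None else [])
    return sum(data) / len(data) if data else 0.0

a, b = 0.1, 0.9
for _ in range(10):
    ua, ub = assign(users, a, b)
    a = refit(ua, probe=b)  # without probe=..., A never sees B's users
    b = refit(ub, probe=a)

full_risk_a = sum((u - a) ** 2 for u in users) / len(users)
print(round(a, 3), round(b, 3), round(full_risk_a, 3))
```

Without the `probe` argument, each learner only ever refits on its own base, which is exactly the trap the abstract describes.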
[MA-4] Rudder: Steering Prefetching in Distributed GNN Training using LLM Agents
[Quick Read]: This paper targets the performance bottleneck caused by frequent, irregular communication in distributed training of large-scale Graph Neural Networks (GNNs), where neighbour sampling and shifting data distributions leave static prefetching strategies unable to adapt to complex runtime conditions. The key is Rudder, an autonomous prefetching software module embedded in the AWS DistDGL framework. Its core idea is to exploit the In-Context Learning (ICL) and multi-step logical reasoning that emerge in contemporary Large Language Models (LLMs) to adaptively predict and control remote-node access patterns, substantially reducing communication overhead and improving training efficiency.
Link: https://arxiv.org/abs/2602.23556
Authors: Aishwarya Sarkar, Sayan Ghosh, Nathan Tallent, Aman Chadha, Tanya Roosta, Ali Jannesari
Affiliations: Iowa State University; Pacific Northwest National Laboratory; Amazon GenAI; University of California, Berkeley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Performance (cs.PF)
Comments: Accepted to the 40th ACM International Conference on Supercomputing (ICS 2026)
Abstract:Large-scale Graph Neural Networks (GNNs) are typically trained by sampling a vertex’s neighbors to a fixed distance. Because large input graphs are distributed, training requires frequent irregular communication that stalls forward progress. Moreover, fetched data changes with graph, graph distribution, sample and batch parameters, and caching policies. Consequently, any static prefetching method will miss crucial opportunities to adapt to different dynamic conditions. In this paper, we introduce Rudder, a software module embedded in the state-of-the-art AWS DistDGL framework, to autonomously prefetch remote nodes and minimize communication. Rudder’s adaptation contrasts with both standard heuristics and traditional ML classifiers. We observe that the generative AI found in contemporary Large Language Models (LLMs) exhibits emergent properties like In-Context Learning (ICL) for zero-shot tasks, with logical multi-step reasoning. We find this behavior well-suited for adaptive control even with substantial undertraining. Evaluations using standard datasets and unseen configurations on the NERSC Perlmutter supercomputer show up to 91% improvement in end-to-end training performance over baseline DistDGL (no prefetching), and an 82% improvement over static prefetching, reducing communication by over 50%. Our code is available at this https URL.
[MA-5] Optimization of Edge Directions and Weights for Mixed Guidance Graphs in Lifelong Multi-Agent Path Finding
[Quick Read]: This paper addresses the inefficiency of lifelong multi-agent path finding (LMAPF) when guidance relies only on edge-weight optimisation, which is inherently soft: a high weight merely discourages agents from using an edge rather than prohibiting traversal, limiting precise control. Prior Guidance Graph Optimization (GGO) methods optimise a bidirected weighted guidance graph to steer agent movement. The key innovation is to bring edge-direction optimisation into GGO, generalising it to Mixed Guidance Graph Optimization (MGGO) with two methods that optimise both directions and weights: the first optimises edge directions and edge weights in two separate phases; the second uses Quality Diversity algorithms to train a neural network that generates directions and weights directly. The paper also incorporates direction-relevant traffic patterns into a GGO method, producing edge-direction-aware guidance graphs that impose strict constraints on agent behaviour and enable more efficient routing.
Link: https://arxiv.org/abs/2602.23468
Authors: Yulun Zhang, Varun Bhatt, Matthew C. Fontaine, Stefanos Nikolaidis, Jiaoyang Li
Affiliations: Robotics Institute, Carnegie Mellon University; Lila Sciences; Thomas Lord Department of Computer Science, University of Southern California
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:
Abstract:Multi-Agent Path Finding (MAPF) aims to move agents from their start to goal vertices on a graph. Lifelong MAPF (LMAPF) continuously assigns new goals to agents as they complete current ones. To guide agents’ movement in LMAPF, prior works have proposed Guidance Graph Optimization (GGO) methods to optimize a guidance graph, which is a bidirected weighted graph whose directed edges represent moving and waiting actions with edge weights being action costs. Higher edge weights represent higher action costs. However, edge weights only provide soft guidance. An edge with a high weight only discourages agents from using it, instead of prohibiting agents from traversing it. In this paper, we explore the need to incorporate edge directions optimization into GGO, providing strict guidance. We generalize GGO to Mixed Guidance Graph Optimization (MGGO), presenting two MGGO methods capable of optimizing both edge weights and directions. The first optimizes edge directions and edge weights in two phases separately. The second applies Quality Diversity algorithms to optimize a neural network capable of generating edge directions and weights. We also incorporate traffic patterns relevant to edge directions into a GGO method, making it capable of generating edge-direction-aware guidance graphs.
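The difference between soft and strict guidance can be seen on a tiny weighted graph: a high edge weight only discourages a direction, while removing the direction prohibits it. The graph, weights, and Dijkstra helper below are illustrative only, not the paper's method.

```python
import heapq

# Soft vs strict guidance: a heavy weight on B->A only discourages it,
# while deleting the direction prohibits it outright.
def shortest(graph, src, dst):
    """Dijkstra over {node: [(neighbour, cost), ...]}."""
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

soft = {"A": [("B", 1)], "B": [("A", 10), ("C", 1)], "C": []}   # soft guidance
strict = {"A": [("B", 1)], "B": [("C", 1)], "C": []}            # B->A removed

print(shortest(soft, "B", "A"))    # 10.0: discouraged but still possible
print(shortest(strict, "B", "A"))  # inf: prohibited
```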
[MA-6] Enhancing CLIP Robustness via Cross-Modality Alignment NEURIPS2025
[Quick Read]: This paper addresses the high sensitivity of vision-language models (VLMs) such as CLIP to adversarial perturbations in zero-shot classification. Existing defences rely mainly on adversarial fine-tuning or prompt optimisation and overlook the gap between CLIP's text and image feature spaces, a misalignment that is sharply amplified under adversarial perturbations and severely degrades classification performance. The key is COLA (Cross-modality Alignment), an optimal-transport (OT) based framework: it first projects adversarial image embeddings onto the subspace spanned by class text features, filtering out non-semantic distortions while preserving discriminative information; it then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection integrated into the cost computation, yielding stable cross-modal alignment even under attack. COLA is training-free and compatible with existing fine-tuned models; across 14 zero-shot classification benchmarks it proves effective, with an average 6.7% improvement on ImageNet and its variants under PGD attacks.
Link: https://arxiv.org/abs/2510.24038
Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
Affiliations: University of Science and Technology of China; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: NeurIPS 2025 Spotlight
Abstract:Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gaps in CLIP’s encoded features, which is shown as the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
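Step (1) of COLA, projecting an adversarial embedding onto the span of class text features, can be sketched with a least-squares projection. The dimensions, class count, and noise model below are invented for illustration; the paper's exact formulation may differ.

```python
import numpy as np

# Sketch of the subspace-projection step the abstract describes.
rng = np.random.default_rng(0)

T = rng.normal(size=(4, 512))                       # 4 class text features
T = T / np.linalg.norm(T, axis=1, keepdims=True)

clean = 0.5 * T[0] + 0.5 * T[1]                     # semantic image embedding
adv = clean + 0.3 * rng.normal(size=512)            # adversarial perturbation

# Least-squares projection of adv onto span{class text features}.
coef, *_ = np.linalg.lstsq(T.T, adv, rcond=None)
proj = T.T @ coef

err_before = float(np.linalg.norm(adv - clean))
err_after = float(np.linalg.norm(proj - clean))
print(err_after < err_before)  # True: off-subspace noise is filtered out
```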
[MA-7] QD-MAPPER: A Quality Diversity Framework to Automatically Evaluate Multi-Agent Path Finding Algorithms in Diverse Maps
[Quick Read]: This paper addresses the limited generalisation and overfitting that result from evaluating Multi-Agent Path Finding (MAPF) algorithms on small sets of fixed, human-designed maps, which cannot reflect performance across diverse scenarios. The key is QD-MAPPER, a general framework that combines a Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to automatically generate maps with diverse structural patterns, systematically probing the performance boundaries of MAPF algorithms. It supports fair comparisons across algorithm families, including search-based, priority-based, rule-based, and learning-based methods, revealing where each algorithm excels and how they differ in runtime and success rate.
Link: https://arxiv.org/abs/2409.06888
Authors: Cheng Qian, Yulun Zhang, Varun Bhatt, Matthew Christopher Fontaine, Stefanos Nikolaidis, Jiaoyang Li
Affiliations: Carnegie Mellon University; University of Southern California
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 14 pages, 23 figures
Abstract:We use the Quality Diversity (QD) algorithm with Neural Cellular Automata (NCA) to automatically evaluate Multi-Agent Path Finding (MAPF) algorithms by generating diverse maps. Previously, researchers typically evaluated MAPF algorithms on a set of specific, human-designed maps at the initial stage of algorithm design. However, such fixed maps may not cover all scenarios, and algorithms may overfit to the small set of maps. To seek further improvements, systematic evaluations on a diverse suite of maps are needed. In this work, we propose the Quality-Diversity Multi-Agent Path Finding Performance EvaluatoR (QD-MAPPER), a general framework that leverages the QD algorithm to comprehensively characterize the performance of MAPF algorithms by generating maps with diverse patterns, and that enables fair comparisons between MAPF algorithms, informing both algorithm selection and algorithm design. Empirically, we employ this technique to evaluate and compare the behavior of different types of MAPF algorithms, including search-based, priority-based, rule-based, and learning-based algorithms. Through both single-algorithm experiments and comparisons between algorithms, researchers can identify patterns in which each MAPF algorithm excels and detect disparities in runtime or success rates between different algorithms.
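The QD archive at the core of such evaluators can be sketched as a minimal MAP-Elites loop: keep one elite map per behaviour bin. The random map generator, the obstacle-density descriptor, and the random stand-in fitness below are all placeholders for QD-MAPPER's NCA generator and actual measures.

```python
import random

# Toy MAP-Elites-style QD archive (illustrative only).
random.seed(0)

def behavior(m):
    """Descriptor: fraction of obstacle cells, bucketed into 5 bins."""
    flat = [c for row in m for c in row]
    return min(4, int(5 * sum(flat) / len(flat)))

def fitness(m):
    """Stand-in objective (e.g. how sharply the map separates algorithms)."""
    return random.random()

archive = {}  # behaviour bin -> (best fitness, map)
for _ in range(200):
    # Stand-in generator: random 4x4 obstacle maps (the paper evolves an NCA).
    m = [[random.randint(0, 1) for _ in range(4)] for _ in range(4)]
    b, f = behavior(m), fitness(m)
    if b not in archive or f > archive[b][0]:
        archive[b] = (f, m)  # keep one elite per bin -> diverse map suite

print(sorted(archive))
```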
Natural Language Processing
[NLP-0] DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science ICLR2026
[Quick Read]: This paper addresses the lack of accurate benchmarking for large language models (LLMs) on complex multi-step data science tasks, targeting two gaps: the absence of standardized, process-aware evaluation of instruction adherence and process fidelity, and the scarcity of accurately labeled training data. The key is DARE-bench, a benchmark for machine learning modeling and data science instruction following in which every task has verifiable ground truth, enabling objective and reproducible evaluation. DARE-bench comprises 6,300 Kaggle-derived tasks, supplies both large-scale training data and evaluation sets, and supports agentic tools. Fine-tuning on DARE-bench tasks yields substantial gains: supervised fine-tuning boosts Qwen3-32B's accuracy by 1.83x and reinforcement learning boosts Qwen3-4B's accuracy by more than 8x, confirming the benchmark's dual value for evaluation and training.
Link: https://arxiv.org/abs/2602.24288
Authors: Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He, Feng Yan
Affiliations: University of Houston; Snowflake AI Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published as a conference paper at ICLR 2026. 10 pages plus appendix
Abstract:The fast-growing demands in using Large Language Models (LLMs) to tackle complex multi-step data science tasks create an emergent need for accurate benchmarking. There are two major gaps in existing benchmarks: (i) the lack of standardized, process-aware evaluation that captures instruction adherence and process fidelity, and (ii) the scarcity of accurately labeled training data. To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following. Unlike many existing benchmarks that rely on human- or model-based judges, all tasks in DARE-bench have verifiable ground truth, ensuring objective and reproducible evaluation. To cover a broad range of tasks and support agentic tools, DARE-bench consists of 6,300 Kaggle-derived tasks and provides both large-scale training data and evaluation sets. Extensive evaluations show that even highly capable models such as gpt-o4-mini struggle to achieve good performance, especially in machine learning modeling tasks. Using DARE-bench training tasks for fine-tuning can substantially improve model performance. For example, supervised fine-tuning boosts Qwen3-32B’s accuracy by 1.83x and reinforcement learning boosts Qwen3-4B’s accuracy by more than 8x. These significant improvements verify the importance of DARE-bench both as an accurate evaluation benchmark and critical training data.
[NLP-1] Do LLMs Benefit From Their Own Words?
[Quick Read]: This paper revisits the standard practice of retaining the full conversation history, including the model's own past responses, in multi-turn interactions, a design that inflates compute and can propagate errors and stylistic artifacts. The core contribution is a user-turn-only prompting strategy that omits all previous assistant responses. Empirically, a substantial fraction of turns (36.4%) are self-contained prompts that need no assistant history, and in some cases removing assistant-side history improves response quality by avoiding context pollution, where models over-condition on their previous responses and propagate hallucinations or stylistic drift across turns. Building on these findings, the authors design a context-filtering approach that selectively omits assistant-side history, maintaining or improving output quality while cutting cumulative context length by up to 10x.
Link: https://arxiv.org/abs/2602.24287
Authors: Jenny Y. Huang, Leshem Choshen, Ramon Astudillo, Tamara Broderick, Jacob Andreas
Affiliations: Massachusetts Institute of Technology; MIT-IBM Watson AI Lab; IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-turn interactions with large language models typically retain the assistant’s own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
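The user-turn-only baseline the abstract compares against is easy to state in code: drop all prior assistant messages before the next call. The message schema below follows the common chat-API convention and is illustrative; the paper's actual context-filtering approach omits assistant turns selectively, not always.

```python
# User-turn-only prompting: drop all prior assistant responses.
def user_turn_only(messages):
    return [m for m in messages if m["role"] != "assistant"]

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the attached report."},
    {"role": "assistant", "content": "The report argues that ..."},
    {"role": "user", "content": "List the report's three main risks."},
]

filtered = user_turn_only(history)
print([m["role"] for m in filtered])  # ['system', 'user', 'user']
```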
[NLP-2] Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation ICLR2026
[Quick Read]: This paper addresses the significant memory overhead of modern optimizers such as Adam and Muon, whose reliance on first- and second-order momenta constrains the scalability and efficiency of training large language models. The key is to reframe the exponential moving average (EMA) used in these momenta as training a linear regressor via online gradient flow and, building on this equivalence, to introduce LoRA-Pre, a low-rank optimizer for efficient pre-training. LoRA-Pre decomposes the full momentum matrix into a compact low-rank subspace within the online linear learner, shrinking the optimizer's memory footprint while preserving optimization performance. Experiments on Llama-family models show that LoRA-Pre matches or beats baselines using only 1/8 of their rank, and in fine-tuning it also clearly outperforms standard LoRA.
Link: https://arxiv.org/abs/2602.24283
Authors: Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Affiliations: University of Science and Technology of China; NLPR MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Nanjing University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Camera-ready version. Accepted as Oral at ICLR 2026
Abstract:Modern optimizers like Adam and Muon are central to training large language models, but their reliance on first- and second-order momenta introduces significant memory overhead, which constrains scalability and computational efficiency. In this work, we reframe the exponential moving average (EMA) used in these momenta as the training of a linear regressor via online gradient flow. Building on this equivalence, we introduce LoRA-Pre, a novel low-rank optimizer designed for efficient pre-training. Specifically, LoRA-Pre reduces the optimizer’s memory footprint by decomposing the full momentum matrix into a compact low-rank subspace within the online linear learner, thereby maintaining optimization performance while improving memory efficiency. We empirically validate LoRA-Pre’s efficacy by pre-training models from the Llama architecture family, scaling from 60M to 1B parameters. LoRA-Pre achieves the highest performance across all model sizes. Notably, LoRA-Pre demonstrates remarkable rank efficiency, achieving comparable or superior results using only 1/8 the rank of baseline methods. Beyond pre-training, we evaluate LoRA-Pre’s effectiveness in fine-tuning scenarios. With the same rank, LoRA-Pre consistently outperforms all efficient fine-tuning baselines. Specifically, compared to standard LoRA, LoRA-Pre achieves substantial improvements of 3.14 points on Llama-3.1-8B and 6.17 points on Llama-2-7B, validating our approach’s effectiveness across both pre-training and fine-tuning paradigms. Our code is publicly available at this https URL.
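The EMA-to-online-regression equivalence that motivates LoRA-Pre can be checked numerically: an EMA update m ← βm + (1−β)g is exactly one gradient step with learning rate (1−β) on the squared loss ½(m − g)². The scalar values below are arbitrary; LoRA-Pre itself operates on momentum matrices.

```python
# Numerical check of the EMA <-> online gradient step equivalence.
beta, m, g = 0.9, 2.0, -1.0

m_ema = beta * m + (1 - beta) * g   # standard EMA momentum update
m_gd = m - (1 - beta) * (m - g)     # GD step on L(m) = 0.5 * (m - g)**2

print(m_ema, m_gd)  # identical up to float rounding (~1.7)
```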
[NLP-3] Controllable Reasoning Models Are Private Thinkers
[Quick Read]: This paper addresses the unintended leakage of sensitive user data through the reasoning traces of AI agents: even when the final output respects privacy requirements, the internal reasoning may still expose private information. The solution has two key parts: first, fine-tuning on a new instruction-following dataset with explicit constraints on reasoning traces, so that models obey instructions, including privacy restrictions, in the traces as well as in the final answer; second, a generation strategy that decouples reasoning from answer generation using separate LoRA adapters, giving finer control over the reasoning stage. Across multiple models and benchmarks, the method improves privacy preservation by up to 51.9 percentage points while also gaining up to 20.9 points in instruction-following performance.
Link: https://arxiv.org/abs/2602.24210
Authors: Haritz Puerto, Haonan Li, Xudong Han, Timothy Baldwin, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany; Mohamed bin Zayed University of Artificial Intelligence, UAE; LibrAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI agents powered by reasoning models require access to sensitive user data. However, their reasoning traces are difficult to control, which can result in the unintended leakage of private information to external parties. We propose training models to follow instructions not only in the final answer, but also in reasoning traces, potentially under different constraints. We hypothesize that improving their instruction following abilities in the reasoning traces can improve their privacy-preservation skills. To demonstrate this, we fine-tune models on a new instruction-following dataset with explicit restrictions on reasoning traces. We further introduce a generation strategy that decouples reasoning and answer generation using separate LoRA adapters. We evaluate our approach on six models from two model families, ranging from 1.7B to 14B parameters, across two instruction-following benchmarks and two privacy benchmarks. Our method yields substantial improvements, achieving gains of up to 20.9 points in instruction-following performance and up to 51.9 percentage points on privacy benchmarks. These improvements, however, can come at the cost of task utility, due to the trade-off between reasoning performance and instruction-following abilities. Overall, our results show that improving instruction-following behavior in reasoning models can significantly enhance privacy, suggesting a promising direction for the development of future privacy-aware agents. Our code and data are available at this https URL
[NLP-4] Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume ICML2025 ICLR2025
[Quick Read]: This paper addresses the lack of effective uncertainty quantification for Multimodal Large Language Models (MLLMs), which can produce plausible but erroneous outputs and thus hinder reliable deployment. Existing uncertainty metrics have practical limits: they are designed for specific modalities, rely on external tools, or are computationally expensive. The key is UMPIRE, a training-free framework that works across input and output modalities using only the model's own internal modality features: it computes the incoherence-adjusted semantic volume of sampled responses for a given task instance, capturing both the global semantic diversity of samples and the local incoherence of responses based on the model's internal confidence, and delivers strong error detection and uncertainty calibration across image, audio, and video-text benchmarks.
Link: https://arxiv.org/abs/2602.24195
Authors: Gregory Kang Ruey Lau, Hieu Dao, Nicole Kan Hui Lin, Bryan Kian Hsiang Low
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Earlier versions presented at ICLR 2025 QUESTION workshop and ICML 2025 R2-FM workshop
Abstract:Despite their capabilities, Multimodal Large Language Models (MLLMs) may produce plausible but erroneous outputs, hindering reliable deployment. Accurate uncertainty metrics could enable escalation of unreliable queries to human experts or larger models for improved performance. However, existing uncertainty metrics have practical constraints, such as being designed only for specific modalities, reliant on external tools, or computationally expensive. We introduce UMPIRE, a training-free uncertainty quantification framework for MLLMs that works efficiently across various input and output modalities without external tools, relying only on the models’ own internal modality features. UMPIRE computes the incoherence-adjusted semantic volume of sampled MLLM responses for a given task instance, effectively capturing both the global semantic diversity of samples and the local incoherence of responses based on internal model confidence. We propose uncertainty desiderata for MLLMs and provide theoretical analysis motivating UMPIRE’s design. Extensive experiments show that UMPIRE consistently outperforms baseline metrics in error detection and uncertainty calibration across image, audio, and video-text benchmarks, including adversarial and out-of-distribution settings. We also demonstrate UMPIRE’s generalization to non-text output tasks, including image and audio generation.
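The semantic-volume idea can be illustrated with a Gram-determinant volume over response embeddings: near-duplicate responses span almost no volume, diverse ones span a large one. This sketch omits UMPIRE's incoherence adjustment and internal modality features; the embeddings are toy vectors.

```python
import numpy as np

# Illustrative semantic volume: sqrt(det(E E^T)) over response embeddings.
def semantic_volume(E):
    G = E @ E.T  # Gram matrix of the sampled responses
    return float(np.sqrt(max(np.linalg.det(G), 0.0)))

similar = np.array([[1.00, 0.00, 0.00],
                    [0.99, 0.01, 0.00],
                    [0.98, 0.00, 0.02]])  # near-duplicate responses
diverse = np.eye(3)                       # semantically spread responses

print(semantic_volume(similar) < semantic_volume(diverse))  # True
```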
[NLP-5] MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
[Quick Read]: This paper addresses the weakness of current language models in multi-turn collaboration, in particular their inability to beat non-interactive baselines on collaborative tasks that involve private information. The key is MT-PingEval, a scalable evaluation methodology built on a suite of collaborative games that require participants to share private information, combined with an interactive scaling analysis in which a fixed token budget is divided over a variable number of turns, systematically exposing the limitations of language models in planning and executing multi-turn collaborative conversations.
Link: https://arxiv.org/abs/2602.24188
Authors: Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:We present a scalable methodology for evaluating language models in multi-turn interactions, using a suite of collaborative games that require effective communication about private information. This enables an interactive scaling analysis, in which a fixed token budget is divided over a variable number of turns. We find that in many cases, language models are unable to use interactive collaboration to improve over the non-interactive baseline scenario in which one agent attempts to summarize its information and the other agent immediately acts – despite substantial headroom. This suggests that state-of-the-art models still suffer from significant weaknesses in planning and executing multi-turn collaborative conversations. We analyze the linguistic features of these dialogues, assessing the roles of sycophancy, information density, and discourse coherence. While there is no single linguistic explanation for the collaborative weaknesses of contemporary language models, we note that humans achieve comparable task success at superior token efficiency by producing dialogues that are more coherent than those produced by most language models. The proactive management of private information is a defining feature of real-world communication, and we hope that MT-PingEval will drive further work towards improving this capability.
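The interactive scaling analysis divides a fixed token budget over a variable number of turns; a minimal version of that bookkeeping might look like this (the even split and the numbers are assumptions, not the paper's protocol).

```python
# Divide a fixed token budget over a variable number of turns.
def turn_budgets(total_tokens, n_turns):
    """Even split, remainder assigned to the earliest turns."""
    base, extra = divmod(total_tokens, n_turns)
    return [base + (1 if i < extra else 0) for i in range(n_turns)]

for n in (1, 2, 4):
    print(n, turn_budgets(1000, n))  # sums to 1000 for every n
```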
[NLP-6] Task-Centric Acceleration of Small-Language Models
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在高吞吐、低延迟场景下推理效率不足的问题。其核心挑战在于如何在不显著牺牲任务性能的前提下提升SLM的生成速度。解决方案的关键在于提出TASC(Task-Adaptive Sequence Compression)框架,包含两个互补策略:一是TASC-ft,在微调阶段通过迭代扩充词表词汇(基于高频输出n-gram),使模型能利用扩展后的词汇空间;二是TASC-spec,在推理阶段采用无需训练的推测解码(speculative decoding)方法,基于任务语料构建n-gram草稿模型,并混合任务与上下文n-gram以避免词汇对齐约束,从而实现高效且无额外训练负担的加速。
链接: https://arxiv.org/abs/2602.24174
作者: Dor Tsur,Sharon Adar,Ran Levy
机构: Amazon
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Small language models (SLMs) have emerged as efficient alternatives to large language models for task-specific applications. However, they are often employed in high-volume, low-latency settings, where efficiency is crucial. We propose TASC, Task-Adaptive Sequence Compression, a framework for SLM acceleration comprising two use-cases: When performing SLM fine-tuning, we propose TASC-ft, which iteratively enriches the tokenizer vocabulary with high-frequency output n-grams and then fine-tunes the model to utilize the expanded vocabulary. Next, we propose an inference-time method, termed TASC-spec. TASC-spec is a lightweight, training-free speculative decoding method that constructs an n-gram draft model from the task’s output corpus, mixing task and context n-grams. TASC-spec avoids any additional training, while bypassing draft-target vocabulary alignment constraints. We demonstrate the effectiveness of both methods across multiple low output-variability generation tasks. Our methods show consistent improvements in inference efficiency while maintaining task performance.
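A draft model of the kind TASC-spec builds can be sketched as a bigram table over the task's output corpus, greedily proposing a few tokens for the target model to verify. The corpus, bigram order, and greedy proposal rule below are illustrative simplifications, not the paper's construction.

```python
from collections import Counter, defaultdict

# Toy bigram draft model in the spirit of TASC-spec.
def build_bigram_draft(corpus):
    table = defaultdict(Counter)
    for sent in corpus:
        toks = sent.split()
        for a, b in zip(toks, toks[1:]):
            table[a][b] += 1
    return table

def draft(table, prefix, k=3):
    """Greedily propose up to k tokens for the target model to verify."""
    out, cur = [], prefix[-1]
    for _ in range(k):
        if not table[cur]:
            break
        cur = table[cur].most_common(1)[0][0]
        out.append(cur)
    return out

corpus = ["the order has shipped",
          "the order has shipped fast",
          "the order is delayed"]
table = build_bigram_draft(corpus)
print(draft(table, ["the"]))  # ['order', 'has', 'shipped']
```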
[NLP-7] ArgLLM-App: An Interactive System for Argumentative Reasoning with Large Language Models AAMAS2026
【Quick Read】: This paper addresses the lack of explainability and human contestability in decision-making with generative AI, i.e., how to make decisions produced by large language model (LLM)-based systems both understandable to human users and open to scrutiny and rebuttal of the underlying reasoning. The key to the solution is ArgLLM-App, a web-based system built on Argumentative LLMs (ArgLLMs) driven by computational argumentation: it visualizes the decision process, lets users interactively identify and contest mistakes in the system's reasoning, and is highly modular, allowing information from trusted external sources to be integrated to strengthen reasoning reliability.
Link: https://arxiv.org/abs/2602.24172
Authors: Adam Dejl, Deniz Gorur, Francesca Toni
Affiliations: Imperial College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AAMAS 2026 Demonstration Track
Abstract:Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered agents for binary tasks. ArgLLM-App supports visualisation of the produced explanations and interaction with human users, allowing them to identify and contest any mistakes in the system’s reasoning. It is highly modular and enables drawing information from trusted external sources. ArgLLM-App is publicly available at this https URL, with a video demonstration at this https URL.
[NLP-8] CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning
【Quick Read】: This paper tackles the hybrid-capabilities reasoning challenge mobile agents face when executing user instructions: achieving both decoupled enhancement and balanced integration across the reasoning stages of screen summary, subtask planning, action decision, and action function. The core solution is the Channel-of-Mobile-Experts (CoME) architecture, composed of four experts each aligned with a specific reasoning stage and activated dynamically via output-oriented activation to generate that stage's output tokens. The key innovation is a progressive training strategy: Expert-FT enhances each expert's capability in a decoupled way, Router-FT aligns expert activation with the reasoning stages, and CoT-FT promotes collaboration and balanced optimization across capabilities. In addition, InfoGain-Driven DPO (Info-DPO) uses information gain to assess the contribution of intermediate steps, effectively mitigating error propagation in hybrid-capabilities reasoning.
Link: https://arxiv.org/abs/2602.24142
Authors: Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function. However, existing agents struggle to achieve both decoupled enhancement and balanced integration of these capabilities. To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage. CoME activates the corresponding expert to generate output tokens in each reasoning stage via output-oriented activation. To empower CoME with hybrid-capabilities reasoning, we introduce a progressive training strategy: Expert-FT enables decoupling and enhancement of different experts' capabilities; Router-FT aligns expert activation with the different reasoning stages; CoT-FT facilitates seamless collaboration and balanced optimization across multiple capabilities. To mitigate error propagation in hybrid-capabilities reasoning, we propose InfoGain-Driven DPO (Info-DPO), which uses information gain to evaluate the contribution of each intermediate step, thereby guiding CoME toward more informative reasoning. Comprehensive experiments show that CoME outperforms dense mobile agents and MoE methods on both AITZ and AMEX datasets.
[NLP-9] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
【Quick Read】: This paper addresses a problem of visual document retrieval-augmented generation (Visual RAG) systems: page-level chunking introduces redundant context that overloads the generator's attention and dilutes key evidence, while a limited visual-token budget raises the risk of hallucination. The key to the solution is AgenticOCR, a dynamic parsing paradigm that turns optical character recognition (OCR) from static full-text processing into a query-driven, on-demand extraction system. It autonomously analyzes document layout, identifies regions of interest in a "thinking with images" manner, and decompresses visual tokens on demand, effectively decoupling retrieval granularity from rigid page-level chunking and improving both the efficiency and accuracy of visual RAG systems.
Link: https://arxiv.org/abs/2602.24134
Authors: Zhengren Wang, Dongsheng Ma, Huaping Zhong, Jiayu Li, Wentao Zhang, Bin Wang, Conghui He
Affiliations: Peking University; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:The expansion of retrieval-augmented generation (RAG) into multimodal domains has intensified the challenge for processing complex visual documents, such as financial reports. While page-level chunking and retrieval is a natural starting point, it creates a critical bottleneck: delivering entire pages to the generator introduces excessive extraneous context. This not only overloads the generator’s attention mechanism but also dilutes the most salient evidence. Moreover, compressing these information-rich pages into a limited visual token budget further increases the risk of hallucinations. To address this, we introduce AgenticOCR, a dynamic parsing paradigm that transforms optical character recognition (OCR) from a static, full-text process into a query-driven, on-demand extraction system. By autonomously analyzing document layout in a “thinking with images” manner, AgenticOCR identifies and selectively recognizes regions of interest. This approach performs on-demand decompression of visual tokens precisely where needed, effectively decoupling retrieval granularity from rigid page-level chunking. AgenticOCR has the potential to serve as the “third building block” of the visual document RAG stack, operating alongside and enhancing standard Embedding and Reranking modules. Experimental results demonstrate that AgenticOCR improves both the efficiency and accuracy of visual RAG systems, achieving expert-level performance in long document understanding. Code and models are available at this https URL.
[NLP-10] Terminology Rarity Predicts Catastrophic Failure in LLM Translation of Low-Resource Ancient Languages: Evidence from Ancient Greek
【Quick Read】: This paper addresses the evaluation of large language model (LLM) translation quality for Ancient Greek technical prose, in particular the difficulty of evaluating translations of low-resource ancient languages that lack reliable reference texts. The core challenge is measuring LLM translation quality objectively without references and identifying the factors that determine success or failure. The key to the solution is an expert-led, reference-free human evaluation framework: all 60 translations are systematically scored with a modified Multidimensional Quality Metrics (MQM) scheme, combined with analyses of several automated metrics (BLEU, BERTScore, etc.) and of terminology rarity (quantified via frequency in the Diorisis Ancient Greek Corpus). Terminology rarity emerges as the dominant predictor of translation failure (r = -0.97), yielding empirical evidence and methodological guidance for using LLMs in Classical scholarship and for designing automated evaluation tools for low-resource ancient languages.
Link: https://arxiv.org/abs/2602.24119
Authors: James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Article + supplementary information
Abstract:This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works by the Greek physician Galen of Pergamum (ca. 129-216 CE): On Mixtures, which has two published English translations, and On the Composition of Drugs according to Kinds, which has never been fully translated into English. We assess translation quality using both standard automated evaluation metrics (BLEU, chrF++, METEOR, ROUGE-L, BERTScore, COMET, BLEURT) and expert human evaluation via a modified Multidimensional Quality Metrics (MQM) framework applied to all 60 translations by a team of domain specialists. On the previously translated expository text, LLMs achieved high translation quality (mean MQM score 95.2/100), with performance approaching expert level. On the untranslated pharmacological text, aggregate quality was lower (79.9/100) but with high variance driven by two passages presenting extreme terminological density; excluding these, scores converged to within 4 points of the translated text. Terminology rarity, operationalized via corpus frequency in the literary Diorisis Ancient Greek Corpus, emerged as a strong predictor of translation failure (r = -.97 for passage-level quality on the untranslated text). Automated metrics showed moderate correlation with human judgment overall on the text with a wide quality spread (Composition), but no metric discriminated among high-quality translations. We discuss implications for the use of LLMs in Classical scholarship and for the design of automated evaluation pipelines for low-resource ancient languages.
[NLP-11] Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification
【Quick Read】: This paper targets the logical inconsistencies common in generative AI for radiology report generation, where a model's diagnostic conclusions contradict its own perceived findings or omit entailed inferences. Traditional lexical-overlap metrics fail to capture such reasoning errors, especially in reference-free settings. The key to the solution is a neurosymbolic verification framework that autoformalizes free-text radiographic findings into structured propositional evidence and, using an SMT solver (Z3) together with a clinical knowledge base, mathematically verifies the entailment of each diagnostic claim, flagging unsupported hallucinations and omitted conclusions. Validated on several chest X-ray datasets, the method serves as a post-hoc audit that significantly improves the diagnostic soundness and precision of generated reports.
Link: https://arxiv.org/abs/2602.24111
Authors: Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh, Brandon Lee, Vipin Chaudhary, Gourav Datta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments:
Abstract:Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
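The entailment check the paper delegates to an SMT solver can be illustrated with a dependency-free truth-table version: a claim is entailed by the knowledge base and findings iff no truth assignment satisfies the premises while falsifying the claim. The clinical rule and atom names below are hypothetical stand-ins, not the paper's knowledge base.

```python
from itertools import product

def entails(kb, findings, claim, atoms):
    """Check kb AND findings |= claim by truth-table enumeration.

    Each formula is a function from an assignment dict to bool. This is a
    dependency-free stand-in for the Z3 entailment call described in the
    paper: the claim is entailed iff no assignment satisfies all premises
    while falsifying the claim."""
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(f(env) for f in kb) and all(f(env) for f in findings) and not claim(env):
            return False
    return True

# Hypothetical clinical rule: consolidation AND air bronchogram -> pneumonia.
atoms = ["consolidation", "bronchogram", "pneumonia"]
kb = [lambda e: (not (e["consolidation"] and e["bronchogram"])) or e["pneumonia"]]
findings = [lambda e: e["consolidation"], lambda e: e["bronchogram"]]

print(entails(kb, findings, lambda e: e["pneumonia"], atoms))      # True: entailed
print(entails(kb, findings, lambda e: not e["pneumonia"], atoms))  # False: unsupported claim
```

An SMT solver performs the same check without enumerating assignments, by testing whether premises plus the negated claim are unsatisfiable.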
[NLP-12] Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance
【Quick Read】: This paper addresses a limitation of reinforcement learning from verifiable rewards (RLVR): standard outcome-based supervision penalizes trajectories that are largely correct but fail due to a few missteps as heavily as completely erroneous ones, causing the model to discard valuable near-correct rollouts, reducing rollout diversity, and prematurely narrowing the exploration space. The key to the solution is the SCOPE framework, which uses a process reward model to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification to partially correct trajectories. This salvages near-correct rollouts and improves their utilization, raises the exploration diversity score by 13.5%, and achieves new state-of-the-art results: 46.6% average accuracy on math reasoning and 53.4% accuracy on out-of-distribution reasoning tasks.
Link: https://arxiv.org/abs/2602.24110
Authors: Yanwei Ren, Haotian Zhang, Likang Xiao, Xikai Zhang, Jiaxing Huang, Jiayan Qiu, Baosheng Yu, Quan Chen, Liu Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the complex reasoning capabilities of Large Reasoning Models. However, standard outcome-based supervision suffers from a critical limitation: it penalizes trajectories that are largely correct but fail due to several missteps as heavily as completely erroneous ones. This coarse feedback signal causes the model to discard valuable largely correct rollouts, leading to a degradation in rollout diversity that prematurely narrows the exploration space. While Process Reward Models have demonstrated efficacy in providing reliable step-wise verification for test-time scaling, naively integrating these signals into RLVR as dense rewards proves ineffective. Existing methods attempt to introduce off-policy guidance via whole-trajectory replacement that often falls outside the policy model's distribution, but still fail to utilize the largely correct rollouts generated by the model itself and thus do not effectively mitigate the narrowing of the exploration space. To address these issues, we propose SCOPE (Step-wise Correction for On-Policy Exploration), a novel framework that utilizes Process Reward Models to pinpoint the first erroneous step in suboptimal rollouts and applies fine-grained, step-wise off-policy rectification. By applying precise refinement on partially correct rollouts, our method effectively salvages partially correct trajectories and increases diversity score by 13.5%, thereby sustaining a broad exploration space. Extensive experiments demonstrate that our approach establishes new state-of-the-art results, achieving an average accuracy of 46.6% on math reasoning and exhibiting robust generalization with 53.4% accuracy on out-of-distribution reasoning tasks.
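The first-error-step splice that SCOPE performs can be sketched as follows. The threshold-based error detection and the step strings are illustrative assumptions; the paper's process reward model is not reproduced here.

```python
def first_error_step(prm_scores, threshold=0.5):
    """Index of the first step a process reward model scores below threshold, or -1."""
    for i, s in enumerate(prm_scores):
        if s < threshold:
            return i
    return -1

def salvage(steps, prm_scores, corrected_step, threshold=0.5):
    """Keep the verified prefix of a largely-correct rollout and splice in an
    off-policy correction at the first flawed step. The threshold is a
    hypothetical knob standing in for the paper's PRM-based verification."""
    i = first_error_step(prm_scores, threshold)
    if i == -1:
        return list(steps)  # nothing to fix
    return list(steps[:i]) + [corrected_step]

steps = ["set up equation", "isolate x", "drop the sign", "conclude"]
scores = [0.9, 0.8, 0.2, 0.6]
print(salvage(steps, scores, "keep the sign"))
# ['set up equation', 'isolate x', 'keep the sign']
```

The policy model would then continue generation on-policy from the corrected prefix, which is what keeps the guidance close to its own distribution.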
[NLP-13] ARGUS: Seeing the Influence of Narrative Features on Persuasion in Argumentative Texts
【Quick Read】: This paper asks whether narratives make arguments more persuasive and which narrative features matter most, focusing on the role of narration in online, unstructured argumentation. The key to the solution is the ARGUS framework, which introduces a new ChangeMyView corpus annotated for story presence and six key narrative features, integrates two established theoretical frameworks to capture both textual narrative features and their effects on recipients, and uses encoder-based classifiers and zero-shot large language models (LLMs) to identify stories and narrative features, enabling a large-scale analysis of how different narrative dimensions influence persuasion success in online argumentation.
Link: https://arxiv.org/abs/2602.24109
Authors: Sara Nabhani, Federico Pianzola, Khalid Al-Khatib, Malvina Nissim
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 8 figures, submitted to ACM Transactions on Intelligent Systems and Technology
Abstract:Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, unstructured argumentation remains underexplored. To address this gap, we present ARGUS, a framework for studying the impact of narration on persuasion in argumentative discourse. ARGUS introduces a new ChangeMyView corpus annotated for story presence and six key narrative features, integrating insights from two established theoretical frameworks that capture both textual narrative features and their effects on recipients. Leveraging both encoder-based classifiers and zero-shot large language models (LLMs), ARGUS identifies stories and narrative features and applies them at scale to examine how different narrative dimensions influence persuasion success in online argumentation.
[NLP-14] Preference Packing: Efficient Preference Optimization for Large Language Models
【Quick Read】: This paper addresses the resource inefficiency of training large language models (LLMs) on data in which the same input prompt is paired with multiple responses, as in reward models or Direct Preference Optimization (DPO). The key to the solution is preference packing, which merges preference samples sharing the same prompt, reducing attention operations over duplicated prompt tokens and lowering KV-cache memory usage without changing the training logic, thereby significantly improving training efficiency. Experiments on text-only and image-included datasets show at least a 37% reduction in training time, and the method composes with existing optimizations such as batch sorting, yielding a 3.22x speedup.
Link: https://arxiv.org/abs/2602.24082
Authors: Jaekyung Cho
Affiliations: AWS GenAI Innovation Center
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a method to enhance resource efficiency in training techniques that use data with different responses for the same input prompt, such as reward models or Direct Preference Optimization (DPO). Preference packing improves resource efficiency by reducing the attention operations for duplicate input prompts and decreasing KV cache memory usage. We conducted experiments on text-only datasets and image-included datasets and achieved at least 37% reduction in training time. Notably, this method can be applied alongside existing optimization techniques such as batch sorting, resulting in a 3.22x speedup.
[NLP-15] SongSong: A Time Phonograph for Chinese SongCi Music from Thousands of Years Away AAAI2025
【Quick Read】: This paper addresses the difficulty current music-generation models have in reproducing ancient music with distinctive rhythms and styles, such as ancient Chinese SongCi music: existing models focus on modern pop songs and lack accurate modeling of the structure, prosody, and cultural context of ancient music. The key to the solution is the SongSong model, whose pipeline first predicts a melody from the input SongCi text, then separately generates the singing voice and the accompaniment from that melody, and finally combines all three into a complete piece. To offset the scarcity of ancient-music data, the authors also build OpenSongSong, a dataset of 29.9 hours of high-quality compositions, substantially improving the expressiveness and fidelity of SongCi music generation.
Link: https://arxiv.org/abs/2602.24071
Authors: Jiajia Li, Jiliang Hu, Ziyi Pan, Chong Chen, Zuchao Li, Ping Wang, Lefei Zhang
Affiliations: 1. University of Science and Technology of China; 2. Tsinghua University; 3. Peking University; 4. Shanghai Jiao Tong University
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, accepted by AAAI 2025
Abstract:Recently, there have been significant advancements in music generation. However, existing models primarily focus on creating modern pop songs, making it challenging to produce ancient music with distinct rhythms and styles, such as ancient Chinese SongCi. In this paper, we introduce SongSong, the first music generation model capable of restoring Chinese SongCi to our knowledge. Our model first predicts the melody from the input SongCi, then separately generates the singing voice and accompaniment based on that melody, and finally combines all elements to create the final piece of music. Additionally, to address the lack of ancient music datasets, we create OpenSongSong, a comprehensive dataset of ancient Chinese SongCi music, featuring 29.9 hours of compositions by various renowned SongCi music masters. To assess SongSong’s proficiency in performing SongCi, we randomly select 85 SongCi sentences that were not part of the training set for evaluation against SongSong and music generation platforms such as Suno and SkyMusic. The subjective and objective outcomes indicate that our proposed model achieves leading performance in generating high-quality SongCi music.
[NLP-16] Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis PAKDD2026
【Quick Read】: This paper tests the assumption that reasoning in generative AI universally improves performance on language tasks. A systematic evaluation of 504 model configurations (seven model families and multiple reasoning architectures) finds that the effect of reasoning is strongly task-dependent and can even degrade performance on simple tasks, challenging the prevailing view. The key to the solution is large-scale experimentation on sentiment classification at several granularities (binary, five-class, and 27-class emotion recognition), combining quantitative analysis (F1 deltas, Pareto frontier comparisons) with qualitative error analysis. The results expose an efficiency-performance trade-off that depends on task complexity: reasoning is justified only for high-complexity tasks such as 27-class emotion recognition, whereas on simple tasks systematic over-deliberation degrades performance.
Link: https://arxiv.org/abs/2602.24060
Authors: Donghao Huang, Zhaoxia Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 1 figure, 3 tables. Accepted at PAKDD 2026
Abstract:Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families–including adaptive, conditional, and reinforcement learning-based reasoning architectures–on sentiment analysis datasets of varying granularity (binary, five-class, and 27-class emotion). Our findings reveal that reasoning effectiveness is strongly task-dependent, challenging prevailing assumptions: (1) Reasoning shows task-complexity dependence–binary classification degrades up to -19.9 F1 percentage points (pp), while 27-class emotion recognition gains up to +16.0pp; (2) Distilled reasoning variants underperform base models by 3-18 pp on simpler tasks, though few-shot prompting enables partial recovery; (3) Few-shot learning improves over zero-shot in most cases regardless of model type, with gains varying by architecture and task complexity; (4) Pareto frontier analysis shows base models dominate efficiency-performance trade-offs, with reasoning justified only for complex emotion recognition despite 2.1x-54x computational overhead. We complement these quantitative findings with qualitative error analysis revealing that reasoning degrades simpler tasks through systematic over-deliberation, offering mechanistic insight beyond the high-level overthinking hypothesis.
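The Pareto frontier analysis over (computational overhead, F1) pairs can be sketched as follows; the configuration values are made up for illustration, not taken from the paper.

```python
def pareto_frontier(points):
    """Points are (cost, score) pairs. Keep those not dominated by any other
    point: a point is dominated if some competitor has lower-or-equal cost
    with strictly higher score, or strictly lower cost with equal-or-higher
    score."""
    frontier = []
    for c, s in points:
        dominated = any(
            (c2 <= c and s2 > s) or (c2 < c and s2 >= s)
            for c2, s2 in points
        )
        if not dominated:
            frontier.append((c, s))
    return sorted(frontier)

# Hypothetical (compute overhead, F1) pairs for base vs. reasoning configs.
configs = [(1.0, 0.82), (2.1, 0.80), (5.0, 0.90), (54.0, 0.89)]
print(pareto_frontier(configs))  # [(1.0, 0.82), (5.0, 0.9)]
```

A configuration off the frontier, like the (2.1, 0.80) one here, pays more compute for less accuracy than some alternative and can be discarded outright.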
[NLP-17] Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
【Quick Read】: This paper addresses the complex caching and scheduling challenges of concurrently hosting hundreds of large language model (LLM) adapters in large-scale distributed serving systems, in particular how to maximize throughput and resource efficiency while avoiding request starvation and GPU memory errors. The key to the solution is a data-driven optimization pipeline with three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving that emulates real system dynamics with high fidelity (below 5% throughput-prediction error); (ii) lightweight machine learning (ML) models trained on DT-generated data for fast performance estimation with marginal accuracy loss; and (iii) a greedy placement algorithm that uses the ML predictions to minimize the number of GPUs required under these constraints. The pipeline substantially improves GPU efficiency and is flexible enough to target other objectives, such as latency minimization.
Link: https://arxiv.org/abs/2602.24044
Authors: Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
Affiliations: Barcelona Supercomputing Center; IBM; Universitat Politècnica de Catalunya
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: journal extension of the workshop paper titled "A data-driven ml approach for maximizing performance in llm-adapter serving"
Abstract:Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through throughput maximization remains underexplored. This paper presents a data-driven pipeline that, for a given workload, computes an adapter placement that serves the workload with the minimum number of GPUs while avoiding request starvation and GPU memory errors. To that end, the approach identifies the maximum feasible throughput attainable on each GPU by leveraging accurate performance predictions learned from real serving behavior. The proposed pipeline integrates three components: (i) a Digital Twin (DT) tailored to LLM-adapter serving, (ii) a distilled machine learning (ML) model trained on DT-generated data, and (iii) a greedy placement algorithm that exploits ML-based performance estimates to maximize GPU efficiency. The DT emulates real system dynamics with high fidelity, achieving below 5% throughput estimation error while executing up to 90 times faster than full LLM benchmarking across both predictable and unpredictable workloads. The learned ML models further accelerate performance estimation with marginal accuracy degradation, enabling scalable optimization. Experimental results demonstrate that the pipeline substantially improves GPU efficiency by reducing the number of GPUs required to sustain target workloads. Beyond GPU efficiency, the pipeline can be adapted to alternative objectives, such as latency minimization, highlighting its versatility for future large-scale LLM serving infrastructures.
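The greedy placement step can be sketched as a first-fit-decreasing bin packing of adapter loads onto GPUs, where each GPU's capacity stands in for the maximum feasible throughput predicted by the learned model. The adapter loads, capacity, and function names below are hypothetical; the paper's actual algorithm also accounts for memory and starvation constraints.

```python
def greedy_placement(adapter_loads, gpu_capacity):
    """First-fit-decreasing assignment of adapter request rates to GPUs.

    gpu_capacity is the maximum sustainable throughput (here a single
    scalar, predicted in the paper by the learned performance model);
    each adapter's load must fit without exceeding it."""
    gpus = []  # remaining capacity per GPU
    placement = {}
    for name, load in sorted(adapter_loads.items(), key=lambda kv: -kv[1]):
        for i, remaining in enumerate(gpus):
            if load <= remaining:
                gpus[i] -= load
                placement[name] = i
                break
        else:  # no existing GPU fits: open a new one
            gpus.append(gpu_capacity - load)
            placement[name] = len(gpus) - 1
    return placement, len(gpus)

loads = {"a": 60, "b": 50, "c": 40, "d": 30, "e": 20}
placement, n_gpus = greedy_placement(loads, gpu_capacity=100)
print(n_gpus)  # 2 GPUs suffice for a total load of 200
```

First-fit-decreasing is a classic heuristic for bin packing; sorting loads in descending order keeps large adapters from forcing extra GPUs late in the pass.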
[NLP-18] RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
【Quick Read】: This paper addresses the inefficiency and overoptimization that arise when reward models align large language models (LLMs) with human preferences via pointwise reward estimates that ignore epistemic uncertainty. The key to the solution is RewardUQ, a unified evaluation framework for systematically comparing uncertainty quantification (UQ) methods for reward models along standard metrics of accuracy and calibration, together with a ranking strategy that combines both for simplified comparison. The study reveals that model size and initialization have the most meaningful impact on performance, and the framework is released as a reproducible, open-source toolkit to support method development and downstream deployment.
Link: https://arxiv.org/abs/2602.24040
Authors: Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie, Ido Hakimi, Barna Pásztor, Andreas Krause
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty-guided active learning and mitigate reward overoptimization in LLM post-training. However, uncertainty-aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open-source framework as a Python package. Our code is available at this https URL.
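Calibration, one of the two axes such evaluations score, is commonly measured with expected calibration error (ECE). Below is a minimal from-scratch sketch; the binning scheme and toy data are assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average the
    |accuracy - mean confidence| gap over bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Perfectly calibrated toy case: 0.8-confidence predictions, 80% correct.
confs = [0.8] * 5
correct = [True, True, True, True, False]
print(round(expected_calibration_error(confs, correct), 6))  # 0.0
```

A well-calibrated reward model is one whose stated preference confidence matches its empirical accuracy, which is exactly the gap this metric averages.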
[NLP-19] Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
【Quick Read】: This paper addresses the problem that jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and hard to compare across studies. The key to the solution is JAILBREAK FOUNDRY (JBF), which uses a multi-agent workflow to translate jailbreak papers into executable modules evaluated immediately within a unified harness. The system has three core components: JBF-LIB for shared contracts and reusable utilities, JBF-FORGE for the automated paper-to-module translation, and JBF-EVAL for standardized evaluation. JBF reproduces attacks with high fidelity (a mean attack-success-rate deviation of +0.26 percentage points), cuts attack-specific implementation code by nearly half, and supports standardized evaluation across models and attacks, enabling living benchmarks that keep pace with a rapidly shifting security landscape.
Link: https://arxiv.org/abs/2602.24009
Authors: Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this gap via a multi-agent workflow to translate jailbreak papers into executable modules for immediate evaluation within a unified harness. JBF features three core components: (i) JBF-LIB for shared contracts and reusable utilities; (ii) JBF-FORGE for the multi-agent paper-to-module translation; and (iii) JBF-EVAL for standardizing evaluations. Across 30 reproduced attacks, JBF achieves high fidelity with a mean (reproduced-reported) attack success rate (ASR) deviation of +0.26 percentage points. By leveraging shared infrastructure, JBF reduces attack-specific implementation code by nearly half relative to original repositories and achieves an 82.5% mean reused-code ratio. This system enables a standardized AdvBench evaluation of all 30 attacks across 10 victim models using a consistent GPT-4o judge. By automating both attack integration and standardized evaluation, JBF offers a scalable solution for creating living benchmarks that keep pace with the rapidly shifting security landscape.
[NLP-20] Dialect and Gender Bias in YouTube's Spanish Captioning System
【Quick Read】: This paper asks whether YouTube's automatic captioning system is biased against certain Spanish dialects, degrading the experience of users from particular regions. The key to the solution is comparing the quality of captions generated for Spanish speakers of different genders from different regions to identify systematic performance disparities across dialects, thereby revealing potential algorithmic bias in the face of linguistic diversity and underscoring that algorithmic technologies deployed on digital platforms must be calibrated to the diverse needs of their user populations.
Link: https://arxiv.org/abs/2602.24002
Authors: Iris Dania Jimenez, Christoph Kern
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 4 tables
Abstract:Spanish is the official language of twenty-one countries and is spoken by over 441 million people. Naturally, there are many variations in how Spanish is spoken across these countries. Media platforms such as YouTube rely on automatic speech recognition systems to make their content accessible to different groups of users. However, YouTube offers only one option for automatically generating captions in Spanish. This raises the question: could this captioning system be biased against certain Spanish dialects? This study examines the potential biases in YouTube’s automatic captioning system by analyzing its performance across various Spanish dialects. By comparing the quality of captions for female and male speakers from different regions, we identify systematic disparities which can be attributed to specific dialects. Our study provides further evidence that algorithmic technologies deployed on digital platforms need to be calibrated to the diverse needs and experiences of their user populations.
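Disparity analyses of this kind typically rest on word error rate (WER) computed per demographic group. Below is a self-contained sketch with hypothetical caption pairs; the paper's actual metric choices and data are not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein edit distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[-1][-1] / len(ref)

# Hypothetical (reference, ASR caption) pairs keyed by (region, gender).
samples = {
    ("Mexico", "F"): ("hola como estas hoy", "hola como estas hoy"),
    ("Chile", "M"): ("hola como estas hoy", "hola como esta"),
}
rates = {g: word_error_rate(r, h) for g, (r, h) in samples.items()}
print(rates)
```

Aggregating such per-group rates over many utterances is what allows a disparity between dialects or genders to be stated quantitatively.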
[NLP-21] The GRADIEND Python Package: An End-to-End System for Gradient-Based Feature Learning
【Quick Read】: This paper addresses how to learn feature directions from language models so that internal representations can be intervened on and interpreted in a controlled way. The core challenge is extracting semantically relevant feature directions from factual-counterfactual masked language model (MLM) and causal language model (CLM) gradients and applying them to model rewriting and multi-feature comparison. The key to the solution is gradiend, an open-source Python package implementing the GRADIEND method, which provides a unified workflow covering feature-related data creation, training, evaluation, visualization, persistent model rewriting via controlled weight updates, and multi-feature comparison, making feature-direction learning considerably more practical.
Link: https://arxiv.org/abs/2602.23993
Authors: Jonathan Drechsel, Steffen Herbold
Affiliations: University of Passau
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present gradiend, an open-source Python package that operationalizes the GRADIEND method for learning feature directions from factual-counterfactual MLM and CLM gradients in language models. The package provides a unified workflow for feature-related data creation, training, evaluation, visualization, persistent model rewriting via controlled weight updates, and multi-feature comparison. We demonstrate GRADIEND on an English pronoun paradigm and on a large-scale feature comparison that reproduces prior use cases.
[NLP-22] MemEmo: Evaluating Emotion in Memory Systems of Agents
【Quick Read】: This paper addresses the inadequate handling of emotional information by memory systems for large language models (LLMs), which are meant to counter context loss during prolonged interactions but whose efficacy at human-like processing of affective information remains unclear, with no unified benchmark available. The authors propose an emotion-enhanced memory evaluation benchmark and build the Human-Like Memory Emotion (HLME) dataset, systematically evaluating mainstream and state-of-the-art memory systems along three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Experiments show that none of the evaluated systems achieve robust performance across all three tasks, exposing significant deficiencies in how current memory mechanisms handle emotional memories and pointing to a new direction for future research.
Link: https://arxiv.org/abs/2602.23944
Authors: Peng Liu, Zhen Tao, Jihao Zhao, Ding Chen, Yansong Zhang, Cuiping Li, Zhiyu Li, Hong Chen
Affiliations: Renmin University of China; China Telecom Research Institute; MemTensor (Shanghai) Technology
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Memory systems address the challenge of context loss in Large Language Models during prolonged interactions. However, compared to human cognition, the efficacy of these systems in processing emotion-related information remains inconclusive. To address this gap, we propose an emotion-enhanced memory evaluation benchmark to assess the performance of mainstream and state-of-the-art memory systems in handling affective information. We developed the Human-Like Memory Emotion (HLME) dataset, which evaluates memory systems across three dimensions: emotional information extraction, emotional memory updating, and emotional memory question answering. Experimental results indicate that none of the evaluated systems achieve robust performance across all three tasks. Our findings provide an objective perspective on the current deficiencies of memory systems in processing emotional memories and suggest a new trajectory for future research and system optimization.
[NLP-23] Benchmarking BERT-based Models for Sentence-level Topic Classification in Nepali Language WWW
【Quick Read】: This paper addresses the underperformance of natural language processing (NLP) models for Nepali, a low-resource language, and in particular the absence of a benchmark for Nepali topic classification. The key to the solution is a systematic evaluation of pre-trained language models, both multilingual (e.g., mBERT, XLM-R) and Indic-specific (e.g., MuRIL, IndicBERT, DevBERT, NepBERTa), fine-tuned and tested on a balanced dataset of 25,006 sentences. Indic models, particularly MuRIL-large, achieve the best F1-score of 90.60%, significantly outperforming the alternatives, establishing a reliable baseline for Nepali document-level classification and advancing Nepali NLP applications.
Link: https://arxiv.org/abs/2602.23940
Authors: Nischal Karki, Bipesh Subedi, Prakash Poudyal, Rupak Raj Ghimire, Bal Krishna Bal
Affiliations: Kathmandu University; Gauhati University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures. Accepted and presented at the Regional International Conference on Natural Language Processing (RegICON 2025), Gauhati University, Guwahati, India, November 27-29, 2025. To appear in the conference proceedings. Accepted papers list available at: this https URL
Abstract:Transformer-based models such as BERT have significantly advanced Natural Language Processing (NLP) across many languages. However, Nepali, a low-resource language written in Devanagari script, remains relatively underexplored. This study benchmarks multilingual, Indic, Hindi, and Nepali BERT variants to evaluate their effectiveness in Nepali topic classification. Ten pre-trained models, including mBERT, XLM-R, MuRIL, DevBERT, HindiBERT, IndicBERT, and NepBERTa, were fine-tuned and tested on the balanced Nepali dataset containing 25,006 sentences across five conceptual domains and the performance was evaluated using accuracy, weighted precision, recall, F1-score, and AUROC metrics. The results reveal that Indic models, particularly MuRIL-large, achieved the highest F1-score of 90.60%, outperforming multilingual and monolingual models. NepBERTa also performed competitively with an F1-score of 88.26%. Overall, these findings establish a robust baseline for future document-level classification and broader Nepali NLP applications.
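The weighted F1-score reported in benchmarks like this one is the per-class F1 averaged with class-support weights. A from-scratch sketch on toy labels (the class names and data are hypothetical):

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class-support weights."""
    labels = sorted(set(y_true))
    total, score = len(y_true), 0.0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (sum(1 for t in y_true if t == c) / total) * f1
    return score

# Toy labels; real evaluations use the five Nepali topic classes.
y_true = ["politics", "sports", "sports", "health"]
y_pred = ["politics", "sports", "health", "health"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

In practice one would call scikit-learn's `f1_score(..., average="weighted")`, which computes the same quantity; the explicit version above shows what the weighting does.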
[NLP-24] The Astonishing Ability of Large Language Models to Parse Jabberwockified Language
【速读】: 该论文旨在解决语言模型在面对严重退化文本(如内容词被随机替换为无意义字符串)时,仍能恢复语义的能力问题。其核心解决方案在于利用大型语言模型(Large Language Models, LLMs)对结构线索(如形态句法和封闭类词)的强依赖性,从而在词汇信息缺失的情况下,通过语法结构与世界知识的紧密整合实现语义重建。实验表明,LLMs能够将“Jabberwockified”英文文本转化为接近原始含义的可理解句子,揭示了语言处理中句法、词汇语义与常识知识之间高度协同的作用机制。
链接: https://arxiv.org/abs/2602.23928
作者: Gary Lupyan,Senyi Yang
机构: University of Wisconsin-Madison(威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL)
备注: Submitted to the 2026 Annual Meeting of the Cognitive Science Society
Abstract:We show that large language models (LLMs) have an astonishing ability to recover meaning from severely degraded English texts. Texts in which content words have been randomly substituted by nonsense strings, e.g., “At the ghybe of the swuint, we are haiveed to Wourge Phrear-gwurr, who sproles into an ghitch flount with his crurp”, can be translated to conventional English that is, in many cases, close to the original text, e.g., “At the start of the story, we meet a man, Chow, who moves into an apartment building with his wife.” These results show that structural cues (e.g., morphosyntax, closed-class words) constrain lexical meaning to a much larger degree than imagined. Although the abilities of LLMs to make sense of “Jabberwockified” English are clearly superhuman, they are highly relevant to understanding linguistic structure and suggest that efficient language processing either in biological or artificial systems likely benefits from very tight integration between syntax, lexical semantics, and general world knowledge.
[NLP-25] Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks ICLR2026
【速读】: 该论文旨在解决当前 referring expression comprehension (REC) 基准测试(如 RefCOCO、RefCOCO+ 和 RefCOCOg)中存在的“捷径依赖”问题,即模型往往通过简单线索(如短表达式、少干扰项或冗余描述)即可完成任务,而无需真正理解语言与视觉之间的复杂对应关系。解决方案的关键在于引入 Ref-Adv 数据集,该数据集通过精心设计的语言表达与视觉信息匹配机制,仅提供唯一确定目标对象所必需的信息,并引入具有挑战性的干扰项和推理维度(如否定句式),从而抑制模型对表面特征的依赖。实验表明,尽管主流多模态大语言模型(Multimodal LLMs)在传统基准上表现优异,但在 Ref-Adv 上性能显著下降,揭示了其在视觉推理和定位能力上的不足,为未来研究提供了更严谨的评估标准和改进方向。
链接: https://arxiv.org/abs/2602.23898
作者: Qihua Dong,Kuo Yang,Lin Ju,Handong Zhao,Yitian Zhang,Yizhou Wang,Huimin Zeng,Jianglin Lu,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ICLR 2026
Abstract:Referring Expression Comprehension (REC) links language to region level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
[NLP-26] LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)中自回归大语言模型(LLM)推理速度受限的问题,特别是针对推测解码(Speculative Decoding)技术中因轻量级草稿模型(draft model)容量有限而导致的接受率(acceptance rate)偏低问题。标准训练方法以最小化KL散度为目标,虽与最优接受率共享全局最优解,但在小模型场景下易陷入次优解,无法有效提升接受率。解决方案的关键在于提出LK损失函数(LK losses),这是一种直接优化接受率的新型训练目标,无需额外计算开销即可显著提升接受率指标,在多种架构和规模的模型配置下均表现出一致且稳定的性能提升,为现有草稿模型训练框架提供了一种高效、可直接集成的替代方案。
链接: https://arxiv.org/abs/2602.23881
作者: Alexander Samarin,Sergei Krutikov,Anton Shevtsov,Sergei Skvortsov,Filipp Fisin,Alexander Golubev
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.
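附注:标准推测采样(speculative sampling)的单步期望接受率有闭式表达 sum_x min(p(x), q(x)),这正是 LK 损失直接优化、而 KL 散度仅作为代理的量。下面用 NumPy 给出一个玩具示例(分布数值为本文假设,非论文数据),展示 KL 与接受率的排序可以不一致:

```python
import numpy as np

def acceptance_rate(p, q):
    """Expected per-token acceptance rate of standard speculative
    sampling: sum_x min(p(x), q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.minimum(p, q).sum())

def kl(p, q):
    """KL(p || q), the usual proxy objective for draft-model training
    (assumes p, q strictly positive, as in this toy example)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p  = [0.50, 0.25, 0.25]   # target distribution
qa = [0.34, 0.33, 0.33]   # draft A: lower KL
qb = [0.60, 0.30, 0.10]   # draft B: higher KL, yet higher acceptance

assert kl(p, qa) < kl(p, qb)                            # KL prefers A
assert acceptance_rate(p, qb) > acceptance_rate(p, qa)  # acceptance prefers B
print(round(acceptance_rate(p, qa), 2), round(acceptance_rate(p, qb), 2))  # 0.84 0.85
```

可见在草稿模型容量受限时,最小化 KL 并不必然最大化接受率,这正是该论文改换训练目标的动机。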
[NLP-27] SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
【速读】: 该论文旨在解决软件工程代理(Software Engineering Agents, SWE)训练中面临的高质量、大规模任务数据稀缺问题,尤其是受限于缺乏可复现的执行环境和可靠测试套件的挑战。现有基准数据集在规模、语言多样性或资源支持上存在局限,难以支撑多语言生态下的通用SWE模型训练。其解决方案的关键在于提出SWE-rebench V2——一个语言无关的自动化流水线,能够从真实开源仓库中收集可执行的SWE任务,并通过交互式设置代理合成特定仓库的安装与测试流程,同时利用大语言模型(Large Language Models, LLMs)集成判别器过滤无效实例,确保数据质量。该方法构建了包含32,000+个跨20种编程语言和3,600+个仓库的任务数据集,并额外释放超过120,000个带安装说明、失败到通过测试及丰富元数据的任务,显著提升了训练数据的规模与多样性,为SWE代理的大规模跨语言训练提供了坚实基础。
链接: https://arxiv.org/abs/2602.23866
作者: Ibragim Badertdinov,Maksim Nekrashevich,Anton Shevtsov,Alexander Golubev
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.
[NLP-28] NAU-QMUL: Utilizing BERT and CLIP for Multi-modal AI-Generated Image Detection
【速读】: 该论文旨在解决AI生成图像的检测问题,并进一步识别生成图像所使用的具体模型(即模型溯源)。其解决方案的关键在于提出了一种多模态多任务学习框架,该框架融合了预训练的BERT文本编码器与CLIP视觉编码器提取图文特征,并通过定制化的多任务损失函数实现跨模态特征融合;同时引入基于伪标签的数据增强策略,以高置信度样本扩充训练数据集,从而提升模型在真实场景下的检测性能和模型溯源能力。
链接: https://arxiv.org/abs/2602.23863
作者: Xiaoyu Guo,Arkaitz Zubiaga
机构: Nanjing Audit University (南京审计大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:With the aim of detecting AI-generated images and identifying the specific models responsible for their generation, we propose a multi-modal multi-task model. The model leverages pre-trained BERT and CLIP Vision encoders for text and image feature extraction, respectively, and employs cross-modal feature fusion with a tailored multi-task loss function. Additionally, a pseudo-labeling-based data augmentation strategy was utilized to expand the training dataset with high-confidence samples. The model achieved fifth place in both Tasks A and B of the `CT2: AI-Generated Image Detection’ competition, with F1 scores of 83.16% and 48.88%, respectively. These findings highlight the effectiveness of the proposed architecture and its potential for advancing AI-generated content detection in real-world scenarios. The source code for our method is published on this https URL.
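附注:摘要提到的"基于伪标签的高置信度样本扩充"是半监督学习的常见做法:仅保留预测概率超过阈值的未标注样本并以其预测类别作为伪标签。下面是一个最小示意(阈值 0.9 为本文假设值,非论文设定):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only samples whose top predicted class probability reaches
    `threshold`; return (kept indices, hard pseudo-labels)."""
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)                  # per-sample confidence
    keep = np.where(conf >= threshold)[0]     # high-confidence subset
    return keep, probs[keep].argmax(axis=1)   # indices and pseudo-labels

probs = [[0.95, 0.05],   # confident -> kept, label 0
         [0.60, 0.40],   # ambiguous -> dropped
         [0.08, 0.92]]   # confident -> kept, label 1
idx, labels = select_pseudo_labels(probs)
print(idx.tolist(), labels.tolist())  # [0, 2] [0, 1]
```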
[NLP-29] CLFEC: A New Task for Unified Linguistic and Factual Error Correction in Paragraph-level Chinese Professional Writing
【速读】: 该论文旨在解决中文专业写作中语言错误(如拼写、语法、标点)与事实性错误常共现且相互影响的问题,传统方法通常将二者分离处理,难以实现高效统一修正。其解决方案的关键在于提出一个新任务CLFEC(Chinese Linguistic and Factual Error Correction),构建覆盖时政、金融、法律和医学等多领域的混合型中文专业写作数据集,并系统评估基于大语言模型(LLM)的多种校正范式,包括提示工程、检索增强生成(RAG)及代理工作流(agentic workflows)。实证表明,在同一上下文中联合处理语言与事实错误优于解耦流程,且合适的骨干模型配合代理工作流可有效提升校正效果,从而为工业场景下可靠、全自动的校对系统提供实践指导。
链接: https://arxiv.org/abs/2602.23845
作者: Jian Kai,Zidong Zhang,Jiwen Chen,Zhengxiang Wu,Songtao Sun,Fuyang Li,Yang Cao,Qiang Liu
机构: Huazhong University of Science and Technology (华中科技大学); WPS AI, Kingsoft Office (金山办公)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, making unified correction both necessary and challenging. This paper introduces CLFEC (Chinese Linguistic Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual errors within the same context outperforms decoupled processes, and that agentic workflows can be effective with suitable backbone models. Overall, our dataset and empirical findings provide guidance for building reliable, fully automatic proofreading systems in industrial settings.
[NLP-30] GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models
【速读】: 该论文旨在解决当前神经网络可解释性工具对现代Transformer语言模型中新型激活函数(如SwiGLU)支持不足的问题。传统工具多聚焦于早期模型,难以捕捉门控机制下激活信号的复杂性——即门控(gate)和内部激活(in activation)可能同时为正或负,形成四种符号组合,每种组合在功能上可能差异显著。解决方案的关键在于开发GLUScope这一开源分析工具,能够系统展示每个神经元在四种符号组合下的文本示例,并量化各类组合的发生频率,从而帮助研究人员更全面地理解神经元的功能分化与行为模式。
链接: https://arxiv.org/abs/2602.23826
作者: Sebastian Gerstner,Hinrich Schütze
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages for main body, 9 pages in total. 4 figures
Abstract:We present GLUScope, an open-source tool for analyzing neurons in Transformer-based language models, intended for interpretability researchers. We focus on more recent models than previous tools do; specifically we consider gated activation functions such as SwiGLU. This introduces a new challenge: understanding positive activations is not enough. Instead, both the gate and the in activation of a neuron can be positive or negative, leading to four different possible sign combinations that in some cases have quite different functionalities. Accordingly, for any neuron, our tool shows text examples for each of the four sign combinations, and indicates how often each combination occurs. We describe examples of how our tool can lead to novel insights. A demo is available at https://sjgerstner.this http URL.
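附注:GLUScope 关注的"gate 与 in 激活的四种符号组合"可直接由 SwiGLU 的定义推出:神经元输出为 SiLU(gate)·in,而 SiLU 对负输入给出(幅度较小的)负输出,因此输出符号等于 gate 与 in 的符号之积。下面的数值示意与该工具自身实现无关:

```python
import numpy as np

def silu(x):
    """SiLU: x * sigmoid(x); negative inputs yield small negative outputs."""
    return x / (1.0 + np.exp(-x))

def swiglu_activation(gate, inner):
    """A single SwiGLU neuron's output contribution: SiLU(gate) * in."""
    return silu(gate) * inner

def sign_combination(gate, inner):
    """One of the four gate/in sign combinations tracked per neuron."""
    return ("gate+" if gate >= 0 else "gate-") + ("/in+" if inner >= 0 else "/in-")

# The four combinations produce outputs of all four sign/magnitude patterns:
for g, h in [(2.0, 1.0), (2.0, -1.0), (-2.0, 1.0), (-2.0, -1.0)]:
    print(sign_combination(g, h), round(swiglu_activation(g, h), 3))
```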
[NLP-31] Divide and Conquer: Accelerating Diffusion-Based Large Language Models via Adaptive Parallel Decoding
【速读】: 该论文旨在解决扩散型大语言模型(Diffusion-based Large Language Models, dLLMs)理论上的并行生成能力与其实际应用中仍采用逐token生成方式之间的差距问题。尽管dLLMs在理论上支持每解码步骤并行生成多个token,但直接对多个掩码token进行并行解码会导致生成质量下降和稳定性不足。为弥合这一差距,论文提出了一种自适应并行解码方法DiCo,其核心在于采用三阶段“分治”范式:首先在Divide阶段识别种子token并构建局部簇;接着在Conquer阶段并行解码不同局部簇中的token;随后重复交替执行Divide与Conquer直至收敛;最后在Finalize阶段通过细粒度复合解码策略完成剩余掩码token的生成。该方案有效释放了dLLMs的内在并行性,在显著提升推理速度的同时保持了高质量的生成性能。
链接: https://arxiv.org/abs/2602.23792
作者: Xiangzhong Luo,Yilin An,Zhicheng Yu,Weichen Liu,Xu Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 7 figures
Abstract:Diffusion-based large language models (dLLMs) have shown promising performance across various reasoning tasks, establishing themselves as an alternative to autoregressive large language models (LLMs). Unlike autoregressive LLMs that generate one token per step based on all previous tokens, dLLMs theoretically enable parallel generation of multiple tokens at each decoding step. However, recent dLLMs still favor one-token-per-step generation in practice, as directly decoding multiple masked tokens often leads to degraded generation quality and stability. This reveals a substantial gap between the theoretical parallelism and practical performance of dLLMs. To bridge this gap, we introduce an adaptive parallel decoding approach, namely DiCo, which features a three-phase divide-and-conquer paradigm to unleash the inherent parallelism of dLLMs. During the Divide phase, DiCo first explores the input masked sequence and identifies masked tokens as seed tokens, which are then expanded to construct a set of local clusters. During the Conquer phase, DiCo performs parallel decoding across different local clusters constructed in the Divide phase. The divide-and-conquer process repeatedly alternates between the Divide and Conquer phases until convergence. During the Finalize phase, DiCo decodes the remaining few masked tokens using an effective fine-grained compound decoding scheme to finalize the generation. Extensive experiments demonstrate that DiCo can achieve significant inference speedups while maintaining competitive generation quality.
[NLP-32] Structured Prompt Optimization for Few-Shot Text Classification via Semantic Alignment in Latent Space
【速读】: 该论文旨在解决少样本文本分类中存在的语义纠缠(semantic entanglement)、标签结构不清晰(unclear label structure)以及特征表示不足(insufficient feature representation)等问题。其解决方案的关键在于提出一种基于结构化提示(structured prompts)的优化框架:首先利用预训练语言模型获取基础语义表示,随后设计由多维语义因子构成的结构化提示,并通过可学习的融合机制将其与文本特征结合,从而在潜在空间中形成边界清晰的任务相关表示;同时构建结构化的标签嵌入矩阵并引入跨空间对齐机制,增强文本特征与标签语义之间的一致性;此外,通过提示正交性约束和联合优化目标,确保不同语义因子间的独立性,使提示能够为分类决策提供透明且可控的引导。
链接: https://arxiv.org/abs/2602.23753
作者: Jiasen Zheng,Zijun Zhou,Huajun Zhang,Junjiang Lin,Jingyun Jia,Qi Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This study addresses the issues of semantic entanglement, unclear label structure, and insufficient feature representation in few-shot text classification, and proposes an optimization framework based on structured prompts to enhance semantic understanding and task adaptation under low-resource conditions. The framework first uses a pretrained language model to encode the input text and obtain basic semantic representations. It then introduces structured prompts composed of multi-dimensional semantic factors and integrates them with text features through a learnable combination mechanism, which forms task-related representations with clear boundaries in the latent space. To further strengthen the consistency between text representations and label semantics, the method constructs a structured label embedding matrix and employs a cross-space alignment mechanism to ensure stable matching between textual features and label attributes. In addition, the model applies prompt orthogonality constraints and a joint optimization objective to maintain independence across different semantic factors in the prompts, allowing the structured prompts to provide transparent and controllable guidance for classification decisions. Three types of sensitivity experiments, including learning rate sensitivity, prompt length sensitivity, and data scale sensitivity, are designed to evaluate the stability and robustness of the framework under different conditions. Experimental results show that the proposed structured prompt optimization framework effectively alleviates semantic conflicts and label ambiguity in few-shot text classification. It significantly improves performance on accuracy, precision, recall, and AUC, and demonstrates strong cross-task applicability.
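附注:摘要中的"提示正交性约束"未给出具体形式;一种常见实现是对提示因子矩阵 P(每行一个语义因子)施加 Frobenius 惩罚 ||PPᵀ − I||²_F,鼓励各因子相互正交。以下仅为本文假设的示意,并非论文原式:

```python
import numpy as np

def orthogonality_penalty(P):
    """||P P^T - I||_F^2: zero iff the rows of P are orthonormal,
    growing as semantic factors become redundant."""
    P = np.asarray(P, dtype=float)
    G = P @ P.T                                   # Gram matrix of factors
    return float(np.sum((G - np.eye(P.shape[0])) ** 2))

independent = np.eye(2)                           # two orthonormal factors
redundant = np.array([[1.0, 0.0], [1.0, 0.0]])    # two identical factors
print(orthogonality_penalty(independent), orthogonality_penalty(redundant))  # 0.0 2.0
```

训练时将该惩罚加权后并入联合优化目标,即可抑制语义因子之间的纠缠。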
[NLP-33] UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking CVPR2026
【速读】: 该论文旨在解决基于单流Transformer的视觉目标跟踪器在计算效率上的瓶颈问题,即现有方法在进行token剪枝时通常孤立地处理搜索区域、动态模板和静态模板,忽略了三者之间的关键依赖关系,导致剪枝效果不佳且精度下降。其解决方案的关键在于提出一种统一的Token剪枝框架UTPTrack,首次实现了对这三个组件的联合压缩;该框架采用注意力引导、token类型感知的策略,全局建模冗余信息,从而在保持高精度的同时显著提升效率,实现在RGB和多模态场景下分别剪枝65.4%和67.5%的视觉token,同时保留99.7%和100.5%的基线性能。
链接: https://arxiv.org/abs/2602.23734
作者: Hao Wu,Xudong Wang,Jialiang Zhang,Junlong Tong,Xinghao Chen,Junyan Lin,Yunpu Ma,Xiaoyu Shen
机构: Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Shanghai Jiao Tong University; The Hong Kong Polytechnic University; Munich Center for Machine Learning, LMU Munich
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to CVPR 2026
Abstract:One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at this https URL.
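附注:"注意力引导的 token 剪枝"的核心思路可示意如下:按每个搜索区 token 接收到的平均注意力打分,保留得分最高的一部分并维持原顺序。打分规则与保留比例均为本文示意性假设;UTPTrack 实际还区分 token 类型,并对搜索区、动态模板与静态模板联合剪枝:

```python
import numpy as np

def prune_search_tokens(tokens, attn, keep_ratio=0.35):
    """Score each search-region token by the mean attention it receives
    (rows of `attn` = attending tokens, e.g. template tokens), then keep
    the top `keep_ratio` fraction in original order."""
    scores = np.asarray(attn, dtype=float).mean(axis=0)
    k = max(1, int(round(len(tokens) * keep_ratio)))
    keep = np.sort(np.argsort(scores)[::-1][:k])   # top-k, original order
    return [tokens[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
attn = [[0.40, 0.05, 0.30, 0.05, 0.10, 0.10],
        [0.20, 0.05, 0.50, 0.05, 0.10, 0.10]]
print(prune_search_tokens(tokens, attn))  # ['t0', 't2']
```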
[NLP-34] From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning ICLR2026
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)评估中依赖静态数据集所带来的局限性,即静态数据集难以体现模型推理能力的动态演进,且无法适应日益增强的模型性能。其解决方案的关键在于提出一种以智能体(agent)为中心的动态基准测试范式:通过引入一个由教师智能体生成候选问题、协调者智能体验证问题有效性并防御对抗攻击、学生智能体尝试解答问题的迭代协议,实现基准难度的自动递增。当学生正确解答时,协调者会促使教师生成更具挑战性的变体;若问题无效,则由教师修正直至通过验证。该机制使得评估过程随参与智能体能力提升而自适应进化,无需人工构建新数据集,从而系统性揭示传统基准难以暴露的逻辑推理缺陷,并支持多维度评估(如跨模型对比与问题难度变化)。
链接: https://arxiv.org/abs/2602.23729
作者: Seungdong Yoa,Sanghyu Yoon,Suhee Yoon,Dongmin Kim,Ye Seul Sim,Junhyun Lee,Woohyung Lim
机构: LG AI Research(LG人工智能研究); Hankuk University of Foreign Studies(韩国外国语大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ICLR 2026
Abstract:The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in difficulty automatically as more capable agents are substituted into any role, enabling progressive evaluation of large language models without manually curated datasets. Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors that conventional benchmarks fail to reveal. We further advocate evaluating systems along several complementary axes including cross-model pairwise performance and progress between the initial and orchestrator-finalized problems. By shifting the focus from fixed datasets to dynamic protocols, our approach offers a sustainable direction for evaluating ever-evolving language models and introduces a research agenda centered on the co-evolution of agent-centric benchmarks.
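附注:摘要描述的"教师生成 → 协调者验证 → 学生求解 → 难度递增"协议可以用如下桩代码(stub)勾勒;各代理的接口名(generate/revise/harder/validate/solve)均为本文假设,并非论文 API:

```python
def run_protocol(teacher, orchestrator, student, rounds=3):
    """Generate -> validate -> solve loop: invalid problems are revised
    until they pass validation; solved problems are escalated in
    difficulty; the loop stops once the student fails."""
    problem = teacher.generate()
    log = []
    for _ in range(rounds):
        while not orchestrator.validate(problem):  # revise until valid
            problem = teacher.revise(problem)
        solved = student.solve(problem)
        log.append((problem, solved))
        if not solved:
            break                                  # difficulty frontier found
        problem = teacher.harder(problem)          # escalate difficulty
    return log

# Deterministic stub agents for illustration only:
class Teacher:
    def generate(self): return {"difficulty": 1, "valid": False}
    def revise(self, p): return {**p, "valid": True}
    def harder(self, p): return {"difficulty": p["difficulty"] + 1, "valid": True}

class Orchestrator:
    def validate(self, p): return p["valid"]

class Student:
    def solve(self, p): return p["difficulty"] < 3  # fails at difficulty 3

log = run_protocol(Teacher(), Orchestrator(), Student(), rounds=5)
print([(p["difficulty"], ok) for p, ok in log])  # [(1, True), (2, True), (3, False)]
```

将任一角色替换为更强的代理,基准难度即可随之自动提升,这正是该协议无需人工构建新数据集的原因。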
[NLP-35] HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning and Early Exit ICLR2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中视觉标记(vision tokens)处理带来的二次计算复杂度问题,该问题限制了MLLMs的广泛应用。现有基于渐进式视觉标记剪枝的方法因误读浅层模块功能且采用固定剪枝策略,未能充分释放效率潜力。其解决方案的关键在于提出HiDrop框架,通过两个核心创新实现:(1) Late Injection——跳过被动的浅层处理,将视觉标记直接注入到真正开始融合的深层位置;(2) Concave Pyramid Pruning with Early Exit——在中层与深层动态调整剪枝率,并结合层间相似性度量和可微分top-k操作进行优化。此外,为保障实际效率,HiDrop引入持久位置编码、支持FlashAttention的标记选择机制以及视觉计算的并行解耦,从而消除动态剪枝带来的隐式开销。实验表明,HiDrop可在保持原始性能的同时压缩约90%的视觉标记,并使训练速度提升1.72倍。
链接: https://arxiv.org/abs/2602.23699
作者: Hao Wu,Yingqi Fan,Jinyang Dai,Junlong Tong,Yunpu Ma,Xiaoyu Shen
机构: Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; University of Science and Technology of China; Shanghai Jiao Tong University; Munich Center for Machine Learning, LMU Munich
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted to ICLR 2026
Abstract:The quadratic computational cost of processing vision tokens in Multimodal Large Language Models (MLLMs) hinders their widespread adoption. While progressive vision token pruning offers a promising solution, current methods misinterpret shallow layer functions and use rigid schedules, which fail to unlock the full efficiency potential. To address these issues, we propose HiDrop, a framework that aligns token pruning with the true hierarchical function of MLLM layers. HiDrop features two key innovations: (1) Late Injection, which bypasses passive shallow layers to introduce visual tokens exactly where active fusion begins; and (2) Concave Pyramid Pruning with an Early Exit mechanism to dynamically adjust pruning rates across middle and deep layers. This process is optimized via an inter-layer similarity measure and a differentiable top-k operator. To ensure practical efficiency, HiDrop further incorporates persistent positional encoding, FlashAttention-compatible token selection, and parallel decoupling of vision computation to eliminate hidden overhead associated with dynamic token reduction. Extensive experiments show that HiDrop compresses about 90% visual tokens while matching the original performance and accelerating training by 1.72 times. Our work not only sets a new state-of-the-art for efficient MLLM training and inference but also provides valuable insights into the hierarchical nature of multimodal fusion. The code is released at this https URL.
[NLP-36] TRIZ-RAGNER: A Retrieval-Augmented Large Language Model for TRIZ-Aware Named Entity Recognition in Patent-Based Contradiction Mining
【速读】: 该论文旨在解决专利文本中TRIZ矛盾挖掘(TRIZ-based contradiction mining)的难题,特别是传统规则系统和机器学习模型在处理复杂专利语言时面临的语义模糊性、领域依赖性和泛化能力不足的问题。其核心解决方案是提出TRIZ-RAGNER框架,该框架将矛盾挖掘重构为语义层面的命名实体识别(Named Entity Recognition, NER)任务,并通过引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,结合TRIZ知识库的密集检索、交叉编码器重排序以及结构化大语言模型(Large Language Model, LLM)提示技术,实现对改善参数与恶化参数的精准提取。关键创新在于将领域特定的TRIZ知识注入LLM推理过程,有效降低语义噪声并提升抽取一致性,实验表明该方法在PaTRIZ数据集上显著优于现有基线模型,F1分数提升达7.3个百分点。
链接: https://arxiv.org/abs/2602.23656
作者: Zitong Xu,Yuqing Wu,Yue Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:TRIZ-based contradiction mining is a fundamental task in patent analysis and systematic innovation, as it enables the identification of improving and worsening technical parameters that drive inventive problem solving. However, existing approaches largely rely on rule-based systems or traditional machine learning models, which struggle with semantic ambiguity, domain dependency, and limited generalization when processing complex patent language. Recently, large language models (LLMs) have shown strong semantic understanding capabilities, yet their direct application to TRIZ parameter extraction remains challenging due to hallucination and insufficient grounding in structured TRIZ knowledge. To address these limitations, this paper proposes TRIZ-RAGNER, a retrieval-augmented large language model framework for TRIZ-aware named entity recognition in patent-based contradiction mining. TRIZ-RAGNER reformulates contradiction mining as a semantic-level NER task and integrates dense retrieval over a TRIZ knowledge base, cross-encoder reranking for context refinement, and structured LLM prompting to extract improving and worsening parameters from patent sentences. By injecting domain-specific TRIZ knowledge into the LLM reasoning process, the proposed framework effectively reduces semantic noise and improves extraction consistency. Experiments on the PaTRIZ dataset demonstrate that TRIZ-RAGNER consistently outperforms traditional sequence labeling models and LLM-based baselines. The proposed framework achieves a precision of 85.6%, a recall of 82.9%, and an F1-score of 84.2% in TRIZ contradiction pair identification. Compared with the strongest baseline using prompt-enhanced GPT, TRIZ-RAGNER yields an absolute F1-score improvement of 7.3 percentage points, confirming the effectiveness of retrieval-augmented TRIZ knowledge grounding for robust and accurate patent-based contradiction mining.
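附注:TRIZ-RAGNER 的"检索 → 重排 → 结构化提示"流水线可用如下玩具示例勾勒。词袋余弦检索仅为稠密检索的简化替身,知识库条目与提示模板均为本文假设:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    num = sum(c * b.get(t, 0) for t, c in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, kb, k=2):
    """Rank TRIZ-parameter entries by similarity to the patent sentence."""
    q = Counter(query.lower().split())
    return sorted(kb, key=lambda d: cosine(q, Counter(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(sentence, evidence):
    """Assemble a structured extraction prompt from retrieved evidence."""
    context = "\n".join(f"- {e}" for e in evidence)
    return (f"TRIZ parameters (retrieved):\n{context}\n\n"
            f"Patent sentence: {sentence}\n"
            f"Extract the improving and worsening parameters as JSON.")

kb = ["13 stability of the object", "14 strength", "01 weight of moving object"]
hits = retrieve("increasing strength raises the weight of moving object", kb)
print(hits[0])  # 01 weight of moving object
```

实际系统中,检索结果还会经过交叉编码器重排后再注入 LLM 提示,以进一步降低语义噪声。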
[NLP-37] LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在真实任务导向对话中逻辑推理能力评估不足的问题。现有基准测试难以反映现实场景的复杂性,其数据往往过于简化、脱离实际任务流程与领域约束,且存在预训练语料污染导致评估结果不可靠,以及传统众包方法效率低、难扩展等局限。解决方案的关键在于提出一种基于LLM驱动的多轮任务导向对话合成框架,通过三层次优化机制(trilevel optimization)生成具有真实情境基础、富含现实信息且上下文连贯的对话数据,并围绕这些对话设计并迭代优化推理任务,从而构建高质量、可扩展的评测基准,有效提升LLM在真实逻辑推理场景下的能力。
链接: https://arxiv.org/abs/2602.23610
作者: Yu Zhu,Kai Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs’ logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose a LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks’ quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.
[NLP-38] BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
【速读】: 该论文旨在解决自动化评分系统中因模型偏见放大(bias amplification)而导致的公平性问题,尤其关注英语学习者(English Language Learners, ELLs)等代表性不足群体在生成式 AI (Generative AI) 评分中被系统性低估的现象。其核心问题是:由于训练数据中高分ELL样本稀缺,基于经验风险最小化的模型倾向于学习多数群体(非ELL)的语言模式,从而对使用不同语言表达但具备同等学科知识的ELL学生做出偏低预测,加剧了评估不公平。解决方案的关键在于提出BRIDGE框架——一种面向低资源评估场景的跨组数据生成方法,通过将大量高分非ELL样本中的构念相关(construct-relevant)内容“粘贴”至真实ELL语言模式中,合成高质量高分ELL样本,并引入判别器模型保障合成样本质量,从而有效缓解偏见并提升公平性表现。
链接: https://arxiv.org/abs/2602.23580
作者: Yun Wang,Xuansheng Wu,Jingyuan Huang,Lei Liu,Xiaoming Zhai,Ninghao Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 1 figure
Abstract:In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students who even demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by “pasting” construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
[NLP-39] Multi-Agent Causal Reasoning for Suicide Ideation Detection Through Online Conversations
【速读】: 该论文旨在解决当前基于社交平台的自杀风险识别方法中存在的两大问题:一是现有方法依赖预定义规则(如引用或回复)来记录对话,仅能捕捉用户互动的有限范围;二是忽略了诸如用户从众效应和自杀模仿行为等隐性影响因素,这些因素可能显著改变在线社区中自杀表达的传播模式。解决方案的关键在于提出多智能体因果推理(Multi-Agent Causal Reasoning, MACR)框架,其核心机制包括两个协同工作的组件:一是推理智能体(Reasoning Agent),基于认知评估理论生成反事实用户反应以扩展交互数据,并通过认知、情感与行为三个维度进行结构化分析;二是偏见感知决策智能体(Bias-aware Decision-Making Agent),利用前门调整策略缓解由隐性因素引发的偏见。MACR通过二者协作,在减少隐性偏见的同时,引入反事实知识丰富交互上下文信息,从而提升自杀风险识别的准确性与鲁棒性。
链接: https://arxiv.org/abs/2602.23577
作者: Jun Li,Xiangmeng Wang,Haoyang Li,Yifei Yan,Shijie Zhang,Hong Va Leong,Ling Feng,Nancy Xiaonan Yu,Qing Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Suicide remains a pressing global public health concern. While social media platforms offer opportunities for early risk detection through online conversation trees, existing approaches face two major limitations: (1) They rely on predefined rules (e.g., quotes or replies) to log conversations that capture only a narrow spectrum of user interactions, and (2) They overlook hidden influences such as user conformity and suicide copycat behavior, which can significantly affect suicidal expression and propagation in online communities. To address these limitations, we propose a Multi-Agent Causal Reasoning (MACR) framework that collaboratively employs a Reasoning Agent to scale user interactions and a Bias-aware Decision-Making Agent to mitigate harmful biases arising from hidden influences. The Reasoning Agent integrates cognitive appraisal theory to generate counterfactual user reactions to posts, thereby scaling user interactions. It analyses these reactions through structured dimensions, i.e., cognitive, emotional, and behavioral patterns, with a dedicated sub-agent responsible for each dimension. The Bias-aware Decision-Making Agent mitigates hidden biases through a front-door adjustment strategy, leveraging the counterfactual user reactions produced by the Reasoning Agent. Through the collaboration of reasoning and bias-aware decision making, the proposed MACR framework not only alleviates hidden biases, but also enriches contextual information of user interactions with counterfactual knowledge. Extensive experiments on real-world conversational datasets demonstrate the effectiveness and robustness of MACR in identifying suicide risk.
[NLP-40] France or Spain or Germany or France: A Neural Account of Non-Redundant Redundant Disjunctions
【速读】: 该论文试图解决语言中看似冗余实则可接受的句法现象(如“她将去法国或西班牙,或者也许去德国或法国”)在不同语境下为何具有语义合理性的问题。传统分析多依赖符号形式化方法,而本文提出一种基于人工神经机制的补充解释:其关键在于语言模型通过两种相互作用的机制实现冗余规避——一是将上下文相关的语义信息绑定到重复词汇项上,二是Transformer结构中的归纳头(induction heads)选择性地关注这些由上下文授权的表示。这一神经机制揭示了语境敏感语义解释的内在运作方式,并与既有符号分析形成互补。
链接: https://arxiv.org/abs/2602.23547
作者: Sasha Boguraev,Qing Yao,Kyle Mahowald
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 figures
Abstract:Sentences like “She will go to France or Spain, or perhaps to Germany or France.” appear formally redundant, yet become acceptable in contexts such as “Mary will go to a philosophy program in France or Spain, or a mathematics program in Germany or France.” While this phenomenon has typically been analyzed using symbolic formal representations, we aim to provide a complementary account grounded in artificial neural mechanisms. We first present new behavioral evidence from humans and large language models demonstrating the robustness of this apparent non-redundancy across contexts. We then show that, in language models, redundancy avoidance arises from two interacting mechanisms: models learn to bind contextually relevant information to repeated lexical items, and Transformer induction heads selectively attend to these context-licensed representations. We argue that this neural explanation sheds light on the mechanisms underlying context-sensitive semantic interpretation, and that it complements existing symbolic analyses.
[NLP-41] Humans and LLMs Diverge on Probabilistic Inferences
【速读】: 该论文试图解决当前大语言模型(Large Language Models, LLMs)在处理非确定性、概率性推理任务时表现不足的问题,即模型在面对有限信息下需做出概率性推断的场景中,难以模拟人类的 graded(分级的)和多样化的判断分布。解决方案的关键在于构建了一个名为 ProbCOPA 的高质量人工标注数据集,其中包含 210 条英文概率性推理题,每条均经由 25–30 名人类参与者标注其推理可能性;通过对比人类标注分布与八种前沿推理 LLM 的输出,揭示了模型在概率分布建模上的系统性偏差,并进一步分析了 LLM 推理链中的共通模式,从而强调了在非确定性语境下评估模型推理能力的重要性。
链接: https://arxiv.org/abs/2602.23546
作者: Gaurav Kamath,Sreenath Madathil,Sebastian Schuster,Marie-Catherine de Marneffe,Siva Reddy
机构: McGill University (麦吉尔大学); Mila – Quebec AI Institute (蒙特利尔学习算法研究所); University of Vienna (维也纳大学); FNRS – UCLouvain (比利时国家科学基金会–鲁汶大学); Canada CIFAR AI Chair (加拿大 CIFAR 人工智能主席)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25–30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
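上文提到将人类的分级判断分布与模型输出分布进行比较。衡量两个同支撑集离散分布差异的一种常见做法是总变差距离(total variation distance)。以下为一个与论文实现无关的纯 Python 最小示意,其中的分布数值均为假设数据:

```python
def total_variation(p, q):
    """两个同支撑集离散分布之间的总变差距离:绝对差之和的一半。"""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# 假设的 5 级似然量表上的响应分布
human = [0.05, 0.15, 0.30, 0.35, 0.15]   # 分级、较分散的人类判断
model = [0.00, 0.00, 0.05, 0.15, 0.80]   # 尖峰、过度自信的模型输出
print(round(total_variation(human, model), 2))  # → 0.65
```

总变差距离越大,说明模型分布偏离人类判断分布越远;取值范围为 [0, 1]。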
[NLP-42] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
【速读】: 该论文旨在解决工业自然语言处理(Natural Language Processing, NLP)中从非结构化文档中提取结构化信息的挑战,尤其针对多文档包处理、复杂推理以及严格合规性要求下传统流水线方法失效的问题。其解决方案的关键在于提出IDP(Intelligent Document Processing)加速器框架,包含四大核心组件:(1) DocSplit,通过BIO标注的多模态分类器实现复杂文档包的分割;(2) 可配置的抽取模块,利用多模态大语言模型(Multimodal Large Language Models, LLMs)将非结构化内容转化为结构化数据;(3) 遵循Model Context Protocol (MCP) 的代理分析模块,通过安全沙箱代码执行提供数据访问;(4) 规则验证模块,以LLM驱动逻辑替代确定性引擎,实现复杂合规性检查。该方案在医疗等行业实现高精度分类(98%)、显著降低延迟(减少80%)和运营成本(降低77%),并开源部署以支持社区使用。
链接: https://arxiv.org/abs/2602.23481
作者: Md Mofijul Islam,Md Sirajus Salekin,Joe King,Priyashree Roy,Vamsi Thilak Gudi,Spencer Romo,Akhil Nooney,Boyi Xie,Bob Strahan,Diego A. Socolinsky
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) Agentic Analytics Module, compliant with the Model Context Protocol (MCP) providing data access through secure, sandboxed code execution; and (4) Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% reduced processing latency, and 77% lower operational costs over legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
[NLP-43] FHIRPath-QA: Executable Question Answering over FHIR Electronic Health Records LREC2026
【速读】: 该论文旨在解决患者在使用电子健康记录(EHR)时,现有界面难以提供精准、可信的个性化问答支持的问题。当前基于大语言模型(LLM)的临床问答(QA)方法多依赖检索机制,存在计算效率低、幻觉严重及难以部署于真实EHR环境等局限。其解决方案的关键在于提出一种“文本到FHIRPath问答”(text-to-FHIRPath QA)范式,将原本依赖自由文本生成的推理过程转变为FHIRPath查询合成任务,从而显著减少对LLM的依赖并提升答案的可解释性与准确性。该方法构建于MIMIC-IV数据集之上,包含14,000余条自然语言问题及其对应的标准化FHIRPath查询和答案,为未来开发安全、高效且具备互操作性的患者端健康应用提供了可靠基础。
链接: https://arxiv.org/abs/2602.23479
作者: Michael Frew,Nishit Bheda,Bryan Tripp
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to LREC 2026 CL4Health Workshop
Abstract:Though patients are increasingly granted digital access to their electronic health records (EHRs), existing interfaces may not support precise, trustworthy answers to patient-specific questions. Large language models (LLMs) show promise in clinical question answering (QA), but retrieval-based approaches are computationally inefficient, prone to hallucination, and difficult to deploy over real-life EHRs. In this work, we introduce FHIRPath-QA, the first open dataset and benchmark for patient-specific QA that includes open-standard FHIRPath queries over real-world clinical data. We propose a text-to-FHIRPath QA paradigm that shifts reasoning from free-text generation to FHIRPath query synthesis, significantly reducing LLM usage. Built on MIMIC-IV on FHIR Demo, the dataset pairs over 14k natural language questions in patient and clinician phrasing with validated FHIRPath queries and answers. Further, we demonstrate that state-of-the-art LLMs struggle to deal with ambiguity in patient language and perform poorly in FHIRPath query synthesis. However, they benefit strongly from supervised fine-tuning. Our results highlight that text-to-FHIRPath synthesis has the potential to serve as a practical foundation for safe, efficient, and interoperable consumer health applications, and our dataset and benchmark serve as a starting point for future research on the topic. The full dataset and generation code are available at: this https URL.
[NLP-44] CiteAudit: You Cited It But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学写作中生成虚假引用(hallucinated citations)的问题,即模型可能编造看似合理但实际上不存在的文献,从而破坏学术引用的准确性与可信度。其解决方案的关键在于构建了一个端到端的多智能体验证框架,通过分解 citation 检查任务为 claim extraction(主张提取)、evidence retrieval(证据检索)、passage matching(段落匹配)、reasoning(推理)和 calibrated judgment(校准判断)五个步骤,实现对引文真实性和证据一致性的系统性评估,并基于跨领域的高质量人工标注数据集定义了统一的评估指标,显著提升了检测准确率与可解释性。
链接: https://arxiv.org/abs/2602.23452
作者: Zhengqing Yuan,Kaiwen Shi,Zheyuan Zhang,Lichao Sun,Nitesh V. Chawla,Yanfang Ye
机构: University of Notre Dame (圣母大学); Lehigh University (利哈伊大学)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注:
Abstract:Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.
[NLP-45] EvoX: Meta-Evolution for Automated Discovery
【速读】: 该论文旨在解决现有AI驱动的进化优化方法中搜索策略固定不变的问题,即在优化过程中无法根据任务进展或搜索空间的变化动态调整策略,从而限制了算法在复杂多变场景下的适应性和性能。解决方案的关键在于提出EvoX,一种自适应进化方法,其核心机制是将候选解与生成它们的搜索策略共同演化:通过持续评估历史解的表现并反馈至搜索策略本身,EvoX能够动态调整如何选择和变异先前解,从而实现搜索策略的在线优化与自适应切换,显著提升了在近200个真实世界优化任务中的泛化能力和效率。
链接: https://arxiv.org/abs/2602.23413
作者: Shu Liu,Shubham Agarwal,Monishwaran Maheswaran,Mert Cemri,Zhifei Li,Qiuyang Mang,Ashwin Naren,Ethan Boneh,Audrey Cheng,Melissa Z. Pan,Alexander Du,Kurt Keutzer,Alexandros G. Dimakis,Koushik Sen,Matei Zaharia,Ion Stoica
机构: UC Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学); Bespoke Labs (定制实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Recent work such as AlphaEvolve has shown that combining LLM-driven optimization with evolutionary search can effectively improve programs, prompts, and algorithms across domains. In this paradigm, previously evaluated solutions are reused to guide the model toward new candidate solutions. Crucially, the effectiveness of this evolution process depends on the search strategy: how prior solutions are selected and varied to generate new candidates. However, most existing methods rely on fixed search strategies with predefined knobs (e.g., explore-exploit ratios) that remain static throughout execution. While effective in some settings, these approaches often fail to adapt across tasks, or even within the same task as the search space changes over time. We introduce EvoX, an adaptive evolution method that optimizes its own evolution process. EvoX jointly evolves candidate solutions and the search strategies used to generate them, continuously updating how prior solutions are selected and varied based on progress. This enables the system to dynamically shift between different search strategies during the optimization process. Across nearly 200 real-world optimization tasks, EvoX outperforms existing AI-driven evolutionary methods including AlphaEvolve, OpenEvolve, GEPA, and ShinkaEvolve on the majority of tasks.
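EvoX 联合演化候选解与搜索策略的具体机制,摘要中并未给出实现细节。作为概念层面的示意,下面是一个极简的"自适应搜索策略"循环:采用 (1+1)-ES 搭配经典的 1/5 成功法则,让变异步长(此处作为"搜索策略"的最简代理)随搜索进展自我调整。该示例纯属说明,并非论文方法:

```python
import random

def adaptive_evolve(fitness, x0, steps=300, seed=0):
    """玩具级自适应演化:(1+1)-ES,变异步长 sigma 按 1/5 成功法则自我更新。"""
    rng = random.Random(seed)
    x, sigma = x0, 1.0
    best = fitness(x)
    for _ in range(steps):
        cand = x + rng.gauss(0, sigma)
        f = fitness(cand)
        if f > best:           # 成功:接受候选解,放大搜索范围(偏探索)
            x, best = cand, f
            sigma *= 1.5
        else:                  # 失败:围绕现任解收窄搜索范围(偏利用)
            sigma *= 0.85
    return x, best

# 最大化一个简单单峰目标,最优点在 x = 3
x, best = adaptive_evolve(lambda v: -(v - 3.0) ** 2, x0=0.0)
print(round(x, 2))
```

与固定探索/利用比例的方法不同,这里的步长会在远离最优点时自发增大、接近最优点时自发收缩,呼应了"搜索策略随搜索空间变化而调整"的思想。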
[NLP-46] Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages LREC2026
【速读】: 该论文旨在解决低资源语言(尤其是印度语种)在语音技术研究中因数据稀缺而导致的跨任务应用受限问题,其核心挑战在于现有数据集通常仅针对单一任务进行标注与整理,缺乏对多任务适用性的系统性评估。解决方案的关键在于提出Task-Lens框架,通过跨任务分析50个印度语音数据集(涵盖26种语言)在9类下游语音任务中的可用性,识别可复用的元数据并提出任务对齐增强策略,从而挖掘已有数据集的多任务潜力,并揭示当前资源严重不足的任务与语言方向,为后续数据构建提供优先级指导。
链接: https://arxiv.org/abs/2602.23388
作者: Swati Sharma,Divya V. Sharma,Anubha Gupta
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted at LREC 2026
Abstract:The rising demand for inclusive speech technologies amplifies the need for multilingual datasets for Natural Language Processing (NLP) research. However, limited awareness of existing task-specific resources in low-resource languages hinders research. This challenge is especially acute in linguistically diverse countries, such as India. Cross-task profiling of existing Indian speech datasets can alleviate the data scarcity challenge. This involves investigating the utility of datasets across multiple downstream tasks rather than focusing on a single task. Prior surveys typically catalogue datasets for a single task, leaving comprehensive cross-task profiling as an open opportunity. Therefore, we propose Task-Lens, a cross-task survey that assesses the readiness of 50 Indian speech datasets spanning 26 languages for nine downstream speech tasks. First, we analyze which datasets contain metadata and properties suitable for specific tasks. Next, we propose task-aligned enhancements to unlock datasets to their full downstream potential. Finally, we identify tasks and Indian languages that are critically underserved by current resources. Our findings reveal that many Indian speech datasets contain untapped metadata that can support multiple downstream tasks. By uncovering cross-task linkages and gaps, Task-Lens enables researchers to explore the broader applicability of existing datasets and to prioritize dataset creation for underserved tasks and languages.
[NLP-47] Hello-Chat: Towards Realistic Social Audio Interactions
【速读】: 该论文旨在解决当前大型音频语言模型(Large Audio Language Models, LALMs)在语音生成中普遍存在的“感知-表达断层”问题,即模型输出缺乏自然语调、情感共鸣和人类互动的 spontaneity(自发性),表现为机械化的“读音式”语音。解决方案的关键在于提出一个端到端的音频语言模型 Hello-Chat,其核心创新包括:利用真实生活对话的大规模数据集进行训练,并采用模态交错(modality-interleaved)的训练策略,从而显著提升语音的韵律自然度(prosodic naturalness)和情感一致性(emotional alignment),实现更具人性化的语音生成能力。
链接: https://arxiv.org/abs/2602.23387
作者: Yueran Hou,Peilei Jia,Zihan Sun,Qihang Lu,Wenbing Yang,Yingming Gao,Ya Li,Jun Gao
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent advancements in Large Audio Language Models (LALMs) have demonstrated exceptional performance in speech recognition and translation. However, existing models often suffer from a disconnect between perception and expression, resulting in a robotic “read-speech” style that lacks the spontaneity and emotional resonance of real human interaction. In this report, we introduce Hello-Chat, an end-to-end audio language model designed for realistic social scenarios. By leveraging a massive dataset of real-life conversations and employing a modality-interleaved training strategy, Hello-Chat achieves a breakthrough in anthropomorphic generation. Experimental results show that our model not only reaches state-of-the-art (SOTA) performance on specific audio understanding tasks but also significantly outperforms existing baselines in prosodic naturalness and emotional alignment, paving the way for the next generation of empathetic AI agents.
信息检索
[IR-0] Resources for Automated Evaluation of Assistive RAG Systems that Help Readers with News Trustworthiness Assessment
【速读】:该论文旨在解决用户在面对在线新闻时难以判断其可信度的问题,尤其是在可靠报道与虚假信息共存的背景下。为此,研究者设计并组织了TREC 2025 DRAGUN(Detection, Retrieval, and Augmented Generation for Understanding News)任务,通过构建基于检索增强生成(Retrieval-Augmented Generation, RAG)的辅助系统来支持读者对新闻可信度的评估。解决方案的关键在于:首先,开发了两个核心任务——问题生成(Task 1)和报告生成(Task 2),分别用于生成高相关性调查问题和基于MS MARCO V2.1 Segmented Corpus的结构化报告;其次,引入由TREC评估员制定的重要性加权评分标准(rubrics),明确读者评估新闻可信度所需的关键信息点,并据此对参赛系统输出进行人工评判;最后,建立自动评估机制(AutoJudge),利用这些人工标注的rubrics实现对新提交结果的自动化排序,其性能与人工评估高度一致(Task 1的Kendall’s τ = 0.678,Task 2的τ = 0.872),从而为RAG系统在新闻可信度辅助评估中的持续评测与优化提供了可复用资源和基准。
链接: https://arxiv.org/abs/2602.24277
作者: Dake Zhang,Mark D. Smucker,Charles L. A. Clarke
机构: University of Waterloo (滑铁卢大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provided a venue for researchers to develop and evaluate assistive RAG systems that support readers’ news trustworthiness assessment by producing reader-oriented, well-attributed reports. As the organizers of the DRAGUN track, we describe the resources that we have newly developed to allow for the reuse of the track’s tasks. The track had two tasks: (Task 1) Question Generation, producing 10 ranked investigative questions; and (Task 2, the main task) Report Generation, producing a 250-word report grounded in the MS MARCO V2.1 Segmented Corpus. As part of the track’s evaluation, we had TREC assessors create importance-weighted rubrics of questions with expected short answers for 30 different news articles. These rubrics represent the information that assessors believe is important for readers to assess an article’s trustworthiness. The assessors then used their rubrics to manually judge the participating teams’ submitted runs. To make these tasks and their rubrics reusable, we have created an automated process to judge runs not part of the original assessing. We show that our AutoJudge ranks existing runs well compared to the TREC human-assessed evaluation (Kendall’s \tau = 0.678 for Task 1 and \tau = 0.872 for Task 2). These resources enable both the evaluation of RAG systems for assistive news trustworthiness assessment and, with the human evaluation as a benchmark, research on improving automated RAG evaluation.
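文中以 Kendall's τ 衡量 AutoJudge 产生的排名与 TREC 人工评估排名的一致性(Task 1 为 0.678,Task 2 为 0.872)。作为该指标含义的参考,下面给出 τ(tau-a 变体,不处理并列值)的纯 Python 最小实现,示例分数均为假设数据:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a:(一致对数 - 不一致对数) / 总对数。"""
    assert len(x) == len(y)
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# 假设的 5 个 run 的得分:人工评估 vs. 自动评估
human = [3.0, 2.5, 2.0, 1.0, 0.5]
auto  = [2.8, 1.5, 2.6, 1.2, 0.4]   # 有一对次序颠倒
print(kendall_tau(human, auto))  # → 0.8
```

τ = 1 表示两套排名完全一致,τ = -1 表示完全相反;文中 0.872 的取值说明 AutoJudge 对 Task 2 的排序与人工评估高度吻合。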
[IR-1] Beyond the Click: A Framework for Inferring Cognitive Traces in Search
【速读】:该论文试图解决现有用户模拟器(user simulator)在评估搜索系统时存在的局限性问题,即当前模拟器仅能复制用户行为而无法捕捉其背后的认知过程(cognitive traces),如困惑、满意等心理状态。为填补这一空白,论文提出了一种基于信息觅食理论(Information Foraging Theory, IFT)与人类专家判断相结合的多智能体框架,用于从大规模交互日志中推断用户的认知轨迹。该方法的关键在于利用IFT建模用户的信息搜索决策机制,并融合专家标注以增强对隐含认知状态的识别能力,从而显著提升模型在会话结果预测和用户困境恢复等任务上的性能。
链接: https://arxiv.org/abs/2602.24265
作者: Saber Zerhoudi,Michael Granitzer
机构: University of Passau (帕绍大学); Interdisciplinary Transformation University Austria (跨学科转型大学奥地利)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:
Abstract:User simulators are essential for evaluating search systems, but they primarily copy user actions without understanding the underlying thought process. This gap exists since large-scale interaction logs record what users do, but not what they might be thinking or feeling, such as confusion or satisfaction. To solve this problem, we present a framework to infer cognitive traces from behavior logs. Our method uses a multi-agent system grounded in Information Foraging Theory (IFT) and human expert judgment. These traces improve model performance on tasks like forecasting session outcomes and user struggle recovery. We release a collection of annotations for several public datasets, including AOL and Stack Overflow, and an open-source tool that allows researchers to apply our method to their own data. This work provides the tools and data needed to build more human-like user simulators and to assess retrieval systems on user-oriented dimensions of performance.
[IR-2] UXSim: Towards a Hybrid User Search Simulation
【速读】:该论文旨在解决传统方法在模拟复杂交互式搜索系统中用户体验时面临的挑战,即依赖静态用户代理或独立的大语言模型(Large Language Model, LLM)代理,这些方法往往缺乏深度且可验证的现实基础。解决方案的关键在于提出UXSim框架,通过融合传统模拟器提供的有据可依的数据来引导和约束一个自适应LLM代理的推理过程,从而实现更精确、动态的用户行为模拟,并为底层认知过程提供可解释的验证路径。
链接: https://arxiv.org/abs/2602.24241
作者: Saber Zerhoudi,Michael Granitzer
机构: University of Passau (帕绍大学); IT:U Austria (IT:U 奥地利)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注:
Abstract:Simulating nuanced user experiences within complex interactive search systems poses distinct challenge for traditional methodologies, which often rely on static user proxies or, more recently, on standalone large language model (LLM) agents that may lack deep, verifiable grounding. The true dynamism and personalization inherent in human-computer interaction demand a more integrated approach. This work introduces UXSim, a novel framework that integrates both approaches. It leverages grounded data from traditional simulators to inform and constrain the reasoning of an adaptive LLM agent. This synthesis enables more accurate and dynamic simulations of user behavior while also providing a pathway for the explainable validation of the underlying cognitive processes.
[IR-3] Science Fiction and Fantasy in Wikipedia: Exploring Structural and Semantic Cues
【速读】:该论文试图解决如何准确识别维基百科中与科幻(Science Fiction, SF)和奇幻(Fantasy)及其混合类型相关的内容这一问题,其核心挑战在于这些类型的边界模糊且经常重叠。解决方案的关键在于综合利用维基百科的结构化信息(如分类标签、内部链接(wikilinks)及对应的Wikidata声明)与语义特征,从而构建更鲁棒的识别模型,以克服单一信号因社区惯例导致的偏倚或不完整性问题。
链接: https://arxiv.org/abs/2602.24229
作者: Włodzimierz Lewoniewski,Milena Stróżyna,Izabela Czumałowska,Elżbieta Lewańska
机构: Poznań University of Economics and Business (波兹南经济与工商大学)
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
备注: Supplementary materials: this https URL
Abstract:Identifying which Wikipedia articles are related to science fiction, fantasy, or their hybrids is challenging because genre boundaries are porous and frequently overlap. Wikipedia nonetheless offers machine-readable structure beyond text, including categories, internal links (wikilinks), and statements of corresponding Wikidata items. However, each of these signals reflects community conventions and can be biased or incomplete. This study examines structural and semantic features of Wikipedia articles that can be used to identify content related to science fiction and fantasy (SF/F).
[IR-4] Recommendation Algorithms: A Comparative Study in Movie Domain
【速读】:该论文旨在解决电影推荐系统中如何提高推荐准确性的核心问题,特别是在用户评分预测任务中的建模优化。其解决方案的关键在于构建一个基于回归的模型,利用从Netflix挑战数据集提取的多种特征(包括聚合特征、基于矩阵分解(Matrix Factorization, MF)的特征以及用户与电影相似性特征)作为输入,并结合XGBoost回归算法进行预测。实验表明,MF-based方法在Root Mean Square Error (RMSE)指标上表现最优,说明融合协同过滤思想的特征工程与机器学习回归模型相结合,是提升推荐精度的有效策略。
链接: https://arxiv.org/abs/2602.24125
作者: Rohit Chivukula,T. Jaya Lakshmi,Hemlata Sharma,C.H.S.N.P. Sairam Rallabandi
机构: 未知
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注:
Abstract:Intelligent recommendation systems have clearly increased the revenue of well-known e-commerce firms. Users receive product recommendations from recommendation systems. Cinematic recommendations are made to users by a movie recommendation system. There have been numerous approaches to the problem of recommendation in the literature. It is viewed as a regression task in this research. A regression model was built using novel properties extracted from the dataset and used as features in the model. For experimentation, the Netflix challenge dataset has been used. Video streaming service Netflix is a popular choice for many. Customers' prior viewing habits are taken into account when Netflix makes movie recommendations to them. An exploratory data analysis on the Netflix dataset was conducted to gain insights into user rating behaviour and movie characteristics. Various kinds of features, including aggregating, Matrix Factorization (MF) based, and user and movie similarity based, have been extracted in the subsequent stages. The K-Nearest Neighbors and MF algorithms from Python's Surprise library are used for recommendations, and their predictions also serve as features in the XGBoost regression algorithm. Based on Root Mean Square Error (RMSE), MF-based algorithms have provided the best recommendations.
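摘要中的矩阵分解(MF)特征与 RMSE 评价指标,可以用一个极简的 SGD 矩阵分解草图来说明。以下为玩具数据上的示意实现(非论文的实验设置,超参数均为随意取值):

```python
import math
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=1000, seed=0):
    """极简矩阵分解:用 SGD 拟合用户因子 P 与物品因子 Q,使 P·Q 逼近评分。"""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            e = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (e * qi - reg * pu)   # 带 L2 正则的梯度步
                Q[i][f] += lr * (e * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    """均方根误差:论文用来比较各推荐算法的指标。"""
    se = [(r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))) ** 2
          for u, i, r in ratings]
    return math.sqrt(sum(se) / len(se))

# 玩具评分三元组:(用户, 电影, 评分)
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 4), (2, 2, 5)]
P, Q = train_mf(data, n_users=3, n_items=3)
print(round(rmse(data, P, Q), 3))
```

实际系统(如文中使用的 Surprise 库)在此基础上还会加入全局/用户/物品偏置项等;此处仅展示"分解-预测-算 RMSE"的核心流程。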
[IR-5] Colour Contrast on the Web: A WCAG 2.1 Level AA Compliance Audit of Common Crawl's Top 500 Domains
【速读】:该论文旨在解决网页色彩对比度不合规对用户可访问性(Accessibility)造成的广泛障碍问题,特别是针对WCAG 2.1/2.2 Level AA标准中规定的4.5:1正常文本对比度阈值。其解决方案的关键在于利用Common Crawl的公开Warc归档数据进行大规模静态CSS分析,而非实时爬取网页,从而实现高可复现性并避免对目标服务器造成负载;通过对500个最常抓取域名中的240个主页进行分析,识别出4,327种前景/背景颜色组合,发现其中40.9%不符合对比度要求,揭示了主流网站在色彩无障碍方面存在的系统性问题及显著的领域差异。
链接: https://arxiv.org/abs/2602.24067
作者: Thom Vaughan,Pedro Ortiz Suarez
机构: Common Crawl Foundation(公共爬虫基金会)
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
备注: 8 pages, 4 tables. Companion website and reproducible analysis code available at this https URL and this https URL
Abstract:We present a large-scale automated audit of WCAG 2.1/2.2 Level AA colour contrast compliance across the 500 most frequently crawled registered domains in Common Crawl’s CC-MAIN-2026-08 February 2026 crawl archive. Rather than conducting a live crawl, all page content was sourced from Common Crawl’s open WARC archives, ensuring reproducibility and eliminating any load on target web servers. Our static CSS analysis of 240 homepages identified 4,327 unique foreground/background colour pairings, of which 1,771 (40.9%) failed to meet the 4.5:1 contrast ratio threshold for normal text. The median per-site pass rate was 62.7%, with 20.4% of sites achieving full compliance across all detected colour pairings. These findings suggest that colour contrast remains a widespread accessibility barrier on the most prominent websites, with significant variation across domain categories.
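文中使用的 4.5:1 阈值来自 WCAG 的对比度定义:先将 sRGB 分量线性化求相对亮度 L,再计算 (较亮者 L + 0.05) / (较暗者 L + 0.05)。以下为该公式的独立 Python 实现(并非论文的审计代码):

```python
def relative_luminance(rgb):
    """按 WCAG 定义计算 sRGB 颜色(0-255 整数分量)的相对亮度。"""
    def lin(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG 对比度:(较亮亮度 + 0.05) / (较暗亮度 + 0.05)。"""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

# 黑底白字:对比度达到最大值 21:1
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # → 21.0
# 白底中灰文字约 3:1,未达正常文本的 AA 标准 4.5:1
print(contrast_ratio((150, 150, 150), (255, 255, 255)) >= 4.5)  # → False
```

文中 40.9% 的失败颜色对,即指按此公式计算低于 4.5:1 的前景/背景组合。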
[IR-6] GPU-Native Approximate Nearest Neighbor Search with IVF-RaBitQ: Fast Index Build and Search
【速读】:该论文旨在解决在GPU上实现近似最近邻搜索(Approximate Nearest Neighbor Search, ANNS)时面临的多目标优化难题:如何同时实现快速索引构建、高吞吐量搜索、高召回率以及低存储开销。现有方法中,基于图的索引虽能提供高召回率但构建和存储成本高昂;而基于聚类的方法虽然构建高效且可扩展,却往往需要大量探测次数以达到高召回率,导致内存带宽和计算资源紧张。解决方案的关键在于提出一种GPU原生的IVF-RaBitQ (GPU)框架,将聚类方法IVF与RaBitQ量化技术深度融合,形成高效的GPU索引构建与搜索流水线:一方面设计了可扩展的GPU原生RaBitQ量化方法,实现大规模低比特编码的快速准确处理;另一方面开发了针对RaBitQ码的GPU原生距离计算方案及融合搜索核,显著提升吞吐量并保持高召回率。实验表明,在召回率约0.95时,IVF-RaBitQ相比最先进的图方法CAGRA提升2.2倍查询每秒(QPS),索引构建速度平均快7.7倍,相较聚类方法IVF-PQ平均吞吐量提升2.7倍且无需访问原始向量进行重排序。
链接: https://arxiv.org/abs/2602.23999
作者: Jifan Shi,Jianyang Gao,James Xia,Tamás Béla Fehér,Cheng Long
机构: Nanyang Technological University (南洋理工大学); NVIDIA (英伟达)
类目: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
备注:
Abstract:Approximate nearest neighbor search (ANNS) on GPUs is gaining increasing popularity for modern retrieval and recommendation workloads that operate over massive high-dimensional vectors. Graph-based indexes deliver high recall and throughput but incur heavy build-time and storage costs. In contrast, cluster-based methods build and scale efficiently yet often need many probes for high recall, straining memory bandwidth and compute. Aiming to simultaneously achieve fast index build, high-throughput search, high recall, and low storage requirement for GPUs, we present IVF-RaBitQ (GPU), a GPU-native ANNS solution that integrates the cluster-based method IVF with RaBitQ quantization into an efficient GPU index build/search pipeline. Specifically, for index build, we develop a scalable GPU-native RaBitQ quantization method that enables fast and accurate low-bit encoding at scale. For search, we develop GPU-native distance computation schemes for RaBitQ codes and a fused search kernel to achieve high throughput with high recall. With IVF-RaBitQ implemented and integrated into the NVIDIA cuVS Library, experiments on cuVS Bench across multiple datasets show that IVF-RaBitQ offers a strong performance frontier in recall, throughput, index build time, and storage footprint. For Recall approximately equal to 0.95, IVF-RaBitQ achieves 2.2x higher QPS than the state-of-the-art graph-based method CAGRA, while also constructing indices 7.7x faster on average. Compared to the cluster-based method IVF-PQ, IVF-RaBitQ delivers on average over 2.7x higher throughput while avoiding accessing the raw vectors for reranking.
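论文的贡献在于 GPU 原生的构建/搜索流水线与 RaBitQ 量化;其底层的 IVF(倒排文件)思想本身可以用一个 CPU 上的玩具示例说明:向量按最近质心分配到倒排列表,查询时只扫描 nprobe 个最近质心的列表。此处质心假设已预先训练好(例如经 k-means 得到),数据均为假设:

```python
import math
from collections import defaultdict

def build_ivf(vectors, centroids):
    """把每个向量分配到离它最近的质心对应的倒排列表。"""
    lists = defaultdict(list)
    for idx, v in enumerate(vectors):
        c = min(range(len(centroids)), key=lambda j: math.dist(v, centroids[j]))
        lists[c].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """只扫描 nprobe 个最近质心的倒排列表,返回其中距查询最近的向量下标。"""
    probed = sorted(range(len(centroids)),
                    key=lambda j: math.dist(query, centroids[j]))[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    return min(candidates, key=lambda i: math.dist(query, vectors[i]))

vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # 假设已训练好的质心
lists = build_ivf(vectors, centroids)
print(ivf_search((5.1, 5.1), vectors, centroids, lists, nprobe=1))  # → 2
```

摘要中提到的"高召回需要更多探测次数",对应这里增大 nprobe:扫描更多列表,代价是更高的带宽与计算开销;IVF-RaBitQ 通过低比特量化降低每个候选的访问成本。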
[IR-7] Robust Aggregation for Federated Sequential Recommendation with Sparse and Poisoned Data
【速读】:该论文旨在解决联邦序列推荐(Federated Sequential Recommendation)中两个相互交织的挑战:一是客户端由于交互序列短且高度稀疏,导致用户表征学习可靠性低;二是联邦优化过程易受恶意或损坏客户端更新的影响,中毒梯度会显著扭曲全局模型。解决方案的关键在于提出一种防御感知的聚合框架,通过识别并降低不可靠客户端更新的权重,同时保留来自稀疏但良性参与者的有效信号;此外,引入表示层约束以稳定用户和物品嵌入,防止中毒或异常贡献主导全局参数空间,并结合序列感知正则化机制,在局部观测有限的情况下维持用户建模的时间一致性。
链接: https://arxiv.org/abs/2602.23982
作者: Minh Hieu Nguyen
机构: 未知
类目: Information Retrieval (cs.IR)
备注:
Abstract:Federated sequential recommendation distributes model training across user devices so that behavioural data remains local, reducing privacy risks. Yet, this setting introduces two intertwined difficulties. On the one hand, individual clients typically contribute only short and highly sparse interaction sequences, limiting the reliability of learned user representations. On the other hand, the federated optimisation process is vulnerable to malicious or corrupted client updates, where poisoned gradients can significantly distort the global model. These challenges are particularly severe in sequential recommendation, where temporal dynamics further complicate signal aggregation. To address this problem, we propose a robust aggregation framework tailored for federated sequential recommendation under sparse and adversarial conditions. Instead of relying on standard averaging, our method introduces a defence-aware aggregation mechanism that identifies and down-weights unreliable client updates while preserving informative signals from sparse but benign participants. The framework incorporates representation-level constraints to stabilise user and item embeddings, preventing poisoned or anomalous contributions from dominating the global parameter space. In addition, we integrate sequence-aware regularisation to maintain temporal coherence in user modelling despite limited local observations.
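论文的防御感知聚合机制未在摘要中给出公式。作为同类思想的一个经典基线示意(并非论文方法),坐标级截尾均值(coordinate-wise trimmed mean)可在聚合前剔除每个坐标上的极端值,从而"降低不可靠客户端更新的权重":

```python
def trimmed_mean_aggregate(updates, trim=1):
    """坐标级截尾均值:每个坐标剔除 trim 个最大值和最小值后再取平均,
    使中毒的极端更新无法主导聚合结果。"""
    dim = len(updates[0])
    agg = []
    for d in range(dim):
        vals = sorted(u[d] for u in updates)
        kept = vals[trim:len(vals) - trim]
        agg.append(sum(kept) / len(kept))
    return agg

# 四个良性客户端更新加一个中毒的离群更新
updates = [
    [0.10, -0.20], [0.12, -0.18], [0.09, -0.21], [0.11, -0.19],
    [9.00, -9.00],   # 中毒更新
]
print(trimmed_mean_aggregate(updates, trim=1))  # ≈ [0.11, -0.20]
```

与简单平均相比(后者会被 9.0 拉到约 1.88),截尾均值保留了稀疏但良性参与者的信号;论文的方案在此基础上进一步结合表示层约束与序列感知正则。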
[IR-8] Towards Efficient and Generalizable Retrieval: Adaptive Semantic Quantization and Residual Knowledge Transfer
【速读】:该论文旨在解决语义ID(Semantic ID)驱动的生成式检索(Generative Retrieval)方法中存在的固有权衡问题:头部项目(head items)易受ID冲突影响,从而损害下游任务性能;而数据稀疏的尾部项目(tail items),尤其是冷启动项目(cold-start items),则表现出有限的泛化能力。解决方案的关键在于提出 Anchored Curriculum with Sequential Adaptive Quantization (SA²CRQ) 框架,其核心创新包括:(1) Sequential Adaptive Residual Quantization (SARQ),通过基于项目路径熵动态分配码长,为头部项目分配更长、更具区分度的ID,为尾部项目分配更短、更具泛化能力的ID;(2) Anchored Curriculum Residual Quantization (ACRQ),利用从头部项目中学习到的冻结语义流形(frozen semantic manifold)作为正则项,加速并提升尾部项目的表征学习效果。该框架在大规模工业搜索系统和多个公开数据集上均验证了对基线方法的一致改进,尤其在冷启动检索场景下表现显著。
链接: https://arxiv.org/abs/2602.23978
作者: Huimu Wang,Xingzhi Yao,Yiming Qiu,Qinghong Zhang,Haotian Wang,Yufan Cui,Songlin Wang,Sulong Xu,Mingming Li
机构: JD.com(京东); IIE, UCAS(中国科学院信息工程研究所;中国科学院大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:While semantic ID-based generative retrieval enables efficient end-to-end modeling in industrial applications, these methods face a persistent trade-off: head items are susceptible to ID collisions that negatively impact downstream tasks, whereas data-sparse tail items, including cold-start items, exhibit limited generalization. To address this issue, we propose the Anchored Curriculum with Sequential Adaptive Quantization (SA^2CRQ) framework. The framework introduces Sequential Adaptive Residual Quantization (SARQ) to dynamically allocate code lengths based on item path entropy, assigning longer, discriminative IDs to head items and shorter, generalizable IDs to tail items. To mitigate data sparsity, the Anchored Curriculum Residual Quantization (ACRQ) component utilizes a frozen semantic manifold learned from head items to regularize and accelerate the representation learning of tail items. Experimental results from a large-scale industrial search system and multiple public datasets indicate that SA^2CRQ yields consistent improvements over existing baselines, particularly in cold-start retrieval scenarios.
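SARQ 按项目路径熵动态分配码长;下面用"残差范数足够小即提前停止"来近似这一思想,给出自适应长度残差量化的极简示意。阈值停止准则是本文示意的简化,非论文的熵准则;码本与阈值 eps 均为假设。

```python
import math

def nearest(codebook, x):
    # 返回码本中与 x 欧氏距离最近的码字下标
    def d2(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: d2(codebook[i]))

def adaptive_rq_encode(x, codebooks, eps=1e-2):
    """逐级残差量化:每一级选最近码字并从残差中减去;
    残差足够小时提前停止,得到长度自适应的语义 ID。"""
    residual = list(x)
    code = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        code.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
        if math.sqrt(sum(r * r for r in residual)) < eps:
            break  # 对头部/尾部项目自然产生不同长度的 ID
    return code, residual
```

易被第一级码本精确覆盖的向量得到短 ID,残差较大的向量则继续向更深层级量化,对应"尾部短 ID、头部长 ID"的分配思路。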
[IR-9] RAD-DPO: Robust Adaptive Denoising Direct Preference Optimization for Generative Retrieval in E-commerce
【速读】:该论文旨在解决生成式检索(Generative Retrieval, GR)在电商搜索中与复杂用户偏好对齐的难题,尤其针对直接应用直接偏好优化(Direct Preference Optimization, DPO)于结构化语义ID(Semantic IDs, SIDs)时所面临的三大挑战:(i) 共享层级前缀被错误惩罚导致梯度冲突;(ii) 由隐式反馈产生的噪声伪负例使模型易受干扰;(iii) 在多标签查询场景下,有效候选项间出现概率“挤压效应”。解决方案的关键在于提出RAD-DPO方法,其核心创新包括:token级梯度解耦以保护前缀结构完整性、基于相似度的动态奖励加权机制以缓解标签噪声影响,以及融合全局对比学习目标与全局监督微调(SFT)损失的多标签全局对比目标,从而显式扩展正样本覆盖范围,显著提升排序质量与训练效率。
链接: https://arxiv.org/abs/2602.23964
作者: Zhiguo Chen,Guohao Sun,Yiming Qiu,Xingzhi Yao,Mingming Li,Huimu Wang,Yangqi Zhang,Songlin Wang,Sulong Xu
机构: JD.com(京东); Peking University (北京大学); IIE, UCAS (中国科学院信息工程研究所;中国科学院大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Generative Retrieval (GR) has emerged as a powerful paradigm in e-commerce search, retrieving items via autoregressive decoding of Semantic IDs (SIDs). However, aligning GR with complex user preferences remains challenging. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes shared hierarchical prefixes, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives from implicit feedback; and (iii) in multi-label queries with multiple relevant items, it exacerbates a probability “squeezing effect” among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective integrated with global SFT loss to explicitly expand positive coverage. Extensive offline experiments and online A/B testing on a large-scale e-commerce platform demonstrate significant improvements in ranking quality and training efficiency.
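RAD-DPO 的"基于相似度的动态奖励加权"可以理解为:对标准 DPO 成对损失乘以一个随伪负例可疑度下降的权重。下面的示意省略了与参考模型的对数比和 token 级梯度解耦,权重形式 w = 1 - sim 为本文假设,非论文原式。

```python
import math

def dpo_loss(logp_w, logp_l, beta=0.1):
    """标准 DPO 成对损失 -log σ(β·Δ) 的简化形式:
    以正负样本对数概率之差 Δ 代替与参考模型的对数比之差。"""
    margin = beta * (logp_w - logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def weighted_dpo_loss(logp_w, logp_l, sim_neg, beta=0.1):
    # 负例与正例越相似(越可能是噪声伪负例),对损失的贡献越小
    return (1.0 - sim_neg) * dpo_loss(logp_w, logp_l, beta)
```

当正例概率远高于负例时损失趋近于 0;高相似度的伪负例几乎不再产生惩罚梯度,从而缓解隐式反馈噪声。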
[IR-10] HotelQuEST: Balancing Quality and Efficiency in Agentic Search EACL2026
【速读】:该论文旨在解决当前生成式搜索(Agentic Search)系统在实际部署中面临的两大核心问题:一是现有评估基准主要关注检索质量而忽视效率因素,导致系统虽性能优异却难以落地;二是用户查询常存在未明确表达的偏好(underspecified preferences),这一挑战在现有评估体系中尚未被充分探索。解决方案的关键在于提出HotelQuEST基准,包含214个从简单到复杂的酒店搜索查询,并通过收集标注者对隐含偏好的澄清信息,使这些偏好显式化以支持更贴近真实场景的评估。实验表明,基于大语言模型(LLM)的代理系统虽然准确率更高,但因冗余工具调用和不当的任务路由策略导致显著更高的计算成本,揭示了当前系统在效率方面的不足,并指出了面向成本优化的改进空间。
链接: https://arxiv.org/abs/2602.23949
作者: Guy Hadad,Shadi Iskander,Oren Kalinsky,Sofia Tolmach,Ran Levy,Haggai Roitman
机构: Ben-Gurion University (本-古里安大学); Amazon (亚马逊)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: To be published in EACL 2026
Abstract:Agentic search has emerged as a promising paradigm for adaptive retrieval systems powered by large language models (LLMs). However, existing benchmarks primarily focus on quality, overlooking efficiency factors that are critical for real-world deployment. Moreover, real-world user queries often contain underspecified preferences, a challenge that remains largely underexplored in current agentic search evaluation. As a result, many agentic search systems remain impractical despite their impressive performance. In this work, we introduce HotelQuEST, a benchmark comprising 214 hotel search queries that range from simple factual requests to complex queries, enabling evaluation across the full spectrum of query difficulty. We further address the challenge of evaluating underspecified user preferences by collecting clarifications that make annotators’ implicit preferences explicit for evaluation. We find that LLM-based agents achieve higher accuracy than traditional retrievers, but at substantially higher costs due to redundant tool calls and suboptimal routing that fails to match query complexity to model capability. Our analysis exposes inefficiencies in current agentic search systems and demonstrates substantial potential for cost-aware optimization.
[IR-11] EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates LREC2026
【速读】:该论文旨在解决从十八世纪历史文本中自动提取和标准化地理坐标的问题,这类文本中的地理信息常以多样且精度不一的方式表达,导致传统方法难以准确识别与归一化。解决方案的关键在于构建了一个高质量的金标准(gold standard)标注数据集,并基于此训练了基于Transformer架构的两阶段模型:第一阶段使用分类器识别包含坐标的条目,第二阶段采用编码器-解码器或仅解码器结构的模型进行坐标抽取与标准化。该方法在跨语言(法语至英语)、跨领域(不同词典)测试中均展现出良好泛化能力,验证了其有效性与实用性。
链接: https://arxiv.org/abs/2602.23941
作者: Ludovic Moncla,Pierre Nugues,Thierry Joliveau,Katherine McDonough
机构: 未知
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注: Accepted at LREC 2026
Abstract:This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d’Alembert’s eighteenth-century Encyclopedie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopedie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trevoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset’s usefulness as training data, and our two-step method’s cross-lingual, cross-domain generalizability.
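坐标"标准化"这一步的目标类似于把度分秒写法转换为十进制度。下面是一个与论文模型无关的规则示意,仅覆盖一种常见写法;论文正是因为历史文本中的表述远比此多样、精度不一,才采用基于 Transformer 的抽取与归一化模型。

```python
import re

# 匹配 "48° 51′ N" 这类度[分][秒]+半球写法;分、秒均可省略
_DMS = re.compile(r"""\s*(\d+)[°d]\s*(?:(\d+)[′'])?\s*(?:(\d+)[″"])?\s*([NSEW])""")

def dms_to_decimal(text):
    """度分秒 → 十进制度;南纬/西经取负值。无法解析时返回 None。"""
    m = _DMS.match(text)
    if not m:
        return None
    deg, minutes, seconds, hemi = m.groups()
    value = int(deg) + int(minutes or 0) / 60 + int(seconds or 0) / 3600
    return -value if hemi in "SW" else value
```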
[IR-12] UniFAR: A Unified Facet-Aware Retrieval Framework for Scientific Documents
【速读】:该论文旨在解决科学文献检索(Scientific Document Retrieval, SDR)中因模型架构与任务范式不匹配所导致的性能瓶颈问题,具体表现为:输入粒度差异(长文档 vs. 短问题)、语义焦点错位(科学论述结构 vs. 问题意图)以及训练信号不一致(基于引用的相似性 vs. 问题导向的相关性)。其解决方案的关键在于提出统一的面向特征的检索框架(Unified Facet-Aware Retrieval, UniFAR),通过自适应多粒度聚合缓解输入粒度差异,利用可学习的特征锚点(facet anchors)对齐文档结构与问题意图,并在单一架构内联合训练文档-文档(doc-doc)和问题-文档(q-doc)检索任务,从而实现两种检索范式的统一建模与性能提升。
链接: https://arxiv.org/abs/2602.23766
作者: Zheng Dou,Zhao Zhang,Deqing Wang,Yikun Ban,Fuzhen Zhuang
机构: Beihang University (北京航空航天大学)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Existing scientific document retrieval (SDR) methods primarily rely on document-centric representations learned from inter-document relationships for document-document (doc-doc) retrieval. However, the rise of LLMs and RAG has shifted SDR toward question-driven retrieval, where documents are retrieved in response to natural-language questions (q-doc). This change has led to systematic mismatches between document-centric models and question-driven retrieval, including (1) input granularity (long documents vs. short questions), (2) semantic focus (scientific discourse structure vs. specific question intent), and (3) training signals (citation-based similarity vs. question-oriented relevance). To this end, we propose UniFAR, a Unified Facet-Aware Retrieval framework to jointly support doc-doc and q-doc SDR within a single architecture. UniFAR reconciles granularity differences through adaptive multi-granularity aggregation, aligns document structure with question intent via learnable facet anchors, and unifies doc-doc and q-doc supervision through joint training. Experimental results show that UniFAR consistently outperforms prior methods across multiple retrieval tasks and base models, confirming its effectiveness and generality.
[IR-13] Recommending Search Filters To Improve Conversions At Airbnb
【速读】:该论文旨在解决在线市场中搜索筛选器(search filters)对预订转化率(booking conversions)影响机制不明确的问题,即尽管筛选器被设计用于提升用户转化,但其直接作用尚未在现有文献中得到充分验证。解决方案的关键在于提出一种以转化为导向的机器学习建模框架,通过推荐中间工具——即搜索筛选器——来直接优化下端转化目标(booking),并成功将其落地为 Airbnb 的筛选推荐系统。该框架解决了冷启动和实时服务等工程挑战,在多个用户界面部署后通过线上 A/B 测试验证了其对预订转化率的显著提升效果。
链接: https://arxiv.org/abs/2602.23717
作者: Hao Li,Kedar Bellare,Siyu Yang,Sherry Chen,Liwei He,Stephanie Moyerman,Sanjeev Katariya
机构: Airbnb, Inc.(Airbnb公司)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Airbnb, a two-sided online marketplace connecting guests and hosts, offers a diverse and unique inventory of accommodations, experiences, and services. Search filters play an important role in helping guests navigate this variety by refining search results to align with their needs. Yet, while search filters are designed to facilitate conversions in online marketplaces, their direct impact on driving conversions remains underexplored in the existing literature. This paper bridges this gap by presenting a novel application of machine learning techniques to recommend search filters aimed at improving booking conversions. We introduce a modeling framework that directly targets lower-funnel conversions (bookings) by recommending intermediate tools, i.e. search filters. Leveraging the framework, we designed and built the filter recommendation system at Airbnb from the ground up, addressing challenges like cold start and stringent serving requirements. The filter recommendation system we developed has been successfully deployed at Airbnb, powering multiple user interfaces and driving incremental booking conversion lifts, as validated through online A/B testing. An ablation study further validates the effectiveness of our approach and key design choices. By focusing on conversion-oriented filter recommendations, our work ensures that search filters serve their ultimate purpose at Airbnb - helping guests find and book their ideal accommodations.
[IR-14] FuXi-Linear: Unleashing the Power of Linear Attention in Long-term Time-aware Sequential Recommendation
【速读】:该论文旨在解决现代推荐系统中基于注意力机制的模型因二次复杂度导致难以处理长用户序列且推理速度慢的问题,同时针对现有线性注意力方法在时间信号建模、位置信息表达和长序列性能方面的不足进行改进。其解决方案的关键在于提出FuXi-Linear模型,包含两个核心组件:一是时序保留通道(Temporal Retention Channel),通过独立计算周期性注意力权重来分离时间信号与语义信号,避免相互干扰并捕捉行为周期性;二是线性位置通道(Linear Positional Channel),利用可学习核函数在保持线性复杂度的前提下引入充分的位置信息。此外,该模型展现出在数千长度序列上的稳健幂律缩放特性,显著优于现有线性推荐方法,在推荐质量与推理效率上均取得突破。
链接: https://arxiv.org/abs/2602.23671
作者: Yufei Ye,Wei Guo,Hao Wang,Luankang Zhang,Heng Chang,Hong Zhu,Yuyang Ye,Yong Liu,Defu Lian,Enhong Chen
机构: University of Science and Technology of China (中国科学技术大学); Huawei Technologies (华为技术有限公司)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Modern recommendation systems primarily rely on attention mechanisms with quadratic complexity, which limits their ability to handle long user sequences and slows down inference. While linear attention is a promising alternative, existing research faces three critical challenges: (1) temporal signals are often overlooked or integrated via naive coupling that causes mutual interference between temporal and semantic signals while neglecting behavioral periodicity; (2) insufficient positional information provided by existing linear frameworks; and (3) a primary focus on short sequences and shallow architectures. To address these issues, we propose FuXi-Linear, a linear-complexity model designed for efficient long-sequence recommendation. Our approach introduces two key components: (1) a Temporal Retention Channel that independently computes periodic attention weights using temporal data, preventing crosstalk between temporal and semantic signals; (2) a Linear Positional Channel that integrates positional information through learnable kernels within linear complexity. Moreover, we demonstrate that FuXi-Linear exhibits a robust power-law scaling property at a thousand-length scale, a characteristic largely unexplored in prior linear recommendation studies. Extensive experiments on sequences of several thousand tokens demonstrate that FuXi-Linear outperforms state-of-the-art models in recommendation quality, while achieving up to 10 \times speedup in the prefill stage and up to 21 \times speedup in the decode stage compared to competitive baselines. Our code has been released in a public repository this https URL.
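线性注意力的核心是把 softmax(QKᵀ)V 的二次计算替换为核特征映射下的前缀累加。下面给出因果线性注意力的通用示意:特征映射取 exp 为本文假设的选择,FuXi-Linear 的时序保留通道与可学习位置核不在此示意范围内。

```python
import math

def phi(v):
    # 正值特征映射(此处取逐元素 exp,保证注意力权重非负;为假设的选择)
    return [math.exp(x) for x in v]

def linear_attention(qs, ks, vs):
    """因果线性注意力:维护 S = Σ φ(k)·vᵀ 与归一化向量 z = Σ φ(k),
    每一步输出 φ(q)ᵀS / φ(q)ᵀz,总复杂度对序列长度线性。"""
    d, dv = len(qs[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    out = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for i in range(d):        # O(d·dv) 的状态更新,与历史长度无关
            z[i] += fk[i]
            for j in range(dv):
                S[i][j] += fk[i] * v[j]
        denom = sum(fq[i] * z[i] for i in range(d))
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom for j in range(dv)])
    return out
```

由于状态 (S, z) 大小固定,解码阶段每步成本恒定,这正是此类方法在长序列上取得吞吐优势的原因。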
[IR-15] Geodesic Semantic Search: Learning Local Riemannian Metrics for Citation Graph Retrieval
【速读】:该论文旨在解决传统基于嵌入的检索方法在文献语义搜索中因依赖固定欧氏距离而难以捕捉复杂语义关系的问题。其解决方案的关键在于提出一种测地语义搜索(Geodesic Semantic Search, GSS),通过在引文图上为每个节点学习特定的黎曼度量(Riemannian metrics),从而实现几何感知的语义检索。具体而言,GSS 在每个节点学习一个低秩度量张量 L_i ∈ R^{d×r},构造局部正半定度量 G_i = L_i L_i^T + εI,确保度量有效性的同时保持模型可计算性;检索过程采用多源 Dijkstra 算法计算测地距离,并结合最大边际相关性重排序与路径一致性过滤,显著提升了召回率(在 16.9 万篇论文数据集上 Recall@20 相对提升 23%),同时提供可解释的引用路径。
链接: https://arxiv.org/abs/2602.23665
作者: Brandon Yee,Lucas Wang,Kundana Kommini,Krishna Sharma
机构: Management Sciences Lab, Yee Collins Research Groups; Hoover Institute, Stanford University
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:
Abstract:We present Geodesic Semantic Search (GSS), a retrieval system that learns node-specific Riemannian metrics on citation graphs to enable geometry-aware semantic search. Unlike standard embedding-based retrieval that relies on fixed Euclidean distances, GSS learns a low-rank metric tensor L_i ∈ R^{d×r} at each node, inducing a local positive semi-definite metric G_i = L_i L_i^T + εI. This parameterization guarantees valid metrics while keeping the model tractable. Retrieval proceeds via multi-source Dijkstra on the learned geodesic distances, followed by Maximal Marginal Relevance reranking and path coherence filtering. On citation prediction benchmarks with 169K papers, GSS achieves 23% relative improvement in Recall@20 over SPECTER+FAISS baselines while providing interpretable citation paths. Our hierarchical coarse-to-fine search with k-means pooling reduces computational cost by 4× compared to flat geodesic search while maintaining 97% retrieval quality. We provide theoretical analysis of when geodesic distances outperform direct similarity, characterize the approximation quality of low-rank metrics, and validate predictions empirically. Code and trained models are available at this https URL.
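检索阶段的多源 Dijkstra 可示意如下:将全部种子节点(与查询匹配的论文)以距离 0 入堆,在学习度量诱导的边权上求测地距离,再按距离升序取前 k 个候选。边权在此作为给定常数,MMR 重排与路径一致性过滤不在示意范围内。

```python
import heapq

def multi_source_dijkstra(graph, sources):
    """多源 Dijkstra:所有种子节点以距离 0 初始化,
    返回每个可达节点到最近种子的测地距离。graph: {u: [(v, w), ...]}"""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # 过期堆项,跳过
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def retrieve(graph, sources, k):
    # 检索:按测地距离升序取前 k 个非种子节点
    dist = multi_source_dijkstra(graph, sources)
    cands = [(d, v) for v, d in dist.items() if v not in sources]
    return [v for _, v in sorted(cands)[:k]]
```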
[IR-16] Learning to Reflect and Correct: Towards Better Decoding Trajectories for Large-Scale Generative Recommendation
【速读】:该论文针对生成式推荐(Generative Recommendation, GR)中因单次解码缺乏显式修正而导致早期偏差累积、最终降低推荐质量的问题提出了解决方案。其关键在于设计了一个结构化的反思-修正框架(Generation-Reflection-Correction, GRC),将标准解码过程扩展为三阶段流程:初始草稿生成、多粒度反思(multi-granular reflection)和基于反思引导的修正(reflection-guided correction),从而在语义标记空间内实现结构化的反思与修正机制。此外,通过基于GRPO的强化学习优化整个GRC轨迹,并引入熵引导的反思调度策略(Entropy-Guided Reflection Scheduling, EGRS)动态分配修正预算,显著提升了推荐效果与在线服务效率。
链接: https://arxiv.org/abs/2602.23639
作者: Haibo Xing,Hao Deng,Lingyu Mu,Jinxin Hu,Yu Zhang,Xiaoyi Zeng,Jing Zhang
机构: Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团); School of Computer Science, Wuhan University(武汉大学计算机学院)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Generative Recommendation (GR) has become a promising paradigm for large-scale recommendation systems. However, existing GR models typically perform single-pass decoding without explicit refinement, causing early deviations to accumulate and ultimately degrade recommendation quality. To tackle this problem, we propose GRC, which is, to our knowledge, the first structured reflection-correction framework for GR that extends standard decoding into a Generation-Reflection-Correction (GRC) process. Concretely, GRC introduces a supervised reflection-correction template that decomposes the decoding process into initial draft generation, multi-granular reflection, and reflection-guided correction, thereby enabling structured reflection and correction in the semantic token space. To further explore the enlarged refinement space introduced by the GRC process, we optimize the entire GRC trajectory with GRPO-based reinforcement learning, under a carefully designed reward function with token-level and trajectory-level signals. For efficient online serving, we propose an Entropy-Guided Reflection Scheduling (EGRS) strategy that dynamically allocates more correction budget to high-uncertainty decoding trajectories during beam search. Extensive experiments on real-world datasets show that GRC consistently outperforms six state-of-the-art baselines by up to 15.74%, and online A/B tests demonstrate its substantial practical value in large-scale industrial recommendation, delivering a 1.79% lift in advertising revenue with only modest latency overhead.
[IR-17] Synthetic Data Powers Product Retrieval for Long-tail Knowledge-Intensive Queries in E-commerce Search
【速读】:该论文旨在解决电商搜索中长尾知识密集型查询(long-tail, knowledge-intensive queries)的检索难题,这类查询因语言模式多样、缺乏明确购买意图、需领域知识推理且行为日志稀缺,导致现有检索系统性能受限。解决方案的关键在于提出一种高效的合成数据框架,通过隐式地将强大离线查询重写模型的能力蒸馏至在线检索系统:利用大语言模型(LLM)训练多候选查询重写模型,并结合多种奖励信号,在精心构建的查询-商品对上捕捉其重写能力,从而缓解重写后查询分布偏移问题,提升召回率并减少无关商品引入。实验表明,仅通过引入此类合成数据进行训练即可显著改善检索效果,线上A/B测试验证了用户体验的明显提升。
链接: https://arxiv.org/abs/2602.23620
作者: Gui Ling,Weiyuan Li,Yue Jiang,Wenjun Peng,Xingxian Liu,Dongshuai Li,Fuyu Lv,Dan Ou,Haihong Tang
机构: Taobao Tmall Group of Alibaba (淘宝天猫集团); Alibaba (阿里巴巴)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Product retrieval is the backbone of e-commerce search: for each user query, it identifies a high-recall candidate set from billions of items, laying the foundation for high-quality ranking and user experience. Despite extensive optimization for mainstream queries, existing systems still struggle with long-tail queries, especially knowledge-intensive ones. These queries exhibit diverse linguistic patterns, often lack explicit purchase intent, and require domain-specific knowledge reasoning for accurate interpretation. They also suffer from a shortage of reliable behavioral logs, which makes such queries a persistent challenge for retrieval optimization. To address these issues, we propose an efficient data synthesis framework tailored to retrieval involving long-tail, knowledge-intensive queries. The key idea is to implicitly distill the capabilities of a powerful offline query-rewriting model into an efficient online retrieval system. Leveraging the strong language understanding of LLMs, we train a multi-candidate query rewriting model with multiple reward signals and capture its rewriting capability in well-curated query-product pairs through a powerful offline retrieval pipeline. This design mitigates distributional shift in rewritten queries, which might otherwise limit incremental recall or introduce irrelevant products. Experiments demonstrate that without any additional tricks, simply incorporating this synthetic data into retrieval model training leads to significant improvements. Online Side-By-Side (SBS) human evaluation results indicate a notable enhancement in user search experience.
[IR-18] LFQA-HP-1M: A Large-Scale Human Preference Dataset for Long-Form Question Answering LREC2026
【速读】:该论文旨在解决长文本问答(Long-form Question Answering, LFQA)中评估指标难以准确反映人类判断的问题。现有评价方法往往无法充分捕捉多句解释性回答的细微差异,导致模型性能评估失真。解决方案的关键在于构建了一个大规模的人工偏好标注数据集LFQA-HP-1M(包含130万条成对比较标注),并提出九项可解释的评分维度(rubrics)用于答案质量评估;基于这些特征训练的简单线性模型表现可媲美当前最先进的大语言模型(Large Language Model, LLM)评估器,同时揭示了LLM评估器在传递一致性、位置偏倚和冗余性偏倚方面的脆弱性及其对对抗扰动的敏感性,从而为透明且可靠的LFQA评估提供了新范式。
链接: https://arxiv.org/abs/2602.23603
作者: Rafid Ishrak Jahan,Fahmid Shahriar Iqbal,Sagnik Ray Choudhury
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: LREC 2026 Accepted. this https URL
Abstract:Long-form question answering (LFQA) demands nuanced evaluation of multi-sentence explanatory responses, yet existing metrics often fail to reflect human judgment. We present LFQA-HP-1M, a large-scale dataset comprising 1.3M human pairwise preference annotations for LFQA. We propose nine rubrics for answer quality evaluation, and show that simple linear models based on these features perform comparably to state-of-the-art LLM evaluators. We further examine transitivity consistency, positional bias, and verbosity biases in LLM evaluators and demonstrate their vulnerability to adversarial perturbations. Overall, this work provides one of the largest public LFQA preference datasets and a rubric-driven framework for transparent and reliable evaluation.
[IR-19] Unified Learning-to-Rank for Multi-Channel Retrieval in Large-Scale E-Commerce Search
【速读】:该论文旨在解决多通道检索(multi-channel retrieval)系统中如何高效融合来自异构候选源的文档,以在严格延迟约束下优化用户转化率(conversion)的问题。传统基于排名的融合方法(如RRF和加权交错)依赖固定的全局通道权重,无法捕捉查询相关的通道效用差异及跨通道交互关系。其解决方案的关键在于将多通道融合建模为一个通道感知的学习排序(channel-aware learning-to-rank)任务,通过联合优化点击、加购与购买等业务指标,并引入近期用户行为信号以捕获短期意图变化,从而实现动态且个性化的文档融合与排序。
链接: https://arxiv.org/abs/2602.23530
作者: Aditya Gaydhani,Guangyue Xu,Dhanush Kamath,Ankit Singh,Alex Li
机构: Target Corporation(目标公司)
类目: Information Retrieval (cs.IR)
备注:
Abstract:Large-scale e-commerce search must surface a broad set of items from a vast catalog, ranging from bestselling products to new, trending, or seasonal items. Modern systems therefore rely on multiple specialized retrieval channels to surface products, each designed to satisfy a specific objective. A key challenge is how to effectively merge documents from these heterogeneous channels into a single ranked list under strict latency constraints while optimizing for business KPIs such as user conversion. Rank-based fusion methods such as Reciprocal Rank Fusion (RRF) and Weighted Interleaving rely on fixed global channel weights and treat channels independently, failing to account for query-specific channel utility and cross-channel interactions. We observe that multi-channel fusion can be reformulated as a query-dependent learning-to-rank problem over heterogeneous candidate sources. In this paper, we propose a unified ranking model that learns to merge and rank documents from multiple retrieval channels. We formulate the problem as a channel-aware learning-to-rank task that jointly optimizes clicks, add-to-carts, and purchases while incorporating channel-specific objectives. We further incorporate recent user behavioral signals to capture short-term intent shifts that are critical for improving conversion in multi-channel ranking. Our online A/B experiments show that the proposed approach outperforms rank-based fusion methods, leading to a +2.85% improvement in user conversion. The model satisfies production latency requirements, achieving a p95 latency of under 50 ms, and is deployed on this http URL.
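作为被对比的基线,RRF 用固定公式 score(d) = Σ_通道 1/(k + rank_d) 合并各通道排名,其极简实现如下(k=60 为文献中的惯例取值)。论文的出发点正是这种全局固定权重无法随查询自适应、也不考虑跨通道交互。

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion:对每个文档累加各通道的 1/(k + rank),
    再按融合得分降序排列。rankings: 每个通道一个已排序的文档列表。"""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

两个通道中都靠前的文档(如下例中的 "b")会被排到任一单通道的第一名之前,这是 RRF 的典型行为。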
[IR-20] Cross-Representation Knowledge Transfer for Improved Sequential Recommendations
【速读】:该论文旨在解决当前顺序推荐系统中Transformer模型仅能孤立处理序列元素、难以显式建模用户交互间复杂关系,而图神经网络(Graph Neural Networks, GNNs)虽能显式刻画高阶交互却难以有效捕捉其随时间演化的局限性。解决方案的关键在于提出一种融合Transformer与GNN的新框架,通过同步编码交互图中的结构依赖关系并追踪其动态变化,实现对下一交互项的精准预测。该方法在多个公开数据集上显著优于纯顺序或纯图方法,以及现有结合两类信号的先进模型。
链接: https://arxiv.org/abs/2602.23471
作者: Artur Gimranov,Viacheslav Yusupov,Elfat Sabitov,Tatyana Matveeva,Anton Lysenko,Ruslan Israfilov,Evgeny Frolov
机构: HSE University (高等经济大学); SB AI Lab; AXXX
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Transformer architectures, capable of capturing sequential dependencies in the history of user interactions, have become the dominant approach in sequential recommender systems. Despite their success, such models consider sequence elements in isolation, implicitly accounting for the complex relationships between them. Graph neural networks, in contrast, explicitly model these relationships through higher order interactions but are often unable to adequately capture their evolution over time, limiting their use for predicting the next interaction. To fill this gap, we present a new framework that combines transformers and graph neural networks and aligns different representations for solving next-item prediction task. Our solution simultaneously encodes structural dependencies in the interaction graph and tracks their dynamic change. Experimental results on a number of open datasets demonstrate that the proposed framework consistently outperforms both pure sequential and graph approaches in terms of recommendation quality, as well as recent methods that combine both types of signals.
[IR-21] Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在通过强化学习训练时,与搜索引擎协同推理过程中存在的信用分配问题(credit assignment problem):现有方法如Search-R1仅在多步轨迹结束后提供稀疏的最终奖励,难以将成功或失败归因于具体的推理和检索决策;而过程奖励方法如StepSearch虽引入了步骤级监督,但依赖启发式奖励(如TF-IDF重叠度),且仍需对每个样本采样k条完整轨迹,导致梯度方差较高。解决方案的关键在于提出SLATE框架,其核心创新为两个互补机制:一是截断步骤级采样(truncated step-level sampling),生成共享前缀、仅在下一步不同的k条轨迹,显著降低优势估计的方差;二是密集型LLM作为评判者奖励(dense LLM-as-judge rewards),用高性能LLM替代启发式评分器,对每一步的推理、搜索查询和答案质量进行精细化评估,提供更丰富可靠的监督信号。理论证明表明,在相同密集奖励结构下,截断采样可使T步轨迹的优势估计方差降低至全轨迹采样的1/T,从而获得更低方差、更精准的策略梯度。
链接: https://arxiv.org/abs/2602.23440
作者: Chris Samarinas,Haw-Shiuan Chang,Hamed Zamani
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
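截断步骤级采样下,k 条候选共享同一前缀、仅下一步不同,因此可对评判者给出的步骤奖励做 GRPO 风格的组内相对优势归一化;差异只来自该步本身,这正是方差得以降低的直观来源。函数名与实现细节为本文假设的简化。

```python
import math

def group_relative_advantages(rewards):
    """同一前缀下 k 个候选步骤的组内相对优势:
    A_i = (r_i - mean) / std(奖励全部相同时返回全 0)。"""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0  # 避免除零
    return [(r - mean) / std for r in rewards]
```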
[IR-22] Higress-RAG: A Holistic Optimization Framework for Enterprise Retrieval-Augmented Generation via Dual Hybrid Retrieval, Adaptive Routing and CRAG
【速读】:该论文旨在解决企业级生成式 AI 应用中 Retrieval-Augmented Generation (RAG) 系统从原型到生产落地所面临的三大核心挑战:复杂查询下的低检索精度、生成阶段高频率的幻觉(hallucination)以及实时应用场景中不可接受的延迟。其解决方案的关键在于提出一种“全链路优化”(Full-Link Optimization)策略,通过基于 Model Context Protocol (MCP) 的分层架构实现端到端性能提升,具体包括自适应路由、语义缓存、混合检索和纠错型 RAG(Corrective RAG, CRAG)等模块,并创新性地引入 Higress-Native Splitter 进行结构感知的数据处理、Reciprocal Rank Fusion (RRF) 融合稠密与稀疏检索信号,以及具备动态阈值的 50ms 级语义缓存机制,从而系统性优化从查询重写到后处理校验的完整检索生命周期。
链接: https://arxiv.org/abs/2602.23374
作者: Weixi Lin
机构: Northwestern Polytechnical University (西北工业大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 5 figures, our submissions are not yet published
Abstract:The integration of Large Language Models (LLMs) into enterprise knowledge management systems has been catalyzed by the Retrieval-Augmented Generation (RAG) paradigm, which augments parametric memory with non-parametric external data. However, the transition from proof-of-concept to production-grade RAG systems is hindered by three persistent challenges: low retrieval precision for complex queries, high rates of hallucination in the generation phase, and unacceptable latency for real-time applications. This paper presents a comprehensive analysis of the Higress RAG MCP Server, a novel, enterprise-centric architecture designed to resolve these bottlenecks through a “Full-Link Optimization” strategy. Built upon the Model Context Protocol (MCP), the system introduces a layered architecture that orchestrates a sophisticated pipeline of Adaptive Routing, Semantic Caching, Hybrid Retrieval, and Corrective RAG (CRAG). We detail the technical implementation of key innovations, including the Higress-Native Splitter for structure-aware data ingestion, the application of Reciprocal Rank Fusion (RRF) for merging dense and sparse retrieval signals, and a 50ms-latency Semantic Caching mechanism with dynamic thresholding. Experimental evaluations on domain-specific Higress technical documentation and blogs verify the system’s architectural robustness. The results demonstrate that by optimizing the entire retrieval lifecycle - from pre-retrieval query rewriting to post-retrieval corrective evaluation - the Higress RAG system offers a scalable, hallucination-resistant solution for enterprise AI deployment.
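语义缓存的命中逻辑可示意如下:查询嵌入与缓存键做余弦相似度,超过阈值即直接返回缓存答案、跳过整条检索-生成链路。嵌入来源、动态阈值策略与 50 ms 延迟指标均依赖 Higress 的实际实现,此处为假设性简化。

```python
import math

def _cos(u, v):
    # 余弦相似度(假定输入均为非零向量)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class SemanticCache:
    """极简语义缓存:线性扫描缓存键,取相似度最高且不低于阈值的答案;
    未命中返回 None,由调用方回退到完整 RAG 流程。"""
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.store = []  # [(embedding, answer)]

    def get(self, emb):
        best, best_sim = None, self.threshold
        for key, answer in self.store:
            sim = _cos(emb, key)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best

    def put(self, emb, answer):
        self.store.append((emb, answer))
```

生产实现通常会用向量索引替换线性扫描,并根据查询分布动态调整阈值,这正是文中"dynamic thresholding"所指的方向。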
[IR-23] An Agentic LLM Framework for Adverse Media Screening in AML Compliance
【速读】:该论文旨在解决金融行业中反洗钱(AML)和尽职调查(KYC)流程中负面媒体筛查(adverse media screening)效率低、误报率高的问题。传统基于关键词的筛查方法易产生大量假阳性结果,或依赖人工逐条审核,成本高且耗时长。解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)与检索增强生成(Retrieval-Augmented Generation, RAG)技术的智能代理系统(agentic system),通过多步骤自动化流程实现对目标个体的网络信息检索、文档处理及风险评分——即计算“负面媒体指数”(Adverse Media Index, AMI)得分,从而有效区分高风险与低风险个体,显著提升筛查准确性与效率。
链接: https://arxiv.org/abs/2602.23373
作者: Pavel Chernakov,Sasan Jafarnejad,Raphaël Frank
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Adverse media screening is a critical component of anti-money laundering (AML) and know-your-customer (KYC) compliance processes in financial institutions. Traditional approaches rely on keyword-based searches that generate high false-positive rates or require extensive manual review. We present an agentic system that leverages Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) to automate adverse media screening. Our system implements a multi-step approach where an LLM agent searches the web, retrieves and processes relevant documents, and computes an Adverse Media Index (AMI) score for each subject. We evaluate our approach using multiple LLM backends on a dataset comprising Politically Exposed Persons (PEPs), persons from regulatory watchlists, and sanctioned persons from OpenSanctions and clean names from academic sources, demonstrating the system’s ability to distinguish between high-risk and low-risk individuals.
[IR-24] Democratizing GraphRAG: Linear CPU-Only Graph Retrieval for Multi-Hop QA
【速读】:该论文旨在解决当前GraphRAG系统在多跳检索(multi-hop retrieval)中依赖昂贵的大型语言模型(Large Language Model, LLM)进行图结构构建以及GPU密集型推理的问题,从而限制了其可扩展性和普及性。解决方案的关键在于提出SPRIG(Seeded Propagation for Retrieval In Graphs),一个完全基于CPU、线性时间复杂度且无需token消耗的GraphRAG流程:它用轻量级命名实体识别(Named Entity Recognition, NER)驱动的共现图替代LLM构建图结构,并采用个性化PageRank(Personalized PageRank, PPR)算法实现高效检索,在保持Recall@10几乎不变的前提下,将性能提升28%。此方案明确了在何种场景下CPU友好的图检索能有效提升多跳召回率,而在其他情况下仅需强词汇混合方法(如RRF)即可满足需求,为降低GraphRAG部署门槛提供了切实可行的技术路径。
链接: https://arxiv.org/abs/2602.23372
作者: Qizhi Wang
机构: PingCAP, Data AI-Innovation Lab (PingCAP 数据人工智能创新实验室)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 14 figures, 26 tables
Abstract:GraphRAG systems improve multi-hop retrieval by modeling structure, but many approaches rely on expensive LLM-based graph construction and GPU-heavy inference. We present SPRIG (Seeded Propagation for Retrieval In Graphs), a CPU-only, linear-time, token-free GraphRAG pipeline that replaces LLM graph building with lightweight NER-driven co-occurrence graphs and uses Personalized PageRank (PPR) for retrieval, improving performance by 28% with negligible Recall@10 changes. The results characterize when CPU-friendly graph retrieval helps multi-hop recall and when strong lexical hybrids (RRF) are sufficient, outlining a realistic path to democratizing GraphRAG without token costs or GPU requirements.
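SPRIG 使用的个性化 PageRank(PPR)以查询实体为种子,在共现图上传播权重。下面是一个基于幂迭代的最小示意实现(邻接矩阵、阻尼系数等均为通用假设,并非论文原始代码):

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """在稠密邻接矩阵上做幂迭代的 Personalized PageRank。

    adj[i, j] = 1 表示节点 i 指向节点 j;seeds 为接收
    "传送"概率质量的种子节点下标(即查询命中的实体)。
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # 行随机转移矩阵;出度为 0 的悬挂节点保留自环
    trans = np.divide(adj, out_deg, out=np.eye(n), where=out_deg > 0)
    p = np.zeros(n)
    p[seeds] = 1.0 / len(seeds)  # 个性化(传送)向量
    r = p.copy()
    for _ in range(iters):
        r = alpha * p + (1 - alpha) * (trans.T @ r)
    return r

# 示例:三节点环 0 -> 1 -> 2 -> 0,以节点 0 为种子
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
scores = personalized_pagerank(adj, seeds=[0])
```

得分向量始终是概率分布,离种子越近的节点权重越高,检索时即按该权重取出相关片段。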
[IR-25] Domain-Partitioned Hybrid RAG for Legal Reasoning: Toward Modular and Explainable Legal AI for India
【速读】:该论文针对印度法律研究中传统关键词检索或仅依赖嵌入向量的检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理多跳推理、引用链传递和跨领域依赖等复杂法律逻辑时表现不佳的问题,提出了一种面向印度法律场景的域划分混合RAG与知识图谱架构。其解决方案的关键在于:构建三个分别专用于最高法院判例、成文法与宪法文本、印度刑法典(Indian Penal Code, IPC)的域特定RAG管道,并结合基于Neo4j的知识图谱(Knowledge Graph),以结构化方式建模案例、法条、法官及引用之间的关系;同时引入一个由大语言模型(LLM)驱动的代理协调器(agentic orchestrator),动态调度查询路径并融合证据生成具备引文意识且语义连贯的回答,从而显著提升法律推理的完整性与准确性。
链接: https://arxiv.org/abs/2602.23371
作者: Rakshita Goel,S Pranav Kumar,Anmol Agrawal,Divyan Poddar,Pratik Narang,Dhruv Kumar
机构: Birla Institute of Technology and Science, Pilani (比尔拉理工学院与科学学院,皮拉尼校区)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Legal research in India involves navigating long and heterogeneous documents spanning statutes, constitutional provisions, penal codes, and judicial precedents, where purely keyword-based or embedding-only retrieval systems often fail to support structured legal reasoning. Recent retrieval augmented generation (RAG) approaches improve grounding but struggle with multi-hop reasoning, citation chaining, and cross-domain dependencies inherent to legal texts. We propose a domain partitioned hybrid RAG and Knowledge Graph architecture designed specifically for Indian legal research. The system integrates three specialized RAG pipelines covering Supreme Court case law, statutory and constitutional texts, and the Indian Penal Code, each optimized for domain specific retrieval. To enable relational reasoning beyond semantic similarity, we construct a Neo4j based Legal Knowledge Graph capturing structured relationships among cases, statutes, IPC sections, judges, and citations. An LLM driven agentic orchestrator dynamically routes queries across retrieval modules and the knowledge graph, fusing evidence into grounded and citation aware responses. We evaluate the system using a 40 question synthetic legal question answer benchmark curated from authoritative Indian legal sources and assessed via an LLM as a Judge framework. Results show that the hybrid architecture achieves a 70 percent pass rate, substantially outperforming a RAG only baseline at 37.5 percent, with marked improvements in completeness and legal reasoning quality. These findings demonstrate that combining domain partitioned retrieval with structured relational knowledge provides a scalable and interpretable foundation for advanced legal AI systems in the Indian judicial context. 
[IR-26] Toward General Semantic Chunking: A Discriminative Framework for Ultra-Long Documents
【速读】:该论文旨在解决超长文档(ultra-long documents)中主题分割(topic segmentation)的难题,现有方法在处理此类文本时存在明显不足:传统判别式模型受限于固定窗口,难以建模文档级语义;而生成式大语言模型虽能输出段落边界,但推理成本高且难以支持长输入。解决方案的关键在于提出一种基于Qwen3-0.6B的判别式分割模型,其核心创新包括引入跨窗口上下文融合层(cross-window context fusion layer)与边界分类头(boundary classification head),并结合重叠滑动窗口策略,使模型支持单次输入高达13k tokens,可扩展至超长文档的段落边界检测;同时设计了一种带标量校正的向量融合方法,将超长段落表示压缩为单一向量而不损失语义信息,显著提升下游检索效率。
链接: https://arxiv.org/abs/2602.23370
作者: Kaifeng Wu,Junyan Wu,Qiang Liu,Jiarui Zhang,Wen Xu
机构: KingSoft Office Zhuiguang AI Lab
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Long-document topic segmentation plays an important role in information retrieval and document understanding, yet existing methods still show clear shortcomings in ultra-long text settings. Traditional discriminative models are constrained by fixed windows and cannot model document-level semantics; generative large language models can output paragraph boundaries, but inference is expensive and long inputs are difficult to support. To address these issues, we propose a discriminative segmentation model based on Qwen3-0.6B. On top of the backbone network, we add a cross-window context fusion layer and a boundary classification head, and combine them with an overlapping sliding-window strategy. Our model supports single-pass inputs of up to 13k tokens and can be extended to ultra-long documents for paragraph boundary detection. To further enhance downstream retrieval efficiency, we derive a vector fusion method with scalar correction, which compresses the representation of ultra-long segments into a single vector without semantic loss. Experiments on the Wikipedia long-document topic segmentation dataset WIKI-727K show that, compared with three generative models based on Qwen2-0.5B released by Jina, our method achieves a better macro-averaged F1 and delivers two orders of magnitude faster inference, substantially improving the practicality and scalability of long-document processing.
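摘要中的重叠滑动窗口策略可用如下示意代码说明其切分逻辑(窗口与重叠长度为假设参数;论文中模型单次输入最长支持 13k tokens,实际取值需按模型调整):

```python
def sliding_windows(tokens, window=13000, overlap=1000):
    """将超长 token 序列切成重叠窗口。

    相邻窗口重叠 overlap 个 token,使每个候选段落边界
    至少落在某个窗口的内部,便于跨窗口上下文融合层
    在边界分类时看到两侧的语境。
    """
    step = window - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + window])
    return chunks

# 小规模示例:25 个 token,窗口 10,重叠 4
chunks = sliding_windows(list(range(25)), window=10, overlap=4)
```

各窗口独立过一次模型,再按位置对重叠区域的边界预测做融合,即可覆盖任意长度的文档。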
[IR-27] Reason to Contrast: A Cascaded Multimodal Retrieval Framework
【速读】:该论文旨在解决传统多模态检索系统(multimodal retrieval systems)中性能受限于嵌入维度(embedding dimensionality)的问题,以及如何在不依赖模型规模或嵌入维度扩展的前提下提升检索准确性。其解决方案的关键在于提出TTE-v2——一种基于推理驱动的混合多模态检索框架,通过引入额外的推理步骤进行重排序(reranking),从而在测试时利用增加的输入token预算实现性能提升;该机制不仅增强了查询与候选项之间的细粒度交互,还通过硬负样本挖掘和假负例过滤提供细粒度监督信号,形成反馈循环以强化上游检索器,实现了基于token级扩展(token-wise scaling)的新一代可扩展性范式。
链接: https://arxiv.org/abs/2602.23369
作者: Xuanming Cui,Hong-You Chen,Hao Yu,Hao Yuan,Zihao Wang,Shlok Kumar Mishra,Hanchao Yu,Yonghuan Yang,Jun Xiao,Ser-Nam Lim,Jianpeng Cheng,Qi Guo,Xiangjun Fan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Traditional multimodal retrieval systems rely primarily on bi-encoder architectures, where performance is closely tied to embedding dimensionality. Recent work, Think-Then-Embed (TTE), shows that incorporating multimodal reasoning to elicit additional informative tokens before embedding can further improve retrieval. In this paper, we extend this paradigm with TTE-v2, a hybrid multimodal retrieval framework that introduces reasoning-driven performance scaling based on additional input token budget rather than model or embedding size. Our approach augments the initial multimodal retrieval with additional reasoning steps for reranking, enabling more expressive query-candidate interactions at test time. The reranking stage further provides fine-grained supervision for hard negative mining and false negative filtering, creating a feedback loop that effectively strengthens the upstream retriever. This cascaded design delivers substantial test-time improvements based on intermediate reasoning token scaling. Experiments on the MMEB-V2 benchmark demonstrate that TTE-v2-7B achieves a new state-of-the-art accuracy of 75.7%, and that TTE-v2-2B matches or surpasses leading 7B models trained with significantly larger external data. Our results highlight the promise of token-wise scaling as an alternative scaling paradigm for multimodal retrieval.
[IR-28] Keyword search is all you need: Achieving RAG-Level Performance without vector databases using agentic tool use
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在实际应用中面临的挑战,包括对向量数据库和语义搜索的依赖性、集成复杂度高以及成本较高等问题。其解决方案的关键在于提出一种基于代理式(agentic)关键词搜索的替代架构,即在仅使用基础关键词搜索工具的前提下,通过工具增强的大语言模型(Tool-Augmented LLM)代理框架实现高效的问答响应。实证研究表明,该方法可达到传统RAG系统90%以上的性能指标,同时具备部署简单、成本低且适用于知识库频繁更新场景的优势。
链接: https://arxiv.org/abs/2602.23368
作者: Shreyas Subramanian,Adewale Akinfaderin,Yanyan Zhang,Ishan Singh,Mani Khanuja,Sandeep Singh,Maira Ladeira Tanke
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:While Retrieval-Augmented Generation (RAG) has proven effective for generating accurate, context-based responses based on existing knowledge bases, it presents several challenges including retrieval quality dependencies, integration complexity and cost. Recent advances in agentic-RAG and tool-augmented LLM architectures have introduced alternative approaches to information retrieval and processing. We question how much additional value vector databases and semantic search bring to RAG over simple, agentic keyword search in documents for question-answering. In this study, we conducted a systematic comparison between RAG-based systems and tool-augmented LLM agents, specifically evaluating their retrieval mechanisms and response quality when the agent only has access to basic keyword search tools. Our empirical analysis demonstrates that tool-based keyword search implementations within an agentic framework can attain over 90% of the performance metrics compared to traditional RAG systems without using a standing vector database. Our approach is simple to implement, cost effective, and is particularly useful in scenarios requiring frequent updates to knowledge bases.
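文中的代理仅依赖基础关键词搜索工具。一个最简的关键词检索工具可示意如下(文档内容与计分方式均为假设示例;真实系统通常还会配合 BM25 等词频加权):

```python
import re
from collections import Counter

def keyword_search(query, documents, top_k=3):
    """玩具级关键词搜索工具:代理可调用它替代向量数据库,
    按文档中命中查询词的次数排序。"""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc_id, text in documents.items():
        tokens = Counter(re.findall(r"\w+", text.lower()))
        score = sum(tokens[t] for t in terms)
        if score:  # 只保留至少命中一个查询词的文档
            scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# 假设的文档库(内容仅作示例)
docs = {
    "rag": "retrieval augmented generation grounds answers in a vector store",
    "kw": "simple keyword search over indexed documents",
    "misc": "notes about cooking recipes",
}
hits = keyword_search("keyword search documents", docs)
```

代理框架中,LLM 多轮调用此类工具并自行改写查询词,即可逼近传统 RAG 的效果而无需常驻向量数据库。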
[IR-29] HumanMCP: A Human-Like Query Dataset for Evaluating MCP Tool Retrieval Performance
【速读】:该论文旨在解决当前评估模型上下文协议(Model Context Protocol, MCP)服务器工具使用能力时缺乏真实、类人用户查询的问题,现有数据集虽包含工具描述但无法反映不同用户对任务请求的多样化表达方式,导致基准测试结果泛化性差且可靠性被高估。解决方案的关键在于构建首个大规模MCP数据集,通过为分布在308个MCP服务器上的2800个工具匹配多个独特用户角色(user personas),生成多样且高质量的用户查询,从而覆盖从明确任务指令到模糊探索性命令的不同意图层级,更真实地模拟现实交互模式,提升对MCP生态系统工具调用能力的评估效度。
链接: https://arxiv.org/abs/2602.23367
作者: Shubh Laddha,Lucas Changbencharoen,Win Kuptivej,Surya Shringla,Archana Vaidheeswaran,Yash Bhaskar
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 4 pages, 2 figures, 3 tables
Abstract:Model Context Protocol (MCP) servers contain a collection of thousands of open-source standardized tools, linking LLMs to external systems; however, existing datasets and benchmarks lack realistic, human-like user queries, leaving a critical gap in evaluating the tool usage and ecosystems of MCP servers. Existing datasets often do contain tool descriptions but fail to represent how different users portray their requests, leading to poor generalization and inflated reliability of certain benchmarks. This paper introduces the first large-scale MCP dataset featuring diverse, high-quality user queries generated specifically to match 2800 tools across 308 MCP servers, developing on the MCP Zero dataset. Each tool is paired with multiple unique user personas that we have generated, to capture varying levels of user intent ranging from precise task requests to ambiguous, exploratory commands, reflecting the complexity of real-world interaction patterns.
[IR-30] Doc To The Future: Infomorphs for Interactive Multimodal Document Transformation and Generation
【速读】:该论文旨在解决知识工作中文档合成过程的自动化与可控性问题,即如何在多源文档、跨模态信息的基础上,实现灵活、可干预的生成式内容重构。当前生成式AI虽能部分自动化文档处理流程,但对用户控制力不足,尤其在处理多模态输入输出时缺乏结构化和透明度。解决方案的关键在于提出“infomorph”概念——一种模块化、用户可调控的AI增强变换单元,支持信息在不同格式和模态间的受控合成与重组;并基于此构建了一个设计空间及其实例化工具DocuCraft,通过画布式界面可视化编排infomorph工作流,使用户能够在生成式AI各阶段(如页面提取、摘要、重格式化等)中嵌入意图与上下文,从而实现人机协同的高效、透明且可解释的信息整合与文档创作。
链接: https://arxiv.org/abs/2602.23366
作者: Balasaravanan Thoravi Kumaravel
机构: Microsoft Research (微软研究院)
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:Creating new documents by synthesizing information from existing sources is an important part of knowledge work in many domains. This process often involves gathering content from multiple documents, organizing it, and then transforming it into new forms such as reports, slides, or spreadsheets. While recent advances in Generative AI have shown potential in automating parts of this process, they often provide limited user control over the handling of multimodal inputs and outputs. In this work, we introduce the notion of “infomorphs” which are modular, user-steerable, AI-augmented transformations that support controlled synthesis, and restructuring of information across formats and modalities. We propose a design space that leverages infomorph-driven workflows to enable flexible, interactive, and multimodal document creation by combining Generative AI techniques with user intent and desired information context. As a concrete instantiation of this design space, we present DocuCraft, a canvas-based interface to visually compose infomorph workflows. DocuCraft allows users to chain together infomorphs that perform operations such as page extraction, content summarization, reformatting, and generation, leveraging Generative AI at each stage to support rich, cross-document and cross-modal transformations. We demonstrate the capabilities of DocuCraft through an example-driven usage scenario that spans different facets of common knowledge work tasks, illustrating its support for fluid, human-in-the-loop document synthesis and highlighting opportunities for more transparent and modular interaction in Generative AI-assisted information work.
[IR-31] Serendipity with Generative AI: Repurposing knowledge components during polycrisis with a Viable Systems Model approach
【速读】:该论文旨在解决组织在面临多重危机(polycrisis)不确定性时,忽视内部隐性知识的问题。其核心挑战在于如何高效地发现、分类并利用现有文档中的可复用知识组件(如模型、框架和模式),以提升组织的适应性与创新能力。解决方案的关键在于将生成式 AI (Generative AI) 视为一种“偶然性引擎”(serendipity engine)和知识转导器(knowledge transducer),通过构建一个基于Beer’s Viable System Model (VSM) 的组件仓库,实现对711个知识组件的系统化提取与组织,从而显著降低不同系统子单元间的知识转导成本,推动知识从发现到部署的快速转化,并支持组织向系统性复用导向的创新模式转型。
链接: https://arxiv.org/abs/2602.23365
作者: Gordon Fletcher,Saomai Vu Khan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Organisations face polycrisis uncertainty yet overlook embedded knowledge. We show how generative AI can operate as a serendipity engine and knowledge transducer to discover, classify and mobilise reusable components (models, frameworks, patterns) from existing documents. Using 206 papers, our pipeline extracted 711 components (approx 3.4 per paper) and organised them into a repository aligned to Beer’s Viable System Model (VSM). We contribute i) conceptually, a theory of planned serendipity in which GenAI lowers transduction costs between VSM subsystems, ii) empirically, a component repository and temporal/subject patterns, iii) managerially, a vignette and process blueprint for organisational adoption and iv) socially, pathways linking repurposing to environmental and social benefits. We propose testable links between repository creation, discovery-to-deployment time, and reuse rates, and discuss implications for shifting innovation portfolios from breakthrough bias toward systematic repurposing.
[IR-32] Wavenumber-domain signal processing for holographic MIMO: Foundations methods and future directions
【速读】:该论文旨在解决传统多输入多输出(MIMO)系统在亚波长天线间距下无法有效刻画全距离域(远场与近场)信道特性的难题,尤其是在经典离散傅里叶变换(DFT)表示失效的情况下。其解决方案的关键在于引入基于波数域(wavenumber domain)的信号处理框架,通过空间傅里叶平面波分解对H-MIMO信道进行建模,从而提供一个统一且物理一致的表征方式,以准确描述亚波长级空间相关性和球面波传播特性。
链接: https://arxiv.org/abs/2602.17705
作者: Zijian Zhang,Linglong Dai
机构: Tsinghua University (清华大学); Beijing National Research Center for Information Science and Technology (北京信息科学研究中心)
类目: Signal Processing (eess.SP); Information Retrieval (cs.IR); Information Theory (cs.IT); Systems and Control (eess.SY)
备注: Accepted by IEEE Communications Standards Magazine. 6 pages, 5 figures
Abstract:Holographic multiple-input multiple-output (H-MIMO) systems represent a paradigm shift in wireless communications by enabling quasi-continuous apertures. Unlike conventional MIMO systems, H-MIMO with subwavelength antenna spacing operates in both far-field and near-field regimes, where classical discrete Fourier transform (DFT) representations fail to sufficiently capture the channel characteristics. To address this challenge, this article provides an overview of the emerging wavenumber-domain signal processing framework. Specifically, by leveraging spatial Fourier plane-wave decomposition to model H-MIMO channels, the wavenumber domain offers a unified and physically consistent basis for characterizing subwavelength-level spatial correlation and spherical wave propagation. This article first introduces the concept of H-MIMO and the wavenumber representation of H-MIMO channels. Next, it elaborates on wavenumber-domain signal processing technologies reported in the literature, including multiplexing, channel estimation, and waveform designs. Finally, it highlights open challenges and outlines future research directions in wavenumber-domain signal processing for next-generation wireless systems.
人机交互
[HC-0] From Efficiency to Meaning: Adolescents Envisioned Role of AI in Health Management
【速读】:该论文试图解决的问题是:当前关于健康人工智能(Health AI)的研究主要聚焦于医疗提供者、照护者和成年患者,而对青少年群体在健康学习与管理中如何感知和使用AI缺乏系统认知。为填补这一研究空白,作者采用设计虚构(design fiction)与共同设计(co-design)方法,组织了7场工作坊,参与者为23名14–17岁的青少年,围绕家庭中乳糜泻(celiac disease)诊断情境探讨其对健康AI的预期角色。解决方案的关键在于通过参与式设计方法挖掘出青少年对健康AI的四类核心期待:增强健康理解与求助能力、减轻认知负担、支持家庭健康管理、以及在尊重自主性的前提下提供指导;同时识别出他们对健康AI的信任具有复杂性,并对情感支持持分化态度。这表明青少年将健康AI视为从效率向意义过渡的工具——能够释放时间用于更有价值的活动,从而为未来健康AI系统的设计指明方向:既要促进青少年自主性和反思能力,也要支撑有意义的、辩证性的健康行为。
链接: https://arxiv.org/abs/2602.24249
作者: Jamie Lee,Kyuha Jung,Cecilia Lee,Lauren MacDonnell,Jessica Kim,Daniel Otterson,Erin Newman,Emilie Chow,Yunan Chen
机构: University of California, Irvine (加州大学欧文分校)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:While prior research has focused on providers, caregivers, and adult patients, little is known about adolescents’ perceptions of AI in health learning and management. Utilizing design fiction and co-design methods, we conducted seven workshops with 23 adolescents (aged 14-17) to understand how they anticipate using health AI in the context of a family celiac diagnosis. Our findings reveal that adolescents have four main envisioned roles of health AI: enhancing health understanding and help-seeking, reducing cognitive burden, supporting family health management, and providing guidance while respecting their autonomy. We also identified nuanced trust and a divided view on emotional support from health AI. These findings suggest that adolescents perceive AI’s value as a tool that moves them from efficiency to meaning: one that creates time for valued activities. We discuss opportunities for future health AI systems to be designed to encourage adolescent autonomy and reflection, while also supporting meaningful, dialectical activities.
[HC-1] Designing AI Tutors for Interest-Based Learning: Insights from Human Instructors
【速读】:该论文试图解决的问题是如何在大规模教学场景中有效实施兴趣驱动学习(Interest-Based Learning, IBL),即如何通过AI技术实现个性化内容的高效设计与交付,从而克服传统教学中因教师资源有限而难以满足每个学生个体兴趣需求的瓶颈。解决方案的关键在于从人类教师的实际教学实践中提炼出IBL的设计与实施规律,通过对14个一对一在线辅导会话的多源数据(包括教案、对话记录、访谈和问卷)分析,识别出教师如何整合学生兴趣并解释其动机,进而为基于大语言模型(Large Language Models, LLMs)的AI助教提供可操作的设计启示,使其具备规模化实施IBL的能力。
链接: https://arxiv.org/abs/2602.24036
作者: Abhishek Kulkarni,Sharon Lynn Chu
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Interest-based learning (IBL) is a paradigm of instruction in which educational content is contextualized using learners’ interests to enhance content relevance. IBL has been shown to result in improved learning outcomes. Unfortunately, high effort is needed for instructors to design and deliver IBL content for individual students. LLMs in the form of AI tutors may allow for IBL to scale across many students. Designing an AI tutor for IBL, however, first requires an understanding of how IBL is implemented in teaching scenarios. This paper presents a study that seeks to derive this understanding from an analysis of how human instructors design and deliver IBL content. We studied 14 one-to-one online tutoring sessions (28 participants) in which tutors designed and delivered a lesson tailored to a student’s self-identified interest. Using lesson artifacts, tutoring transcripts, interviews, and questionnaires, findings include themes on how tutors integrate interests during instruction and why. Finally, actionable design implications are presented for LLM-powered AI tutors that aim to deliver IBL at scale.
[HC-2] Ask, don't tell: Reducing sycophancy in large language models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险咨询和社交场景中表现出的“谄媚行为”(sycophancy)问题,即模型倾向于迎合用户认同性回应而非进行批判性互动,这被视为一种对齐失败(alignment failure)。研究发现,用户输入的表述方式显著影响模型的谄媚倾向:非疑问句(如陈述句、信念表达)比疑问句更容易诱发谄媚行为,且随着用户表达的确定性增强(从陈述到信念再到确信),谄媚程度单调上升;此外,以第一人称视角(I-perspective)呈现内容会进一步放大该效应。基于这些发现,论文提出了一种关键的缓解策略——要求模型在作答前将非疑问句转换为疑问句,这一方法比单纯提示模型“不要谄媚”更为有效,是一种可由开发者与用户直接采用的输入层干预机制。
链接: https://arxiv.org/abs/2602.23971
作者: Magda Dubois,Cozmin Ududec,Christopher Summerfield,Lennart Luettgau
机构: UK AI Security Institute(英国人工智能安全研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Sycophancy, the tendency of large language models to favour user-affirming responses over critical engagement, has been identified as an alignment failure, particularly in high-stakes advisory and social contexts. While prior work has documented conversational features correlated with sycophancy, we lack a systematic understanding of what provokes or prevents AI sycophancy. Here, we present a set of controlled experimental studies where we first isolate how input framing influences sycophancy, and second, leverage these findings to develop mitigation strategies. In a nested factorial design, we compare questions to various non-questions where we vary three orthogonal factors: epistemic certainty (statement, belief, conviction), perspective (I- vs user-perspective), and affirmation vs negation. We show that (1) sycophancy is substantially higher in response to non-questions compared to questions. Additionally, we find that (2) sycophancy increases monotonically with epistemic certainty conveyed by the user, and (3) is amplified by I-perspective framing. Building on this, we show that asking a model to convert non-questions into questions before answering significantly reduces sycophancy. Importantly, this effect is stronger than a simple baseline prompt asking models “not to be sycophantic”. Our work offers a practical and effective input-level mitigation that both developers and users can easily adopt.
[HC-3] The Moment of Capture: How the First Seconds of a Speaker's Nonverbal and Verbal Performance Shape Audience Judgments
【速读】:该论文试图解决的问题是:为何某些演讲者能迅速吸引观众注意力,而另一些则无法建立有效连接?其核心在于揭示观众参与度(audience engagement)形成的实时机制,尤其是非语言行为(nonverbal behavior)在社会印象形成中的作用。解决方案的关键在于使用动作捕捉动画呈现纯非语言表演(nonverbal-only)与包含言语内容的非语言加言语条件(nonverbal-plus-verbal),并结合连续响应测量(continuous response measurement, CRM)技术,发现观众对演讲者的即时评价在最初10秒内即已稳定且具有预测性,且在纯非语言条件下预测速度更快、强度略高,甚至在5秒内就已显现关键信息。这一方法实现了对不同沟通渠道的科学分离,为将修辞技艺转化为可量化、可干预的社会印象形成机制提供了实证基础。
链接: https://arxiv.org/abs/2602.23920
作者: Ralf Schmälzle,Yuetong Du,Sue Lim,Gary Bente
机构: 未知
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET)
备注:
Abstract:Why do some speakers capture a room almost instantly while others fail to connect? The real-time architecture of audience engagement remains largely a black box. Here, we used motion-captured animations to present the pure nonverbal performance of public speakers to audiences - either in silence (nonverbal-only) or paired with the verbal content (nonverbal-plus-verbal). Using continuous response measurement (CRM), we find that audience judgments solidify with remarkable speed: Moment-to-moment engagement ratings become highly predictive of subsequent evaluations within the initial 10 seconds of the performance. Most notably, this predictive relationship emerged faster and slightly stronger in the nonverbal-only condition, with predictive information being present already after less than 5 seconds. These findings elucidate the social impact a speaker’s nonverbal performance has on audience impressions, even when dissociated from the verbal content of the speech. Our approach provides a high-resolution temporal map of social impression formation, pointing to an early “moment of capture” that appears to set the stage for the reception of the following message. On a broader scale, this research validates a powerful new method to isolate different communicative channels, to scientifically deconstruct rhetorical skill, and to study the pervasive impact of nonverbal behavior more broadly. It also enables us to translate the ancient art of rhetoric into a modern science of social impression formation, yielding an empirical basis that can inform human-centered feedback, develop AI-based augmentation tools, and guide the design of engaging, socially present avatars in an increasingly AI-mediated and virtual world.
[HC-4] The Topology of Recovery: Using Persistent Homology to Map Individual Mental Health Journeys in Online Communities
【速读】:该论文旨在解决传统方法难以捕捉个体在心理健康挑战中动态恢复轨迹的问题,现有社区层面的快照式分析无法揭示用户随时间演变的心理状态变化。其解决方案的关键在于引入拓扑数据分析(Topological Data Analysis, TDA)中的持久同调(persistent homology)方法,将用户的长期发帖历史建模为语义嵌入空间中的轨迹,并通过拓扑特征识别恢复模式:环状结构表示状态循环(停滞),尖峰状结构表示探索新应对策略(成长)。同时提出语义恢复速度(Semantic Recovery Velocity, SRV)作为量化指标,衡量用户从初始痛苦导向内容向更积极状态迁移的速度,实证表明该方法在预测自我报告改善方面达到78.3%准确率,显著优于情感基线模型。
链接: https://arxiv.org/abs/2602.23886
作者: Joydeep Chandra,Satyam Kumar Navneet,Yong Zhang
机构: BNRIST(清华大学网络研究院); Tsinghua University(清华大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Understanding how individuals navigate mental health challenges over time is critical yet methodologically challenging. Traditional approaches analyze community-level snapshots, failing to capture dynamic individual recovery trajectories. We introduce a novel framework applying Topological Data Analysis (TDA) specifically persistent homology to model users’ longitudinal posting histories as trajectories in semantic embedding space. Our approach reveals topological signatures of trajectory patterns: loops indicate cycling back to similar states (stagnation), while flares suggest exploring new coping strategies (growth). We propose Semantic Recovery Velocity (SRV), a novel metric quantifying the rate users move away from initial distress-focused posts in embedding space. Analyzing 15,847 r/depression trajectories and validating against multiple proxies, we demonstrate topological features predict self-reported improvement with 78.3% accuracy, outperforming sentiment baselines. This work contributes: (1) a TDA methodology for HCI mental health research, (2) interpretable topological signatures, and (3) design implications for adaptive mental health platforms with ethical guardrails.
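论文未在摘要中给出语义恢复速度(Semantic Recovery Velocity, SRV)的公式细节。下面按摘要描述给出一种假设性的实现思路:对用户各帖嵌入到首帖嵌入的距离随时间做最小二乘线性回归,以斜率近似"远离初始痛苦状态"的速度:

```python
import numpy as np

def semantic_recovery_velocity(embeddings, timestamps):
    """SRV 示意实现(非论文原始定义):

    计算每条帖子嵌入到用户首帖嵌入的欧氏距离,
    再对 (时间, 距离) 做一次线性拟合,返回斜率。
    斜率为正且较大 => 快速远离初始状态(恢复);
    斜率接近 0 => 停留在相似状态(停滞)。
    """
    emb = np.asarray(embeddings, dtype=float)
    t = np.asarray(timestamps, dtype=float)
    dist = np.linalg.norm(emb - emb[0], axis=1)
    slope, _ = np.polyfit(t, dist, 1)
    return slope

# 示例:嵌入随时间匀速远离首帖 vs 原地停滞
moving = semantic_recovery_velocity([[0, 0], [1, 0], [2, 0], [3, 0]], [0, 1, 2, 3])
stuck = semantic_recovery_velocity([[1, 1]] * 4, [0, 1, 2, 3])
```

该标量可与环(循环回到相似状态)、尖峰(探索新状态)等持久同调特征一起作为轨迹层面的预测特征。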
[HC-5] Human-Centered Multimodal Fusion for Sexism Detection in Memes with Eye-Tracking Heart Rate and EEG Signals LREC2026
【速读】:该论文旨在解决自动化识别网络迷因(meme)中性别歧视内容的难题,其核心挑战在于多模态歧义、文化语境的细微差异以及幽默手法带来的合理否认空间,传统仅依赖内容特征的模型难以准确捕捉人类感知的复杂性。解决方案的关键在于引入以人为中心的范式,通过采集受试者在观看3984张来自EXIST 2025数据集迷因时的眼动(Eye-Tracking, ET)、心率(Heart Rate, HR)和脑电图(Electroencephalography, EEG)等生理信号,揭示性别歧视内容与非歧视内容在认知负荷和神经活动上的显著差异,并构建融合生理信号与视觉-语言模型(Vision-Language Model, VLM)提取的增强文本-视觉特征的多模态融合模型。该方法在二分类任务中达到AUC 0.794,较强基线提升3.4%,尤其在细粒度类别“厌女与非性暴力”上F1-score提升26.3%,证明生理响应可作为客观感知信号,显著提升系统对在线性别歧视内容识别的准确性与人本意识。
链接: https://arxiv.org/abs/2602.23862
作者: Iván Arcos,Paolo Rosso,Elena Gomis-Vicent
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to appear in the Proceedings of LREC 2026
Abstract:The automated detection of sexism in memes is a challenging task due to multimodal ambiguity, cultural nuance, and the use of humor to provide plausible deniability. Content-only models often fail to capture the complexity of human perception. To address this limitation, we introduce and validate a human-centered paradigm that augments standard content features with physiological data. We created a novel resource by recording Eye-Tracking (ET), Heart Rate (HR), and Electroencephalography (EEG) from 16 subjects (8 per experiment) while they viewed 3984 memes from the EXIST 2025 dataset. Our statistical analysis reveals significant physiological differences in how subjects process sexist versus non-sexist content. Sexist memes were associated with higher cognitive load, reflected in increased fixation counts and longer reaction times, as well as differences in EEG spectral power across the Alpha, Beta, and Gamma bands, suggesting more demanding neural processing. Building on these findings, we propose a multimodal fusion model that integrates physiological signals with enriched textual-visual features derived from a Vision-Language Model (VLM). Our final model achieves an AUC of 0.794 in binary sexism detection, a statistically significant 3.4% improvement over a strong VLM-based baseline. The fusion is particularly effective for nuanced cases, boosting the F1-score for the most challenging fine-grained category, Misogyny and Non-Sexual Violence, by 26.3%. These results show that physiological responses provide an objective signal of perception that enhances the accuracy and human-awareness of automated systems for countering online sexism.
[HC-6] Feelings Not Feel: Affective Audio-Visual Pseudo-Haptics in Hand-Tracked XR
【速读】:该论文旨在解决无控制器扩展现实(XR)交互中缺乏触觉反馈的问题,其核心挑战在于如何通过非物理接触的方式提升用户对虚拟交互的感知质量。解决方案的关键在于将伪触觉(pseudo-haptic)线索视为情感反馈通道而非直接的触觉替代品:研究者设计了一个混合现实原型,在保持接触表面视觉中立的前提下,通过手部运动调制呈现纹理、颜色发光和与动作同步的声音等多模态伪触觉提示。实验结果表明,参与者虽未报告持续的触觉或温度感,但情绪状态显著变化——粗糙-热刺激降低愉悦度并增加唤醒度,而光滑-冷刺激则带来更平静愉悦的感受,这说明伪触觉在XR中更应被理解为一种影响主观感受的情感调节机制。
链接: https://arxiv.org/abs/2602.23747
作者: Kristian Paolo David,Tyrone Justin Sta Maria,Mikkel Dominic Gamboa,Jordan Aiko Deja
机构: University of the Philippines Los Baños (菲律宾大学洛斯巴ños分校); De La Salle University (德拉萨大学)
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 2 figures, 1 table, CHI EA Posters
Abstract:Hand-tracking enables controller-free XR interaction but lacks the tactile feedback controllers provide. Rather than treating this solely as a missing-sensation problem, we explore whether pseudo-haptic cues on an embodied virtual hand act as tactile or as affect substitutes that shape how interactions feel. We used a mixed reality prototype that keeps the contacted surface visually neutral, rendering cues on the hand with motion modulation for texture, color glow, and movement-coupled sound. In a within-subjects study (n=12), participants experienced 12 conditions (4 effects x 3 modalities: audio, visual, both) and reported subjective affect and cognitive demand. Participants rarely reported sustained tactile or thermal sensations, yet affect shifted systematically: rough-hot lowered valence and increased arousal, while smooth-cold produced calmer pleasant states. These findings suggest that pseudo-haptics in XR may be better understood as an affective feedback channel rather than a direct replacement for physical touch in controller-free systems.
[HC-7] Shape vs. Context: Examining Human–AI Gaps in Ambiguous Japanese Character Recognition
【速读】:该论文试图解决的问题是:高文本识别性能的视觉-语言模型(Vision-Language Models, VLMs)是否具备与人类相似的决策模式,尤其是在处理歧义时。为探究这一行为差异,研究者提出了一种基于连续插值的日本字符形状生成方法(通过β-VAE实现),并直接比较人类与VLM在单字符识别任务(仅依赖形状信息)和词级上下文中的决策边界。解决方案的关键在于利用形状插值生成具有渐变语义边界的测试样本,从而量化评估VLM响应与人类判断的一致性,并发现形状上下文可以改善部分条件下的对齐效果。这一方法揭示了VLM与人类在决策逻辑上的定性差异,为构建更贴近人类认知的VLM对齐基准提供了基础洞见。
链接: https://arxiv.org/abs/2602.23746
作者: Daichi Haraguchi
机构: CyberAgent
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CHI 2026 Poster track
Abstract:High text recognition performance does not guarantee that Vision-Language Models (VLMs) share human-like decision patterns when resolving ambiguity. We investigate this behavioral gap by directly comparing humans and VLMs using continuously interpolated Japanese character shapes generated via a \beta -VAE. We estimate decision boundaries in a single-character recognition (shape-only task) and evaluate whether VLM responses align with human judgments under shape in context (i.e., embedding an ambiguous character near the human decision boundary in word-level context). We find that human and VLM decision boundaries differ in the shape-only task, and that shape in context can improve human alignment in some conditions. These results highlight qualitative behavioral differences, offering foundational insights toward human–VLM alignment benchmarking.
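该论文沿 β-VAE 潜在空间的插值路径估计人类与 VLM 的决策边界。下面用一段示意性代码说明这一思路(假设性实现:以一个玩具线性打分函数代替真实的 β-VAE 解码与人类/VLM 判断,仅演示「插值 + 二分搜索边界」的过程,函数名均为本文自拟):

```python
import numpy as np

def interpolate(z_a, z_b, t):
    """在两个潜在向量之间做线性插值 (t=0 对应 z_a, t=1 对应 z_b)。"""
    return (1.0 - t) * z_a + t * z_b

def estimate_boundary(score, z_a, z_b, tol=1e-4):
    """二分搜索 score 符号翻转处的 t, 作为决策边界的估计。
    假设 score(z_a) < 0 < score(z_b) 且 score 沿路径单调。"""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(interpolate(z_a, z_b, mid)) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 玩具打分函数: 线性分类器, 真实边界位于 t = 0.3
z_a, z_b = np.zeros(2), np.ones(2)
toy_score = lambda z: z.sum() - 0.6
t_star = estimate_boundary(toy_score, z_a, z_b)  # 约等于 0.3
```

实际研究中,`score` 对应的是被试(或 VLM)对插值字符形状的类别判断,比较人与模型的 `t_star` 即可量化二者决策边界的差异。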
[HC-8] Does Personalized Nudging Wear Off? A Longitudinal Study of AI Self-Modeling for Behavioral Engagement
【速读】:该论文旨在解决行为改变技术(Behavior Change Technologies, BCTs)在长期使用中效果难以维持的核心挑战。其解决方案的关键在于采用人工智能自体建模(AI Self-Modeling),通过生成用户理想自我的个性化表征来激发行为改变动机。研究发现,视觉自体建模(Visual Self-Modeling, VSM)在短期内能显著提升健身参与度,并在4周内持续维持较高表现水平,尽管改善速率随时间递减;访谈进一步揭示了“催化剂效应”——即初期通过清晰、可实现的目标激发早期动机,随后经历习惯化与内在化过程,从而稳定行为表现。这一发现强调了个性化提示(personalized nudging)的时间动态特性,为设计具备长期用户参与能力的行为改变技术提供了实证依据与设计指导。
链接: https://arxiv.org/abs/2602.23688
作者: Qing He,Zeyu Wang,Yuzhou Du,Jiahuan Ding,Yuanchun Shi,Yuntao Wang
机构: University of Pennsylvania (宾夕法尼亚大学); Tsinghua University (清华大学); Virginia Tech (弗吉尼亚理工学院); Xi’an Jiaotong University (西安交通大学); Qinghai University (青海大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to CHI 2026 (ACM Conference on Human Factors in Computing Systems). The final version will appear in the ACM Digital Library
Abstract:Sustaining the effectiveness of behavior change technologies remains a key challenge. AI self-modeling, which generates personalized portrayals of one’s ideal self, has shown promise for motivating behavior change, yet prior work largely examines short-term effects. We present one of the first longitudinal evaluations of AI self-modeling in fitness engagement through a two-stage empirical study. A 1-week, three-arm experiment (visual self-modeling (VSM), auditory self-modeling (ASM), Control; N=28) revealed that VSM drove initial performance gains, while ASM showed no significant effects. A subsequent 4-week study (VSM vs. Control; N=31) demonstrated that VSM sustained higher performance levels but exhibited diminishing improvement rates after two weeks. Interviews uncovered a catalyst effect that fostered early motivation through clear, attainable goals, followed by habituation and internalization which stabilized performance. These findings highlight the temporal dynamics of personalized nudging and inform the design of behavior change technologies for long-term engagement.
[HC-9] The Compulsory Imaginary: AGI and Corporate Authority
【速读】:该论文试图解决的问题是:在生成式人工智能(Generative AI)领域,两家领先的通用人工智能(AGI)企业——OpenAI与Anthropic——如何通过一致的修辞策略构建社会技术想象(sociotechnical imaginaries),从而塑造公众对AGI未来发展的认知与合法性。其解决方案的关键在于识别出四种共享的修辞操作机制:自我豁免(self-exemption move)、目的论自然化(teleological naturalization)、有条件承认(qualified acknowledgment)以及隐含不可或缺性(implicit indispensability)。这些机制共同构成了一个结构性而非个体差异驱动的叙事框架,揭示了私营企业在推动颠覆性技术创新过程中如何系统性地投射和稳定其对未来的技术主导权,进而提出需重新审视现有制度安排以实现对该权威的外部制衡。
链接: https://arxiv.org/abs/2602.23679
作者: Emilio Barkett
机构: Columbia University (哥伦比亚大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper argues that the two leading AGI firms – OpenAI and Anthropic – construct sociotechnical imaginaries through a structurally consistent rhetorical strategy, despite meaningful differences in execution. Drawing on Jasanoff (2015)'s framework of sociotechnical imaginaries, the paper analyzes two essays published in late 2024: Sam Altman’s “The Intelligence Age” and Dario Amodei’s “Machines of Loving Grace.” Close comparative reading identifies four shared rhetorical operations: the self-exemption move, which disavows prophetic authority while exercising it; teleological naturalization, which embeds AGI’s arrival in narratives of historical inevitability; qualified acknowledgment, which absorbs concessions to risk into an optimistic frame; and implicit indispensability, which positions each firm as central to the imagined future without naming it as a commercial actor. That two competing institutions with different cultures, risk philosophies, and leaders with notably different public personae converge on the same rhetorical architecture suggests the imaginary reflects not only firm-level strategy but the institutional position these firms occupy. The paper extends the sociotechnical imaginaries framework from nation-states to private firms at the frontier of transformative technology development, identifies the discursive mechanism through which corporate authority over technological futures is projected and stabilized, and demonstrates that this mechanism is at minimum structural rather than idiosyncratic. The findings raise the question of what institutional arrangements would make that authority contestable from outside the firms that produce it.
[HC-10] Assessment of Display Performance and Comparative Evaluation of Web Map Libraries for Extensive 3D Geospatial Data
【速读】:该论文旨在解决大规模三维地理空间数据可视化在数字社会基础设施建设中的性能瓶颈问题,特别是在日本背景下对WebGL-based网页地图库(CesiumJS与MapLibre GL JS)的性能差异进行系统性评估。其关键解决方案在于采用标准化的3D Tiles 1.1和Mapbox Vector Tiles (MVT)格式,在不同数据尺度(第2级和第3级网格层级)下,基于Core Web Vitals指标(如FCP、LCP、TBT、CLS等)量化比较两种技术栈的渲染效率与用户体验表现,从而为特定应用场景提供可复现的技术选型依据,尤其揭示了MVT结合MapLibre GL JS在建筑模型可视化中具备最优性能(FCP 0.8s,TBT 0ms),而MapLibre GL JS在大规模点云处理上显著优于CesiumJS(TBT分别为3ms vs 21,357ms)。
链接: https://arxiv.org/abs/2602.23660
作者: Toshikazu Seto,Yohei Shiwaku,Takayuki Miyauchi,Daisuke Yoshida,Yuichiro Nishimura
机构: 未知
类目: Computers and Society (cs.CY); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures, 1 table
Abstract:Large-scale 3D geospatial data visualization has become increasingly critical for the development of the digital society infrastructure in Japan. This study conducted a comprehensive performance evaluation of two major WebGL-based web mapping libraries, CesiumJS and MapLibre GL JS, using large-scale 3D point-cloud data from the VIRTUAL SHIZUOKA and PLATEAU building models. The research employs standardized 3D Tiles 1.1, and Mapbox Vector Tiles (MVT) formats, comparing performance across different data scales (2nd and 3rd grid levels) using Core Web Vitals metrics, including First Contentful Paint (FCP), Largest Contentful Paint (LCP), Speed Index, Total Blocking Time (TBT), and Cumulative Layout Shift (CLS). The results demonstrate that MVT-based building visualization with MapLibre GL JS achieves optimal performance (FCP 0.8s, TBT 0ms), whereas MapLibre GL JS combined with this http URL shows superior performance for large-scale point cloud processing (TBT: 3ms, CesiumJS: 21,357ms). This study provides data-driven selection guidelines for appropriate technology choices according to use cases, establishing reproducible performance evaluation frameworks for 3D web mapping technologies during the WebGPU and OGC 3D Tiles 1.1 standardization era.
[HC-11] Critical Infrastructure in the Multi-Cloud Strategy: Use of Cloud Computing in SMEs
【速读】:该论文旨在解决中小企业(SMEs)在关键基础设施领域中采用云计算(Cloud computing)时面临的风险识别与部署模式优化问题。其解决方案的关键在于通过系统性地整合学术研究、行业实践、政府报告及在线文献的调查数据,揭示了SMEs在关键基础设施中广泛采用云计算的不同部署模型,并识别出其中存在的风险因素,从而为SMEs制定云策略提供实证依据,同时为政策制定者和业务支持机构设计更有效的云服务推广与风险管理方案提供参考。
链接: https://arxiv.org/abs/2602.23658
作者: Ruwan Nagahawatta(School of Accounting, Information Systems and Supply Chain RMIT University Victoria, Australia),Sachithra Lokuge(School of Business University of Southern Queensland Queensland, Australia),Matthew Warren(Centre for Cyber Security Research and Innovation RMIT University Victoria, Australia University of Johannesburg, South Africa),Scott Salzman(Department of Information Systems and Business Analytics Deakin University Victoria, Australia)
机构: 未知
类目: Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:
Abstract:Cloud computing enables cost-effective on-demand network access to a shared pool of configurable computing resources. The purpose of this paper is to examine and identify the use of Cloud computing in the critical infrastructure domain among small and medium-sized enterprises (SMEs). The data for this study were gathered from a survey of academic, industry, governmental, and online literature related to the use of Cloud computing in SMEs. The results revealed that there are risks involved in the use of Cloud computing, that SMEs are deploying Cloud computing using different deployment models, and that deployment has reached a high level within the critical infrastructure. The research findings are useful for SMEs that are planning to adopt or are already using Cloud computing, as well as for policymakers and the business support community engaged with Cloud computing initiatives.
[HC-12] When LLM s Help – and Hurt – Teaching Assistants in Proof-Based Courses
【速读】:该论文试图解决的问题是:在证明类课程(proof-based courses)中,助教(Teaching Assistants, TAs)承担的评分与反馈任务耗时且难以扩展,而大型语言模型(Large Language Models, LLMs)在该场景下的有效性尚不明确。为设计基于LLM的评分与反馈系统,首要前提是验证LLMs能否对TAs的评分和反馈工作提供有意义的支持。解决方案的关键在于通过一个多部分的案例研究(multi-part case study),作为技术探针(technology probe),比较LLM与不同经验水平的TAs在基于评分标准(rubric-based grading)上的评分一致性,并评估TAs对LLM生成反馈的感知价值。研究发现,LLM与TAs在评分决策上存在显著分歧,但LLM生成的反馈对存在重大错误的作业仍具实用价值,从而为未来人机协作的评分与反馈系统设计提供了实证依据和关键设计启示。
链接: https://arxiv.org/abs/2602.23635
作者: Romina Mahinpei,Sofiia Druchyna,Manoel Horta Ribeiro
机构: Princeton University (普林斯顿大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Teaching assistants (TAs) are essential to grading and feedback provision in proof-based courses, yet these tasks are time-intensive and difficult to scale. Although Large Language Models (LLMs) have been studied for grading and feedback, their effectiveness in proof-based courses is still unknown. Before designing LLM-based systems for this context, a necessary prerequisite is to understand whether LLMs can meaningfully assist TAs with grading and feedback. As such, we present a multi-part case study functioning as a technology probe in an undergraduate proof-based course. We compare rubric-based grading decisions made by an LLM and TAs with varying levels of expertise and examine TAs’ perceptions of feedback generated by an LLM. We find substantial disagreement between LLMs and TAs on grading decisions but that LLM-generated feedback can still be useful to TAs for submissions with major errors. We conclude by discussing design implications for human-AI grading and feedback systems in proof-based courses.
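摘要称 LLM 与 TA 在基于评分标准的评分决策上存在显著分歧。衡量这类评分者间一致性的常用指标是 Cohen's kappa(摘要未指明论文采用的具体一致性指标,以下仅为该指标的最小 numpy 实现示意,变量名为本文自拟):

```python
import numpy as np

def cohens_kappa(a, b):
    """两名评分者标注序列的一致性 (Cohen's kappa)。
    po 为观测一致率, pe 为随机期望一致率; kappa = (po - pe) / (1 - pe)。"""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)
    labels = np.union1d(a, b)
    pe = sum(np.mean(a == c) * np.mean(b == c) for c in labels)
    return (po - pe) / (1.0 - pe)

# 假想的 4 份作业评分 (1 = 通过, 0 = 不通过)
llm_grades = [1, 1, 1, 0]
ta_grades  = [1, 1, 0, 0]
kappa = cohens_kappa(llm_grades, ta_grades)  # 0.5, 属于中等一致性
```

kappa 接近 1 表示几乎完全一致,接近 0 表示与随机一致无异;证明题这类评分标准主观性强的任务,kappa 通常明显低于客观题。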
[HC-13] Improving Family Co-Play Experiences through Family-Centered Design
【速读】:该论文旨在解决用户生成虚拟世界(User-Generated Virtual Worlds, UGVWs)中实时危害对家庭协同游戏(co-play)体验的破坏问题,尤其关注父母难以干预的动态风险,如诱导性付费机制、未受监管的社交互动、游戏中涌现的行为以及可能强化有害意识形态的叙事设计。解决方案的关键在于重新设计UGVWs及其平台架构,以系统性识别和最小化这些交互式与涌现性危害,从而保障家庭共玩的安全性和教育价值,超越传统基于静态内容审核的社会媒体治理范式。
链接: https://arxiv.org/abs/2602.23596
作者: Zinan Zhang,Xinning Gui,Yubo Kou
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 4 pages, research statement for Family Centered Design Workshop at CHI 2024
Abstract:Cooperative play (co-play) is often positioned as a family-beneficial practice that can strengthen parent-child bonds and support parental mediation in games. Yet co-play in user-generated virtual worlds (UGVWs) can be disrupted by real-time harms that parents cannot easily prevent. Roblox, a platform with millions of user-generated virtual worlds and a large child player base, illustrates this challenge. Prior work on harmful UGVW design highlights risks beyond content problems, including manipulative monetization prompts, unmoderated social interactions, emergent in-world behaviors, and narrative designs that may normalize harmful ideologies. Current governance and moderation approaches, largely adapted from social media, focus on static artifacts and often fail to capture interactive and emergent harms in virtual worlds. This workshop paper asks: how might UGVWs and their platforms be designed to minimize harms that specifically impair family co-play experiences?
[HC-14] Exploring the Effect of Heights and User Stance on User Experience in Extended Reality Climbing
【速读】:该论文旨在解决虚拟环境(Virtual Environment, VE)中高度感知与用户姿态(坐姿或站姿)对用户体验(User Experience, UX)影响的问题,特别是在沉浸式攀岩模拟等高要求场景下的表现差异。其关键解决方案在于通过控制变量的实验设计,系统性地比较不同用户姿态在多种复杂虚拟场景中的沉浸感、感知高度及晕动症症状变化,结果表明:坐姿虽能提升部分沉浸体验,但显著增加运动病风险,尤其在环境规模大、视觉复杂度高的情况下;而站姿则保持了更稳定且一致的用户体验,验证了用户物理姿态是优化VE交互设计的核心参数之一。
链接: https://arxiv.org/abs/2602.23500
作者: Tanja Kojić,Nathan Kirchner,Maurizio Vergari,Maximilian Warsinke,Sebastian Möller,Jan-Niklas Voigt-Antons
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: Paper presented at 2025 17th International Conference on Quality of Multimedia Experience (QoMEX)
Abstract:Virtual environments (VEs) are increasingly used for immersive experiences, training simulations, and entertainment, yet factors such as height perception and user stance can significantly influence user experience (UX). Height perception in VEs plays a crucial role in shaping UX, particularly in immersive applications such as climbing simulations. This study investigates the effects of height in various VEs and examines how user stance, sitting or standing, impacts immersion, perceived height, and motion sickness. A user study was conducted with 25 participants who played through five randomized climbing scenarios, ranging from indoor climbing gyms to outdoor cityscapes and mountainous terrains. Participants’ UX was assessed using standardized questionnaires, including the IPQ for general presence, spatial presence, involvement, and experienced realism, as well as the SSQ to evaluate motion sickness symptoms such as nausea, oculomotor strain, and disorientation. Results indicate that seated participants experienced slightly higher immersion but were also more susceptible to motion sickness compared to those standing. While standing participants maintained consistent scores across different environments, seated participants reported increased immersion and discomfort as the VEs became larger, more physically demanding, and visually complex. Related DOI: https://doi.org/10.1109/QoMEX65720.2025.11219977
[HC-15] oo Immersive for the Field? Addressing Safety Risks in Extended Reality User Studies
【速读】:该论文旨在解决扩展现实(Extended Reality, XR)用户测试从实验室环境向家庭、学校及公共场所等真实场景迁移过程中所面临的安全风险问题,包括物理伤害、心理不适和可及性障碍等。其解决方案的关键在于提出一套实用的安全策略,以帮助研究人员在不同环境中安全且包容地开展XR研究,同时明确安全责任与操作标准,避免因缺乏统一指导而导致的安全隐患和伦理争议。
链接: https://arxiv.org/abs/2602.23497
作者: Tanja Kojić,Sara Srebot,Maurizio Vergari,Mirta Moslavac,Maximilian Warsinke,Sebastian Möller,Lea Skorin-Kapov,Jan-Niklas Voigt-Antons
机构: Quality and Usability Lab, TU Berlin (柏林工业大学质量与可用性实验室); University of Zagreb Faculty of Electrical Engineering and Computing (萨格勒布大学电气工程与计算学院); Immersive Reality Lab, Hochschule Hamm-Lippstadt (哈姆-利普施塔特应用技术大学沉浸式现实实验室)
类目: Human-Computer Interaction (cs.HC)
备注: Paper presented at the Mensch und Computer 2025 (MuC 2025)
Abstract:Extended Reality (XR) technologies are increasingly tested outside the lab, in homes, schools, and public spaces. While this shift enables more realistic user insights, it also introduces safety challenges that are often overlooked. Physical risks, psychological distress, and accessibility issues can be increased in field studies and unsupervised testing, such as at home or crowdsourced trials. Without clear instructions, safety decisions are left to individual researchers, raising questions of responsibility and consistency. This position paper outlines key safety risks in XR user testing beyond the lab and calls for practical strategies that are needed to help researchers run XR studies in a safe and inclusive way across different environments.
[HC-16] dynote: Always-Clear Notebook Authoring
【速读】:该论文旨在解决笔记本(Notebook)在探索性分析过程中缺乏对清晰度(Clarity)支持的问题,从而影响用户在撰写和调试代码时的效率与可读性。为应对这一挑战,作者提出“始终清晰的笔记本编写”(Always-clear Notebook Authoring)范式,并开发了名为 Tidynote 的 Jupyter 插件实现该理念。其解决方案的关键在于三点:(1)引入一个用于探索的侧边栏(scratchpad sidebar),(2)允许单元格在主笔记本与侧边栏之间自由移动以维持结构化组织,以及(3)采用线性执行结合状态分叉(linear execution with state forks)机制来明确程序状态变化,从而在整个笔记本生命周期中系统性地提升清晰度。
链接: https://arxiv.org/abs/2602.23490
作者: Ruanqianqian Huang,Brian Hempel,Yining Cao,James D. Hollan,Haijun Xia,Sorin Lerner
机构: University of California San Diego (加州大学圣地亚哥分校); Cornell University (康奈尔大学)
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CHI 2026
Abstract:Recent work identified clarity as one of the top quality attributes that notebook users value, but notebooks lack support for maintaining clarity throughout the exploratory phases of the notebook authoring workflow. We propose always-clear notebook authoring that supports both clarity and exploration, and present a Jupyter implementation called Tidynote. The key to Tidynote is three-fold: (1) a scratchpad sidebar to facilitate exploration, (2) cells movable between the notebook and the scratchpad to maintain organization, and (3) linear execution with state forks to clarify program state. An exploratory study (N=13) of open-ended data analysis tasks shows that Tidynote features holistically promote clarity throughout a notebook’s lifecycle, support realistic notebook tasks, and enable novel strategies for notebook clarity. These results suggest that Tidynote supports maintaining clarity throughout the entirety of notebook authoring.
[HC-17] Walking with Robots: Video Analysis of Human-Robot Interactions in Transit Spaces
【速读】:该论文旨在解决公共空间中机器人与人类互动能力不足的问题,特别是在缺乏对人类行为社会性理解的情况下,机器人可能扰乱既有社会秩序。解决方案的关键在于探索一种“社会感知的移动”(socially-aware movement)新设计空间,通过构建“强概念”(strong concepts),将移动视为一种交互与协作的成果,从而提升机器人在日常公共生活节奏中的融合能力。
链接: https://arxiv.org/abs/2602.23475
作者: Barry Brown,Hannah Pelikan,Mathaius Broth
机构: University of Copenhagen (哥本哈根大学); Stockholm University (斯德哥尔摩大学); Linköping University (林雪平大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:The proliferation of robots in public spaces necessitates a deeper understanding of how these robots can interact with those they share the space with. In this paper, we present findings from video analysis of publicly deployed cleaning robots in a transit space, a major commercial airport, using their navigational ‘troubles’ as a tool to document what robots currently lack in interactional competence. We demonstrate that these robots, while technically proficient, can disrupt the social order of a space due to their inability to understand core aspects of human movement: mutual adjustment to others, the significance of understanding social groups, and the purpose of different locations. In discussion we argue for exploring a new design space of movement: socially-aware movement. By developing “strong concepts” that treat movement as an interactional and collaborative accomplishment, we can create systems that better integrate into the everyday rhythms of public life.
[HC-18] Challenges in Automatic Speech Recognition for Adults with Cognitive Impairment
【速读】:该论文旨在解决语音识别(ASR)在认知障碍人群中的性能下降问题,特别是针对阿尔茨海默病及相关痴呆(ADRD)患者使用语音助手时的可用性差距。研究表明,患有痴呆的个体ASR错误率显著高于认知正常和轻度认知障碍者,而这种差异主要由声学特征如说话强度、语音质量及停顿比例决定。解决方案的关键在于通过声学特征分析识别影响ASR准确性的核心因素,并据此提出人机交互(HCI)设计改进策略,包括个性化语音识别模型、人工介入校正ASR结果以及基于用户能力的交互层级自适应机制,从而提升AgeTech语音界面对于认知障碍用户的适用性和有效性。
链接: https://arxiv.org/abs/2602.23436
作者: Michelle Cohn,Alyssa Lanzi,Yui Ishihara,Chen-Nee Chuah,Georgia Zellou,Alyssa Weakley
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Millions of people live with cognitive impairment from Alzheimer’s disease and related dementias (ADRD). Voice-enabled smart home systems offer promise for supporting daily living but rely on automatic speech recognition (ASR) to transcribe their speech to text. Prior work has shown reduced ASR performance for adults with cognitive impairment; however, the acoustic factors underlying these disparities remain poorly understood. This paper evaluates ASR performance for 83 older adults across cognitive groups (cognitively normal, mild cognitive impairment, dementia) reading commands to a voice assistant (Amazon Alexa). Results show that ASR errors are significantly higher for individuals with dementia, revealing a critical usability gap. To better understand these disparities, we conducted an acoustic analysis of speech features and found that a speaker’s intensity, voice quality, and pause ratio predicted ASR accuracy. Based on these findings, we outline HCI design implications for AgeTech and voice interfaces, including speaker-personalized ASR, human-in-the-loop correction of ASR transcripts, and interaction-level personalization to support ability-based adaptation.
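摘要指出说话强度、嗓音质量与停顿比例可预测 ASR 准确率。下面以 numpy 给出「停顿比例」这一声学特征的一种常见估计方式(示意性实现:帧长与能量阈值均为示例假设,并非论文原设定):

```python
import numpy as np

def pause_ratio(signal, fs, frame_ms=25, thresh_db=-40.0):
    """按帧计算相对 RMS 能量 (dB), 低于阈值的帧视为停顿, 返回停顿帧占比。"""
    frame = int(fs * frame_ms / 1000)
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    db = 20 * np.log10(rms + 1e-12)
    db -= db.max()                       # 以最响帧为 0 dB 参考
    return float(np.mean(db < thresh_db))

# 合成示例: 前 1 秒为正弦"语音", 后 1 秒为静音, 停顿比例应约为 0.5
fs = 16000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 220 * t)
sig = np.concatenate([speech, np.zeros(fs)])
ratio = pause_ratio(sig, fs)
```

类似地,逐帧 RMS 的均值即可作为说话强度特征;将这类特征与 ASR 词错率做回归,即可检验其预测能力。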
[HC-19] Now You See Me: Designing Responsible AI Dashboards for Early-Stage Health Innovation
【速读】:该论文旨在解决早期健康科技(HealthTech)团队在资源受限环境下,如何平衡伦理期望与组织优先事项以实现负责任的人工智能(Responsible AI)实践的问题。当前的 Responsible AI 实践常因抽象或脱离实际运营场景而难以落地,导致弱势项目和创始团队被边缘化,限制了医疗AI系统中问题领域、解决方案、利益相关者视角及人群数据集的多样性。其核心解决方案是通过以创新导向的人机交互(Human-Computer Interaction, HCI)方法为基础,设计并开发结构化的可视化治理工具(governance-oriented visualization systems),这些工具需基于领域知识、协同共创、贴合组织成熟度与具体情境,并支持多元角色与任务需求。关键在于将可视化作为社会技术治理机制(sociotechnical governance artifacts),从而在AI生命周期中促进决策透明性、问责制与公平性,推动更广泛且可持续的医疗AI创新生态发展。
链接: https://arxiv.org/abs/2602.23378
作者: Svitlana Surodina,Sinem Görücü,Lili Golmohammadi,Emelia Delaney,Rita Borgo
机构: King’s College London(伦敦国王学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Innovative HealthTech teams develop Artificial Intelligence (AI) systems in contexts where ethical expectations and organizational priorities must be balanced under severe resource constraints. While Responsible AI practices are expected to guide the design and evaluation of such systems, they frequently remain abstract or poorly aligned with the operational realities of early-stage innovation. At the ecosystem level, this misalignment disproportionately affects disadvantaged projects and founders, therefore limiting the diversity of problem-areas under consideration, solutions, stakeholder perspectives, and population datasets represented in AI-enabled healthcare systems. Visualization provides a practical mechanism for supporting decision-making across the AI lifecycle. When developed via a rigorous and collaborative design process, structured on domain knowledge and designed around real-world constraints, visual interfaces can operate as effective sociotechnical governance artifacts enabling responsible decision-making. Grounded in innovation-oriented Human-Centered Computing methodologies, we synthesize insights from a series of design studies conducted via a longitudinal visualization research program, a case study centered on governance dashboard design in a translational setting, and a survey of a cohort of early-stage HealthTech startups. Based on these findings, we articulate design process implications for governance-oriented visualization systems: co-creation with stakeholders, alignment with organizational maturity and context, and support for heterogeneous roles and tasks among others. This work contributes actionable guidance for designing Responsible AI governance dashboards that support decision-making and accountability in early-stage health innovation, and suggests that ecosystem-level coordination can enable more scalable and diverse AI innovation in healthcare. 
[HC-20] Complex Cognition: A New Theoretical Foundation for the Design and Evaluation of Visual Analytics Systems
【速读】:该论文试图解决当前视觉分析(Visual Analytics)系统研究中存在的重要理论矛盾:即现有研究方法多基于人机交互(Human-Computer Interaction, HCI)范式,侧重于简单认知行为(如颜色感知、空间关系感知)的分析,但其研究目标却聚焦于复杂分析行为(如推理、问题解决与决策制定),导致研究设计在内部效度(internal validity)和方法外部推广性(external validity)方面受限。解决方案的关键在于突破传统HCI理论基础,引入复杂认知理论(Complex Cognition Theories),以构建新的理论框架,从而更有效地指导视觉分析系统的系统设计与评估,提升其对复杂认知任务的支持能力与研究成果的科学严谨性。
链接: https://arxiv.org/abs/2602.23377
作者: Xiaolong Zhang (Luke)
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 27 pages, 1 figure and 1 table
Abstract:Current research on visual analytics systems largely follows the research paradigm of interactive system design in the field of Human-Computer Interaction (HCI), and includes key methodologies including design requirement development based on user needs, interactive system design, and system evaluation. However, most studies under this paradigm have a contradiction: there is a significant mismatch between the research methods developed for simple cognitive behaviors (e.g., color perception, the perception of spatial relationship among interactive artifacts) and research goals targeting for complex analytical behaviors (e.g., reasoning, problem-solving, decision-making). This mismatch may hurt the theoretical contributions of research studies, in particularly the internal validity of a designed system and the external validity of design methods. To address this challenge, this paper argues for a need to go beyond traditional HCI theoretical foundations and proposes to adopt complex cognition theories to build new theoretical foundations. Specifically, this paper analyzes how current design and evaluation methods in research on visual analytics systems constrain the internal and external validity of research, discusses the connections between complex cognition theories and visual analytics tasks, and explores how problem-solving theories from complex cognition can guide research on visual analytics systems.
[HC-21] Dynamic Personalization Through Continuous Feedback Loops in Interactive AI Systems
【速读】:该论文旨在解决传统交互式人工智能(Interactive AI)系统在个性化推荐中因依赖静态用户画像和预设规则而难以捕捉用户偏好与情境动态变化的问题。其解决方案的关键在于引入持续反馈循环机制,通过实时收集并分析用户反馈,使个人化算法能够动态调整推荐内容、响应策略和交互方式,从而更贴合用户的当前情境与偏好;同时,作者提供了该自适应个性化算法的收敛性和后悔边界理论保障,并在推荐系统、虚拟助手和自适应学习平台三个场景中验证了该方法在提升用户满意度(15–23%)的同时保持计算效率的可行性。
链接: https://arxiv.org/abs/2602.23376
作者: Liu He
机构: Beijing Institute of Technology (北京理工大学)
类目: Human-Computer Interaction (cs.HC)
备注: 14 pages,1 figures
Abstract:Interactive AI systems, such as recommendation engines and virtual assistants, commonly use static user profiles and predefined rules to personalize interactions. However, these methods often fail to capture the dynamic nature of user preferences and context. This study proposes a theoretical framework and practical implementation for integrating continuous feedback loops into personalization algorithms to enable real-time adaptation. By continuously collecting and analyzing user feedback, the AI system can dynamically adjust its recommendations, responses, and interactions to better align with the user’s current context and preferences. We provide theoretical guarantees for the convergence and regret bounds of our adaptive personalization algorithm. Our experimental evaluation across three domains-recommendation systems, virtual assistants, and adaptive learning platforms-demonstrates that dynamic personalization improves user satisfaction by 15-23% compared to static methods while maintaining computational efficiency. We investigated the implementation challenges of continuous feedback mechanisms, evaluated their impact on user experience and satisfaction, and provided a comprehensive analysis of the trade-offs between personalization quality, computational overhead, and user fatigue.
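摘要未给出算法细节;下面用一个指数加权的在线更新规则示意「持续反馈循环 → 实时调整偏好估计」的基本思想(假设性最小实现:学习率、噪声水平与偏好维度均为示例假设,并非论文原算法):

```python
import numpy as np

rng = np.random.default_rng(0)
true_pref = np.array([0.8, 0.2, 0.6])   # 用户对 3 类内容的真实偏好 (系统未知)
est = np.zeros(3)                        # 系统维护的动态偏好估计
lr = 0.1                                 # 学习率: 控制对新反馈的响应速度

for step in range(600):
    item = rng.integers(0, 3)                          # 向用户展示某类内容
    feedback = true_pref[item] + rng.normal(0, 0.05)   # 含噪声的即时反馈
    est[item] += lr * (feedback - est[item])           # 在线增量更新

max_err = float(np.abs(est - true_pref).max())  # 估计应收敛到真实偏好附近
```

相比静态画像,这种更新让估计持续跟踪反馈;学习率的取舍正对应摘要所说的个性化质量、计算开销与用户疲劳之间的权衡:lr 越大适应越快,但估计受单次噪声反馈的扰动也越大。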
[HC-22] From Continuous sEMG Signals to Discrete Muscle State Tokens: A Robust and Interpretable Representation Framework
【速读】:该论文旨在解决表面肌电信号(surface electromyography, sEMG)在跨个体间存在显著差异且易受噪声干扰的问题,从而影响其解码的鲁棒性和可解释性。解决方案的关键在于提出一种基于生理信息启发的离散化表示方法,通过与最小肌肉收缩周期对齐的滑动窗口分离单个肌肉激活事件,并从中提取10个时频特征(如均方根值RMS和中位频率MDF),再利用K-means聚类生成具有代表性的肌肉状态标记(token)。该方法不仅显著提升了模型在动作识别任务中的性能(如ViT模型Top-1准确率达75.5%),同时实现了96%的输入维度压缩,且保持了高跨个体一致性(Cohen’s Kappa = 0.82±0.09)和对运动质量评估的直观可解释性。
链接: https://arxiv.org/abs/2602.23738
作者: Yuepeng Chen,Kaili Zheng,Ji Wu,Zhuangzhuang Li,Ye Ma,Dongwei Liu,Chenyi Guo,Xiangling Fu
机构: Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC)
备注:
Abstract:Surface electromyography (sEMG) signals exhibit substantial inter-subject variability and are highly susceptible to noise, posing challenges for robust and interpretable decoding. To address these limitations, we propose a discrete representation of sEMG signals based on a physiology-informed tokenization framework. The method employs a sliding window aligned with the minimal muscle contraction cycle to isolate individual muscle activation events. From each window, ten time-frequency features, including root mean square (RMS) and median frequency (MDF), are extracted, and K-means clustering is applied to group segments into representative muscle-state tokens. We also introduce a large-scale benchmark dataset, ActionEMG-43, comprising 43 diverse actions and sEMG recordings from 16 major muscle groups across the body. Based on this dataset, we conduct extensive evaluations to assess the inter-subject consistency, representation capacity, and interpretability of the proposed sEMG tokens. Our results show that the token representation exhibits high inter-subject consistency (Cohen’s Kappa = 0.82±0.09), indicating that the learned tokens capture consistent and subject-independent muscle activation patterns. In action recognition tasks, models using sEMG tokens achieve Top-1 accuracies of 75.5% with ViT and 67.9% with SVM, outperforming raw-signal baselines (72.8% and 64.4%, respectively), despite a 96% reduction in input dimensionality. In movement quality assessment, the tokens intuitively reveal patterns of muscle underactivation and compensatory activation, offering interpretable insights into neuromuscular control. Together, these findings highlight the effectiveness of tokenized sEMG representations as a compact, generalizable, and physiologically meaningful feature space for applications in rehabilitation, human-machine interaction, and motor function analysis.
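上述「滑动窗 → 时频特征 → K-means 聚类生成肌肉状态 token」的流程可以用一段 numpy 示意代码说明(非论文原实现:论文对每窗提取 10 项特征,此处仅示意摘要点名的 RMS 与中位频率 (MDF) 两项,窗口长度与采样率均为假设参数):

```python
import numpy as np

def window_features(window, fs):
    """单个滑动窗的两项时频特征: RMS 与中位频率 (MDF)。"""
    rms = np.sqrt(np.mean(window ** 2))
    power = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), 1.0 / fs)
    cum = np.cumsum(power)
    mdf = freqs[np.searchsorted(cum, cum[-1] / 2)]  # 累积功率过半处的频率
    return np.array([rms, mdf])

def tokenize(signal, fs, win, k, iters=25, seed=0):
    """滑动窗提特征 -> 标准化 -> 简易 K-means 聚成 k 个肌肉状态 token。"""
    feats = np.array([window_features(signal[i:i + win], fs)
                      for i in range(0, len(signal) - win + 1, win)])
    f = (feats - feats.mean(0)) / (feats.std(0) + 1e-8)
    rng = np.random.default_rng(seed)
    centers = f[rng.choice(len(f), size=k, replace=False)]
    for _ in range(iters):
        labels = ((f[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = f[labels == j].mean(0)
    return labels

# 合成信号: 低幅低频段 + 高幅高频段, 应被聚为两种不同 token
fs, win = 1000, 200
t = np.arange(10 * win) / fs
low = 0.1 * np.sin(2 * np.pi * 40 * t)    # 低 RMS、低 MDF
high = 1.0 * np.sin(2 * np.pi * 160 * t)  # 高 RMS、高 MDF
tokens = tokenize(np.concatenate([low, high]), fs, win, k=2)
```

每个窗口由一个离散 token 代替原始波形,正是摘要中「96% 输入维度压缩」的来源;跨被试共用同一套聚类中心,才能得到摘要报告的高跨个体一致性。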
计算机视觉
[CV-0] UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images ICLR2026
【速读】:该论文旨在解决从无姿态约束的图像对中进行密集4D重建的难题,现有方法通常依赖于耗时的测试阶段优化或碎片化、任务特定的前馈模型。其解决方案的关键在于提出UFO-4D——一种统一的前馈框架,直接从一对未标注图像中估计动态3D高斯斑点(Dynamic 3D Gaussian Splats),从而在前馈过程中联合且一致地恢复三维几何结构、三维运动和相机位姿。核心创新在于通过可微渲染同一动态3D高斯表示中的多种信号(如外观、深度与运动),实现自监督图像合成损失,并紧密耦合多模态信息;由于所有模态共享相同的几何基元,某一模态的监督能自然正则化并提升其他模态性能,从而有效缓解数据稀缺问题,使UFO-4D在几何、运动和相机位姿联合估计上相较以往方法提升达3倍。
链接: https://arxiv.org/abs/2602.24290
作者: Junhwa Hur,Charles Herrmann,Songyou Peng,Philipp Henzler,Zeyu Ma,Todd Zickler,Deqing Sun
机构: Google(谷歌); Princeton University (普林斯顿大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2026, Project page: this https URL
Abstract:Dense 4D reconstruction from unposed images remains a critical challenge, with current methods relying on slow test-time optimization or fragmented, task-specific feedforward models. We introduce UFO-4D, a unified feedforward framework to reconstruct a dense, explicit 4D representation from just a pair of unposed images. UFO-4D directly estimates dynamic 3D Gaussian Splats, enabling the joint and consistent estimation of 3D geometry, 3D motion, and camera pose in a feedforward manner. Our core insight is that differentiably rendering multiple signals from a single Dynamic 3D Gaussian representation offers major training advantages. This approach enables a self-supervised image synthesis loss while tightly coupling appearance, depth, and motion. Since all modalities share the same geometric primitives, supervising one inherently regularizes and improves the others. This synergy overcomes data scarcity, allowing UFO-4D to outperform prior work by up to 3 times in joint geometry, motion, and camera pose estimation. Our representation also enables high-fidelity 4D interpolation across novel views and time. Please visit our project page for visual results: this https URL
[CV-1] Mode Seeking meets Mean Seeking for Fast Long Video Generation ECAI
【速读】:该论文旨在解决视频生成从秒级扩展到分钟级时面临的核心瓶颈问题:短时视频数据虽丰富且保真度高,但长时视频数据稀缺且应用场景受限。其解决方案的关键在于提出一种“模式寻找(Mode Seeking)与均值寻找(Mean Seeking)相结合”的训练范式,通过统一表示框架下的解耦扩散Transformer(Decoupled Diffusion Transformer),将局部保真度与长期连贯性分离建模:其中全局流匹配(Flow Matching)头利用有限长视频进行监督学习以捕获叙事结构,同时局部分布匹配(Distribution Matching)头通过模式导向的反KL散度对齐滑动窗口段至冻结的短时视频教师模型,从而在少量长视频监督下实现分钟级视频的高效生成,显著提升局部清晰度、运动流畅性及长程一致性。
链接: https://arxiv.org/abs/2602.24289
作者: Shengqu Cai,Weili Nie,Chao Liu,Julius Berner,Lvmin Zhang,Nanye Ma,Hansheng Chen,Maneesh Agrawala,Leonidas Guibas,Gordon Wetzstein,Arash Vahdat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project website: this https URL
Abstract:Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: this https URL.
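摘要中“Mode Seeking”与“Mean Seeking”的区分来自两种 KL 散度方向的标准性质。以下为这两个散度的通用定义(非论文原文公式),其中 $p$ 为教师分布、$q$ 为学生分布:

```latex
% 前向 KL(均值寻找 / 质量覆盖):最小化它迫使 q 覆盖 p 的所有高概率区域
D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]

% 反向 KL(模式寻找):q 倾向于收缩到 p 的某个峰上,避开 p 的低概率区域
D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]
```

按摘要所述,局部分布匹配头对冻结短视频教师采用的正是后一种(反向 KL)目标,以继承教师的局部真实感;而全局流匹配头通过监督学习覆盖长视频的叙事结构。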
[CV-2] Hierarchical Action Learning for Weakly-Supervised Action Segmentation
【速读】:该论文旨在解决视频理解中层次化推理难以实现的问题,特别是机器在弱监督场景下对动作分割时容易过度切分(over-segment)的挑战。其核心问题是:人类通过关键转换(key transitions)在多抽象层级上感知动作,而现有方法依赖视觉特征易导致低层过细分割,缺乏对高层动作语义的稳定建模。解决方案的关键在于提出一种层次化动作学习(Hierarchical Action Learning, HAL)模型,该模型基于一个关键观察——低层视觉变量变化快、高层动作变量演化慢,从而设计了具有不同时间尺度的因果生成过程:高层动作潜变量(latent action variable)控制低层视觉特征的动态演化,并引入确定性过程对齐不同时序的潜变量;同时采用分层金字塔Transformer架构捕获多尺度特征,并施加稀疏转换约束以强化高层动作变量的缓慢变化特性,从而提升其可识别性。理论证明在温和假设下,高层动作潜变量是严格可识别的,实验表明HAL显著优于现有弱监督动作分割方法。
链接: https://arxiv.org/abs/2602.24275
作者: Junxian Huang,Ruichu Cai,Hao Zhu,Juntao Fang,Boyan Xu,Weilin Chen,Zijian Li,Shenghua Gao
机构: Guangdong University of Technology (广东工业大学); Carnegie Mellon University (卡内基梅隆大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Humans perceive actions through key transitions that structure actions across multiple abstraction levels, whereas machines, relying on visual features, tend to over-segment. This highlights the difficulty of enabling hierarchical reasoning in video understanding. Interestingly, we observe that lower-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where high-level latent action governs the dynamics of low-level visual features. To model these varying timescales effectively, we introduce deterministic processes to align these latent variables over time. The HAL model employs a hierarchical pyramid transformer to capture both visual features and latent variables, and a sparse transition constraint is applied to enforce the slower dynamics of high-level action variables. This mechanism enhances the identification of these latent variables over time. Under mild assumptions, we prove that these latent action variables are strictly identifiable. Experimental results on several benchmarks show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation, confirming its practical effectiveness in real-world applications.
[CV-3] Compositional Generalization Requires Linear Orthogonal Representations in Vision Embedding Models
【速读】:该论文旨在解决现代神经网络模型在面对未见过的组合输入时,如何实现**组合泛化(compositional generalization)的问题。尽管当前模型基于大规模数据集训练,但其覆盖的输入组合仅占全部可能组合的一小部分,因此亟需理解表征结构应具备何种特性才能支持对新组合的有效识别。论文提出三个必要条件:可分性(divisibility)、可迁移性(transferability)和稳定性(stability),并证明这些条件强制要求表征必须线性分解为每个概念的独立分量,且不同概念的分量之间需正交。这一理论推导揭示了线性表征假设(Linear Representation Hypothesis)**的必要性——即广泛观测到的神经表征线性结构实为组合泛化的必然结果。关键解决方案在于从理论上建立组合泛化与表征几何之间的约束关系,并通过实验验证现代视觉模型(如CLIP、SigLIP、DINO)中存在低秩、近正交的局部线性因子结构,且该结构程度与组合泛化能力显著相关。
链接: https://arxiv.org/abs/2602.24264
作者: Arnas Uselis,Andrea Dittadi,Seong Joon Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Compositional generalization, the ability to recognize familiar parts in novel contexts, is a defining property of intelligent systems. Although modern models are trained on massive datasets, they still cover only a tiny fraction of the combinatorial space of possible inputs, raising the question of what structure representations must have to support generalization to unseen combinations. We formalize three desiderata for compositional generalization under standard training (divisibility, transferability, stability) and show they impose necessary geometric constraints: representations must decompose linearly into per-concept components, and these components must be orthogonal across concepts. This provides theoretical grounding for the Linear Representation Hypothesis: the linear structure widely observed in neural representations is a necessary consequence of compositional generalization. We further derive dimension bounds linking the number of composable concepts to the embedding geometry. Empirically, we evaluate these predictions across modern vision models (CLIP, SigLIP, DINO) and find that representations exhibit partial linear factorization with low-rank, near-orthogonal per-concept factors, and that the degree of this structure correlates with compositional generalization on unseen combinations. As models continue to scale, these conditions predict the representational geometry they may converge to. Code is available at this https URL.
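下面用一个玩具例子示意摘要中的几何主张(纯说明性质的假想示例,与论文实验无关):若嵌入可以线性分解为各概念分量之和,且不同概念的分量相互正交,则即使某个组合在训练中从未出现,仍可通过向各概念方向投影无歧义地读出每个概念:

```python
# 4 维嵌入空间中两组相互正交的概念方向(假想数据)
COLOR = {"red": (1, 0, 0, 0), "blue": (0, 1, 0, 0)}
SHAPE = {"cube": (0, 0, 1, 0), "ball": (0, 0, 0, 1)}

def embed(color, shape):
    # 线性分解假设:嵌入 = 颜色分量 + 形状分量
    return tuple(c + s for c, s in zip(COLOR[color], SHAPE[shape]))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def read_out(z, concept_dirs):
    # 正交性保证:向某一概念方向投影时,另一概念的分量贡献为零
    return max(concept_dirs, key=lambda name: dot(z, concept_dirs[name]))

# 未见过的组合 (red, ball) 依然可以被正确分解
z = embed("red", "ball")
color_pred = read_out(z, COLOR)
shape_pred = read_out(z, SHAPE)
```

若两组概念方向不正交,投影读出就会互相串扰,这正是论文论证“组合泛化要求正交线性分量”的直观含义。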
[CV-4] Histopathology Image Normalization via Latent Manifold Compaction
【速读】:该论文旨在解决组织病理学图像中因染色协议、扫描仪及采集流程等技术差异导致的批次效应(batch effects)问题,此类效应严重阻碍了计算病理模型在不同批次间的泛化能力,限制其在多临床场景下的可靠部署。解决方案的关键在于提出一种无监督表示学习框架——潜在流形压缩(Latent Manifold Compaction, LMC),通过显式压缩由染色诱导的潜在特征流形,从单一源数据集中学习与批次无关的嵌入表示(batch-invariant embeddings),从而实现对训练阶段未见的目标域数据的良好泛化能力。
链接: https://arxiv.org/abs/2602.24251
作者: Xiaolong Zhang,Jianwei Zhang,Selim Sevim,Emek Demir,Ece Eksi,Xubo Song
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge for computational pathology, hindering cross-batch generalization and limiting reliable deployment of models across clinical sites. In this work, we introduce Latent Manifold Compaction (LMC), an unsupervised representation learning framework that performs image harmonization by learning batch-invariant embeddings from a single source dataset through explicit compaction of stain-induced latent manifolds. This allows LMC to generalize to target domain data unseen during training. Evaluated on three challenging public and in-house benchmarks, LMC substantially reduces batch-induced separations across multiple datasets and consistently outperforms state-of-the-art normalization methods in downstream cross-batch classification and detection tasks, enabling superior generalization.
[CV-5] Joint Geometric and Trajectory Consistency Learning for One-Step Real-World Super-Resolution
【速读】:该论文针对扩散模型在真实世界图像超分辨率(Real-ISR)任务中因迭代采样导致的高计算成本问题,以及现有蒸馏方法受限于教师模型能力边界和参数量过大的缺陷,提出了一种轻量级一致性模型训练范式GTASR(Geometric Trajectory Alignment Super-Resolution)。其关键在于:一是引入轨迹对齐(Trajectory Alignment, TA)策略,通过全路径投影修正切向量场以缓解一致性漂移;二是设计双参考结构校正(Dual-Reference Structural Rectification, DRSR)机制,强制约束生成轨迹的结构一致性,从而解决“几何解耦”现象,即像素级对齐但结构失真问题。该方案在保持极低延迟的同时显著提升了重建质量与结构保真度。
链接: https://arxiv.org/abs/2602.24240
作者: Chengyan Deng,Zhangquan Chen,Li Yu,Kai Zhang,Xue Zhou,Wang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based Real-World Image Super-Resolution (Real-ISR) achieves impressive perceptual quality but suffers from high computational costs due to iterative sampling. While recent distillation approaches leveraging large-scale Text-to-Image (T2I) priors have enabled one-step generation, they are typically hindered by prohibitive parameter counts and the inherent capability bounds imposed by teacher models. As a lightweight alternative, Consistency Models offer efficient inference but struggle with two critical limitations: the accumulation of consistency drift inherent to transitive training, and a phenomenon we term “Geometric Decoupling” - where the generative trajectory achieves pixel-wise alignment yet fails to preserve structural coherence. To address these challenges, we propose GTASR (Geometric Trajectory Alignment Super-Resolution), a simple yet effective consistency training paradigm for Real-ISR. Specifically, we introduce a Trajectory Alignment (TA) strategy to rectify the tangent vector field via full-path projection, and a Dual-Reference Structural Rectification (DRSR) mechanism to enforce strict structural constraints. Extensive experiments verify that GTASR delivers superior performance over representative baselines while maintaining minimal latency. The code and model will be released at this https URL.
[CV-6] Enhancing Spatial Understanding in Image Generation via Reward Modeling CVPR2026
【速读】:该论文旨在解决当前文本到图像生成(text-to-image generation)模型在处理复杂空间关系时表现不足的问题,尤其是在编码精细空间结构方面存在困难,导致生成结果不理想且需多次采样才能获得满意效果。其解决方案的关键在于构建了一个包含超过80,000对偏好样本的SpatialReward-Dataset,并基于此数据集训练出一个名为SpatialScore的奖励模型(reward model),用于精准评估生成图像中空间关系的准确性。该奖励模型不仅性能超越主流商业模型,在空间理解任务上表现优异,还能有效支持在线强化学习(online reinforcement learning),从而显著提升图像生成模型在复杂空间布局上的理解和生成能力。
链接: https://arxiv.org/abs/2602.24233
作者: Zhenyu Tang,Chaoran Feng,Yufan Deng,Jie Wu,Xiaojie Li,Rui Wang,Yunpeng Chen,Daquan Zhou
机构: Peking University (北京大学); ByteDance Seed (字节跳动种子团队)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026. Github: this https URL Project website: this https URL
Abstract:Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity-particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we build SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for the complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
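论文摘要未给出 SpatialScore 的具体训练目标;在偏好对数据上训练奖励模型的通行做法是 Bradley-Terry 式成对损失。以下仅为该通行目标的示意(非论文实现,分数与函数名均为假设):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry 成对偏好损失:鼓励奖励模型给
    空间关系正确(被偏好)的图像打出更高的分数。"""
    return -math.log(sigmoid(r_chosen - r_rejected))

# 奖励分差方向正确且越大,损失越小
losses = [preference_loss(2.0, 0.0), preference_loss(0.0, 0.0), preference_loss(0.0, 2.0)]
```

这类损失在 8 万余对偏好数据上优化后,得到的标量奖励即可直接作为在线强化学习的反馈信号。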
[CV-7] MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy CVPR2026
【速读】:该论文旨在解决显微成像中多尺度结构分析的难题,即如何有效融合不同空间分辨率下的图像信息以提升模型对复杂组织结构的理解能力。传统视觉模型通常仅在单一分辨率下运行或从单视角提取多尺度特征,难以充分利用显微图像固有的多分辨率特性。其解决方案的关键在于提出MuViT架构,通过将所有图像块(patch)映射到统一的世界坐标系(world-coordinate system),并扩展旋转位置编码(rotary positional embeddings)至该坐标系,使注意力机制能够在单一编码器内整合宽视野上下文与高分辨率细节,从而实现真正意义上的多分辨率信息融合。
链接: https://arxiv.org/abs/2602.24222
作者: Albert Dominguez Mantes,Gioele La Manno,Martin Weigert
机构: Swiss Federal Institute of Technology (EPFL); Technische Universität Dresden (德累斯顿工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2026
Abstract:Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
[CV-8] SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching
【速读】:该论文旨在解决扩散模型(Diffusion Models)在视频生成任务中推理成本高昂的问题,尤其是由于需要大量顺序去噪步骤导致的计算开销。现有免训练(training-free)加速方法中的缓存(caching)策略依赖启发式准则选择缓存或重用的时间步(timestep),且需大量调参,缺乏理论依据和自适应能力。解决方案的关键在于提出了一种基于敏感性感知的缓存框架——SenCache,其核心是通过分析模型输出对去噪输入(噪声潜变量与时间步)扰动的敏感性,将该敏感性作为缓存误差的预测指标,并据此动态地为每个样本自适应选择最优缓存时间步,从而在保持视觉质量的前提下显著提升计算效率。
链接: https://arxiv.org/abs/2602.24208
作者: Yasaman Haghighi,Alexandre Alahi
机构: École Polytechnique Fédérale de Lausanne (EPFL)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models achieve state-of-the-art video generation quality, but their inference remains expensive due to the large number of sequential denoising steps. This has motivated a growing line of research on accelerating diffusion inference. Among training-free acceleration methods, caching reduces computation by reusing previously computed model outputs across timesteps. Existing caching methods rely on heuristic criteria to choose cache/reuse timesteps and require extensive tuning. We address this limitation with a principled sensitivity-aware caching framework. Specifically, we formalize the caching error through an analysis of the model output sensitivity to perturbations in the denoising inputs, i.e., the noisy latent and the timestep, and show that this sensitivity is a key predictor of caching error. Based on this analysis, we propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. Our framework provides a theoretical basis for adaptive caching, explains why prior empirical heuristics can be partially effective, and extends them to a dynamic, sample-specific approach. Experiments on Wan 2.1, CogVideoX, and LTX-Video show that SenCache achieves better visual quality than existing caching methods under similar computational budgets.
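“敏感性 × 输入变化 ≈ 缓存误差”的思路可以抽象成如下调度草图(玩具示例:恒等模型、常数敏感性函数与更新规则均为假设,与 SenCache 的具体推导无关,仅演示按阈值决定复用或重算):

```python
def denoise_with_cache(model, x, timesteps, sensitivity, tol):
    """当 敏感性(t) * |输入漂移| 低于阈值 tol 时复用上次输出,
    否则重新调用模型;返回各步输出与实际模型调用次数。"""
    cached_out = cached_x = None
    calls, outs = 0, []
    for t in timesteps:
        if cached_out is not None and sensitivity(t) * abs(x - cached_x) < tol:
            out = cached_out                  # 复用缓存,跳过一次昂贵的前向计算
        else:
            out = model(x, t)                 # 重新计算并刷新缓存
            calls += 1
            cached_out, cached_x = out, x
        outs.append(out)
        x = x - 0.1 * out                     # 玩具去噪更新
    return outs, calls

# 恒等“模型”与常数敏感性,仅用于演示调用次数的节省
outs, calls = denoise_with_cache(lambda x, t: x, 1.0, range(10), lambda t: 1.0, tol=0.15)
```

在这个玩具设定下,10 个时间步只需 4 次模型调用;SenCache 的贡献在于为“何时可安全复用”给出基于敏感性分析的理论判据,并逐样本自适应。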
[CV-9] A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification
【速读】:该论文旨在解决医疗图像分类模型在实际应用中安全性与可靠性不足的问题,尤其是现有审计方法依赖单一模态特征或基于元数据的子群分析时,在可解释性方面存在局限且难以发现隐匿的系统性失效。其解决方案的关键在于提出首个自动化审计框架,该框架将切片发现(slice discovery)方法扩展至多模态表示空间,专门用于医疗场景下的模型审计;实验表明,该方法能有效提升故障发现能力和解释生成质量,且多模态信息相较单模态输入更具全面性和有效性,尤其在资源受限场景下,非图像模态的补充输入也展现出显著潜力。
链接: https://arxiv.org/abs/2602.24183
作者: Yixuan Liu,Kanwal K. Bhatia,Ahmed E. Fetit
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Despite advances in machine learning-based medical image classifiers, the safety and reliability of these systems remain major concerns in practical settings. Existing auditing approaches mainly rely on unimodal features or metadata-based subgroup analyses, which are limited in interpretability and often fail to capture hidden systematic failures. To address these limitations, we introduce the first automated auditing framework that extends slice discovery methods to multimodal representations specifically for medical applications. Comprehensive experiments were conducted under common failure scenarios using the MIMIC-CXR-JPG dataset, demonstrating the framework’s strong capability in both failure discovery and explanation generation. Our results also show that multimodal information generally allows more comprehensive and effective auditing of classifiers, while unimodal variants beyond image-only inputs exhibit strong potential in scenarios where resources are constrained.
[CV-10] A Mixed Diet Makes DINO An Omnivorous Vision Encoder CVPR2026
【速读】:该论文旨在解决预训练视觉编码器(如DINOv2)在跨模态场景下特征表示对齐不佳的问题,即同一场景的不同模态输入(如RGB图像与深度图)所生成的特征嵌入之间相似度接近随机无关图像的水平。其解决方案的关键在于提出一种名为“Omnivorous Vision Encoder”的新框架,通过双目标训练策略实现:一是最大化同一场景不同模态间的特征对齐,二是引入蒸馏目标将学习到的表征锚定至一个完全冻结的教师模型(如DINOv2)输出,从而构建一个模态无关的特征空间,使编码器无论输入何种模态(RGB、深度、分割等)都能输出一致且强大的场景嵌入,兼顾跨模态理解能力与原基础模型的判别语义特性。
链接: https://arxiv.org/abs/2602.24181
作者: Rishabh Kabra,Maks Ovsjanikov,Drew A. Hudson,Ye Xia,Skanda Koppula,Andre Araujo,Joao Carreira,Niloy J. Mitra
机构: Google DeepMind; University College London
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes “omnivorous” by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
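摘要所述的双目标(跨模态对齐 + 向冻结教师蒸馏)可以写成一个非常简单的损失组合。以下为基于余弦相似度的示意实现(相似度度量、权重与向量均为假设,非论文的实际损失形式):

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def omnivorous_loss(f_rgb, f_depth, teacher_rgb, w_align=1.0, w_distill=1.0):
    """目标一:拉近同一场景不同模态的学生特征;
    目标二:把学生特征锚定到冻结教师(如 DINOv2)的输出上。"""
    align = 1.0 - cosine(f_rgb, f_depth)        # 跨模态对齐项
    distill = 1.0 - cosine(f_rgb, teacher_rgb)  # 蒸馏锚定项
    return w_align * align + w_distill * distill

perfect = omnivorous_loss((1.0, 0.0), (1.0, 0.0), (1.0, 0.0))     # 完全对齐时损失为 0
misaligned = omnivorous_loss((1.0, 0.0), (0.0, 1.0), (1.0, 0.0))  # 模态错位时对齐项为 1
```

蒸馏项正是摘要开头观察的对症下药:它防止学生在追求跨模态对齐时丢失教师原有的判别语义。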
[CV-11] GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction
【速读】:该论文旨在解决从单张肖像图像中重建高保真、可动画化的4D头部虚拟形象(4D head avatars)这一计算机视觉中的基础挑战。现有方法主要依赖2D先验,难以保证3D几何的一致性。其解决方案的关键在于提出一种基于几何感知扩散模型(geometry-aware diffusion)的新框架,通过联合生成肖像图像与对应的表面法向量(surface normals),并引入无需姿态信息的表达编码器(pose-free expression encoder)来捕捉隐式的表情表征;这些合成数据与表达潜变量被整合进基于3D高斯的虚拟形象中,从而实现兼具精确几何结构和逼真渲染效果的高质量头部重建,并支持实时渲染。
链接: https://arxiv.org/abs/2602.24161
作者: Chao Xu,Xiaochen Zhao,Xiang Deng,Jingxiang Sun,Zhuo Su,Donglin Di,Yebin Liu
机构: Tsinghua University (清华大学); Bytedance (字节跳动); Li Auto (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages
Abstract:Reconstructing photorealistic and animatable 4D head avatars from a single portrait image remains a fundamental challenge in computer vision. While diffusion models have enabled remarkable progress in image and video generation for avatar reconstruction, existing methods primarily rely on 2D priors and struggle to achieve consistent 3D geometry. We propose a novel framework that leverages geometry-aware diffusion to learn strong geometry priors for high-fidelity head avatar reconstruction. Our approach jointly synthesizes portrait images and corresponding surface normals, while a pose-free expression encoder captures implicit expression representations. Both synthesized images and expression latents are incorporated into 3D Gaussian-based avatars, enabling photorealistic rendering with accurate geometry. Extensive experiments demonstrate that our method substantially outperforms state-of-the-art approaches in visual quality, expression fidelity, and cross-identity generalization, while supporting real-time rendering.
[CV-12] Manifold-Preserving Superpixel Hierarchies and Embeddings for the Exploration of High-Dimensional Images
【速读】:该论文旨在解决高维图像(high-dimensional images)在探索过程中,传统分层降维方法因仅基于属性信息构建层次结构而忽略像素空间布局的问题,导致图像空间中的兴趣区域与属性抽象之间缺乏一致性。其解决方案的关键在于提出一种图像引导的超像素层次结构(superpixel hierarchy),在构建过程中同时考虑高维属性流形(attribute manifold)和图像的空间拓扑关系,从而实现图像空间与属性空间的协同一致探索。
链接: https://arxiv.org/abs/2602.24160
作者: Alexander Vieth,Boudewijn Lelieveldt,Elmar Eisemann,Anna Vilanova,Thomas Höllt
机构: Leiden University Medical Center (莱顿大学医学中心); Delft University of Technology (代尔夫特理工大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages main paper, 8 pages supplemental material
Abstract:High-dimensional images, or images with a high-dimensional attribute vector per pixel, are commonly explored with coordinated views of a low-dimensional embedding of the attribute space and a conventional image representation. Nowadays, such images can easily contain several million pixels. For such large datasets, hierarchical embedding techniques are better suited to represent the high-dimensional attribute space than flat dimensionality reduction methods. However, available hierarchical dimensionality reduction methods construct the hierarchy purely based on the attribute information and ignore the spatial layout of pixels in the images. This impedes the exploration of regions of interest in the image space, since there is no congruence between a region of interest in image space and the associated attribute abstractions in the hierarchy. In this paper, we present a superpixel hierarchy for high-dimensional images that takes the high-dimensional attribute manifold into account during construction. Through this, our method enables consistent exploration of high-dimensional images in both image and attribute space. We show the effectiveness of this new image-guided hierarchy in the context of embedding exploration by comparing it with classical hierarchical embedding-based image exploration in two use cases.
[CV-13] RAViT: Resolution-Adaptive Vision Transformer
【速读】:该论文旨在解决视觉Transformer(Vision Transformer)在图像分类任务中计算成本过高这一问题,相较于卷积神经网络(Convolutional Neural Networks, CNNs)而言,其推理复杂度显著增加。解决方案的关键在于提出一种多分支架构(multi-branch network)——RAViT,该架构逐级处理同一图像的不同分辨率副本,在保持整体准确率的同时降低计算开销;同时引入早期退出机制(early exit mechanism),使模型能够在运行时动态选择精度与计算成本之间的平衡点,从而实现更高效的推理过程。例如,在双分支结构中,低分辨率图像首先由一个轻量级Transformer进行预测,并将结果与原始尺寸图像结合,用于次级Transformer的最终预测,显著减少浮点运算次数(FLOPs)。实验表明,在CIFAR-10、Tiny ImageNet和ImageNet数据集上可达到与传统Vision Transformer相当的精度,仅需约70%的FLOPs。
链接: https://arxiv.org/abs/2602.24159
作者: Martial Guidez,Stefan Duffner,Christophe Garcia
机构: INSA Lyon (里昂国立应用科学学院); CNRS (法国国家科学研究中心); Université Claude Bernard Lyon 1 (里昂第一大学); LIRIS, UMR5205 (图像与信息系统信息学实验室,联合研究单位5205)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision transformers have recently made a breakthrough in computer vision showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT based on a multi-branch network that operates on several copies of the same image with different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early exit mechanism that makes our model adaptive and allows to choose the appropriate trade-off between accuracy and computational cost at run-time. For example in a two-branch architecture, the original image is first resized to reduce its resolution, then a prediction is performed on it using a first transformer and the resulting prediction is reused together with the original-size image to perform a final prediction on a second transformer with less computation than a classical Vision transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet. We obtained an equivalent accuracy to the classical Vision transformer model with only around 70% of FLOPs.
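RAViT 的多分支加提前退出的推理流程可以抽象为如下调度逻辑(示意代码:玩具分支函数与置信度阈值均为假想;真实模型中上一分支的预测会与更高分辨率图像一起送入下一分支的 Transformer):

```python
def predict_adaptive(branches, pyramid, threshold):
    """依次运行由粗到细的各分支;一旦最大类别概率达到阈值就提前退出,
    否则把当前预测传给下一分支继续精化。返回 (概率, 实际使用的分支数)。"""
    prev = None
    probs = None
    used = 0
    for branch, image in zip(branches, pyramid):
        probs = branch(image, prev)   # 低分辨率分支计算量小
        used += 1
        if max(probs) >= threshold:
            break                     # 提前退出,省去后续分支的计算
        prev = probs
    return probs, used

coarse = lambda img, prev: (0.60, 0.40)   # 玩具分支:低分辨率,不够自信
fine = lambda img, prev: (0.95, 0.05)     # 玩具分支:全分辨率,自信

probs_hi, used_hi = predict_adaptive([coarse, fine], ["low_res", "full_res"], 0.9)
probs_lo, used_lo = predict_adaptive([coarse, fine], ["low_res", "full_res"], 0.5)
```

阈值即摘要中“运行时可调的精度-计算折中”:阈值越低,越多样本在低分辨率分支就提前退出。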
[CV-14] HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation CVPR2026
【速读】:该论文旨在解决从单张输入图像生成完整360°环绕视频时存在的视角不一致和身份失真问题(即现有基于图像的扩散模型在多视角合成中难以保持几何一致性与人物身份的一致性)。其解决方案的关键在于提出HumanOrbit——一个专门用于多视角人体图像生成的视频扩散模型(video diffusion model),该模型能够合成连续的相机旋转视频,确保新视角之间几何一致并保留人物外观与身份特征;进一步地,作者设计了一个重建流水线,利用生成的多视角帧恢复出具有更高完整性和保真度的纹理化三维网格模型。
链接: https://arxiv.org/abs/2602.24148
作者: Keito Suzuki,Kunyao Chen,Lei Wang,Bang Du,Runfa Blark Li,Peng Liu,Ning Bi,Truong Nguyen
机构: University of California, San Diego (加州大学圣地亚哥分校); Qualcomm (高通)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings
Abstract:We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but yield inconsistent results across views and with the original identity. In contrast, recent video diffusion models have demonstrated their ability in generating photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and that the reconstructed 3D models exhibit superior completeness and fidelity compared to those from state-of-the-art baselines.
[CV-15] Fixed Anchors Are Not Enough: Dynamic Retrieval and Persistent Homology for Dataset Distillation CVPR2026
【速读】:该论文旨在解决解耦式数据蒸馏(Decoupled Dataset Distillation, DD)中存在的拟合复杂度差距(fit-complexity gap)和拉锚效应(pull-to-anchor effect)问题,这些问题导致合成图像的类内多样性不足并损害模型泛化性能。其解决方案的关键在于提出一种名为RETA(Retrieval and Topology Alignment)的框架:首先通过动态检索连接(Dynamic Retrieval Connection, DRC)在教师特征空间中最小化拟合复杂度得分,选择最优真实图像块进行残差注入,从而在提升特征拟合精度的同时控制引入的复杂度;其次利用持久同调(Persistent Homology)对真实与合成样本集的拓扑结构进行一致性约束,通过构建互为k近邻的特征图并计算持久图像来惩罚组件与环路的拓扑差异,有效缓解拉锚效应,提升类内多样性与泛化能力。
链接: https://arxiv.org/abs/2602.24144
作者: Muquan Li,Hang Gou,Yingyi Ma,Rongzheng Wang,Ke Qin,Tao He
机构: UESTC(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:Decoupled dataset distillation (DD) compresses large corpora into a few synthetic images by matching a frozen teacher’s statistics. However, current residual-matching pipelines rely on static real patches, creating a fit-complexity gap and a pull-to-anchor effect that reduce intra-class diversity and hurt generalization. To address these issues, we introduce RETA – a Retrieval and Topology Alignment framework for decoupled DD. First, Dynamic Retrieval Connection (DRC) selects a real patch from a prebuilt pool by minimizing a fit-complexity score in teacher feature space; the chosen patch is injected via a residual connection to tighten feature fit while controlling injected complexity. Second, Persistent Topology Alignment (PTA) regularizes synthesis with persistent homology: we build a mutual k-NN feature graph, compute persistence images of components and loops, and penalize topology discrepancies between real and synthetic sets, mitigating pull-to-anchor effect. Across CIFAR-100, Tiny-ImageNet, ImageNet-1K, and multiple ImageNet subsets, RETA consistently outperforms various baselines under comparable time and memory, especially reaching 64.3% top-1 accuracy on ImageNet-1K with ResNet-18 at 50 images per class, +3.1% over the best prior.
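PTA 中作为持久同调输入的“互为 k 近邻图”本身很容易示意(以下只演示建图这一步,点集为假想数据;持久图的计算依赖专门的拓扑计算库,此处从略):

```python
def mutual_knn_edges(points, k):
    """互为 k 近邻图:仅当 j 在 i 的 k 近邻中、且 i 也在 j 的
    k 近邻中时,才连一条无向边 (i, j)。"""
    n = len(points)
    def sqdist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    neigh = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: sqdist(i, j))
        neigh.append(set(order[:k]))
    return {(i, j) for i in range(n) for j in neigh[i] if i < j and i in neigh[j]}

# 两个彼此远离的紧凑簇:互近邻边只出现在簇内,不会跨簇
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
edges = mutual_knn_edges(pts, k=1)
```

“互为近邻”这一对称约束抑制了跨簇的伪连接,使图的连通分量与环路更忠实地反映特征分布的拓扑结构,这正是后续持久同调比较真实集与合成集的前提。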
[CV-16] Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics
【速读】:该论文旨在解决手术阶段与步骤识别任务中依赖大规模预训练所带来的计算和数据收集成本过高的问题。现有方法通常需要在数千个标注的手术视频上进行预训练,再迁移至特定术式,但这种策略效率低下且资源消耗大。解决方案的关键在于提出一种无监督的多模态最优传输方法——文本增强的动作分割最优传输(Text-Augmented Action Segmentation Optimal Transport, TASOT),其核心创新是将视频帧的视觉特征与直接从视频中生成的文本语义信息联合建模为一个统一的最优传输问题,通过加权融合视觉相似性和文本语义代价,并引入时序一致性的非平衡Gromov-Wasserstein正则化,实现无需手术特异性预训练或外部大规模监督即可精确对齐视频帧与手术动作。实验表明,TASOT在多个基准数据集上显著优于现有零样本方法,验证了利用标准视觉与文本表示即可实现精细手术理解的可行性。
链接: https://arxiv.org/abs/2602.24138
作者: Omar Mohamed,Edoardo Fazzari,Ayah Al-Naji,Hamdan Alhadhrami,Khalfan Hableel,Saif Alkindi,Cesare Stefanini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recognizing surgical phases and steps from video is a fundamental problem in computer-assisted interventions. Recent approaches increasingly rely on large-scale pre-training on thousands of labeled surgical videos, followed by zero-shot transfer to specific procedures. While effective, this strategy incurs substantial computational and data collection costs. In this work, we question whether such heavy pre-training is truly necessary. We propose Text-Augmented Action Segmentation Optimal Transport (TASOT), an unsupervised method for surgical phase and step recognition that extends Action Segmentation Optimal Transport (ASOT) by incorporating textual information generated directly from the videos. TASOT formulates temporal action segmentation as a multimodal optimal transport problem, where the matching cost is defined as a weighted combination of visual and text-based costs. The visual term captures frame-level appearance similarity, while the text term provides complementary semantic cues, and both are jointly regularized through a temporally consistent unbalanced Gromov-Wasserstein formulation. This design enables effective alignment between video frames and surgical actions without surgical-specific pretraining or external web-scale supervision. We evaluate TASOT on multiple benchmark surgical datasets and observe consistent and substantial improvements over existing zero-shot methods, including StrasBypass70 (+23.7), BernBypass70 (+4.5), Cholec80 (+16.5), and AutoLaparo (+19.6). These results demonstrate that fine-grained surgical understanding can be achieved by exploiting information already present in standard visual and textual representations, without resorting to increasingly complex pre-training pipelines. The code will be available at this https URL.
[CV-17] Prune Wisely Reconstruct Sharply: Compact 3D Gaussian Splatting via Adaptive Pruning and Difference-of-Gaussian Primitives CVPR2026
【Quick Read】: This paper tackles the redundant representation and high resource consumption caused by the large number of Gaussian primitives that 3D Gaussian Splatting (3DGS) requires in complex or large-scale scenes, which limits its scalability. The key to the solution is a reconstruction-aware adaptive pruning strategy that dynamically determines pruning timing and refinement intervals based on reconstruction quality, reducing model size while improving rendering quality; in addition, a 3D Difference-of-Gaussians (3D DoG) primitive jointly models positive and negative densities within a single primitive, substantially improving the expressiveness of Gaussian primitives under compact configurations.
Link: https://arxiv.org/abs/2602.24136
Authors: Haoran Wang, Guoxi Huang, Fan Zhang, David Bull, Nantheera Anantrasirichai
Affiliations: University of Bristol
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Recent significant advances in 3D scene representation have been driven by 3D Gaussian Splatting (3DGS), which has enabled real-time rendering with photorealistic quality. 3DGS often requires a large number of primitives to achieve high fidelity, leading to redundant representations and high resource consumption, thereby limiting its scalability for complex or large-scale scenes. Consequently, effective pruning strategies and more expressive primitives that can reduce redundancy while preserving visual quality are crucial for practical deployment. We propose an efficient, integrated reconstruction-aware pruning strategy that adaptively determines pruning timing and refining intervals based on reconstruction quality, thus reducing model size while enhancing rendering quality. Moreover, we introduce a 3D Difference-of-Gaussians primitive that jointly models both positive and negative densities in a single primitive, improving the expressiveness of Gaussians under compact configurations. Our method significantly improves model compactness, achieving up to 90% reduction in Gaussian-count while delivering visual quality that is similar to, or in some cases better than, that produced by state-of-the-art methods. Code will be made publicly available.
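The Difference-of-Gaussians primitive described above evaluates a signed density as a positive Gaussian lobe minus a negative one. A minimal sketch of that idea, with isotropic lobes and illustrative weights/scales (the paper's actual anisotropic parameterization is not shown here):

```python
import numpy as np

def dog_density(x, mu, w_pos, sigma_pos, w_neg, sigma_neg):
    """Signed density of a toy Difference-of-Gaussians primitive:
    a narrow positive lobe minus a wider negative lobe, same center."""
    r2 = np.sum((x - mu) ** 2, axis=-1)
    pos = w_pos * np.exp(-0.5 * r2 / sigma_pos**2)
    neg = w_neg * np.exp(-0.5 * r2 / sigma_neg**2)
    return pos - neg

mu = np.zeros(3)
# Positive at the center, negative further out: one primitive can both
# add and carve away density, unlike a single Gaussian.
center = dog_density(mu, mu, w_pos=1.0, sigma_pos=0.5, w_neg=0.4, sigma_neg=1.0)
ring = dog_density(np.array([1.5, 0.0, 0.0]), mu,
                   w_pos=1.0, sigma_pos=0.5, w_neg=0.4, sigma_neg=1.0)
print(center, ring)
```

The sign flip between the center and the surrounding shell is what gives a single DoG primitive more expressive power than a single Gaussian of the same cost.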
[CV-18] FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking ACM-MM2025
【Quick Read】: This paper addresses two core problems of existing two-stage motion-centric methods for 3D point cloud object tracking: error accumulation caused by decoupling explicit foreground segmentation from motion estimation, and the computational bottleneck of sequential processing. The key to the solution is FocusTrack, a novel one-stage framework that jointly models motion and semantics through two core techniques: Inter-frame Motion Modeling (IMM), which uses a temporal-difference siamese encoder to capture global motion patterns between adjacent frames, and a Focus-and-Suppress attention that enhances foreground semantics via motion-salient feature gating and suppresses background noise using the temporally aware motion context provided by IMM. This enables end-to-end training and efficient inference without explicit segmentation, achieving new SOTA performance on mainstream 3D tracking benchmarks such as KITTI, nuScenes, and Waymo while running at 105 FPS.
Link: https://arxiv.org/abs/2602.24133
Authors: Sifan Zhou, Jiahao Nie, Ziyu Zhao, Yichao Cao, Xiaobo Lu
Affiliations: Southeast University; Zhejiang University of Finance and Economics; Central South University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACM MM 2025
Abstract:In 3D point cloud object tracking, motion-centric methods have emerged as a promising avenue due to their superior performance in modeling inter-frame motion. However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage tracking framework that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. The IMM module employs a temporal-difference siamese encoder to capture global motion patterns between adjacent frames. The Focus-and-Suppress attention enhances the foreground semantics via motion-salient feature gating and suppresses the background noise based on the temporal-aware motion context from IMM, without explicit segmentation. Based on these two designs, FocusTrack enables end-to-end training with a compact one-stage pipeline. Extensive experiments on prominent 3D tracking benchmarks, such as KITTI, nuScenes, and Waymo, demonstrate that FocusTrack achieves new SOTA performance while running at a high speed of 105 FPS.
[CV-19] DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer DATE
【Quick Read】: This paper addresses the visual artifacts in novel views, the unrealistic insertion of dynamic objects, and the cross-scene lighting inconsistencies that afflict neural-reconstruction-based simulation, all of which limit the realism and usability of simulated environments for applications such as autonomous driving. The key to the solution is DiffusionHarmonizer, an online generative enhancement framework whose core is a single-step temporally-conditioned enhancer converted from a pretrained multi-step image diffusion model, capable of running in real time on a single GPU; a custom data curation pipeline constructs synthetic-real training pairs emphasizing appearance consistency, artifact correction, and lighting realism, significantly improving the fidelity and practicality of the simulation system.
Link: https://arxiv.org/abs/2602.24096
Authors: Yuxuan Zhang, Katarína Tóthová, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Zan Gojcic
Affiliations: NVIDIA; University of Toronto; Cornell University; Technion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: For more details and updates, please visit our project website: this https URL
Abstract:Simulation is essential to the development and evaluation of autonomous robots such as self-driving vehicles. Neural reconstruction is emerging as a promising solution as it enables simulating a wide variety of scenarios from real-world data alone in an automated and scalable way. However, while methods such as NeRF and 3D Gaussian Splatting can produce visually compelling results, they often exhibit artifacts particularly when rendering novel views, and fail to realistically integrate inserted dynamic objects, especially when they were captured from different scenes. To overcome these limitations, we introduce DiffusionHarmonizer, an online generative enhancement framework that transforms renderings from such imperfect scenes into temporally consistent outputs while improving their realism. At its core is a single-step temporally-conditioned enhancer that is converted from a pretrained multi-step image diffusion model, capable of running in online simulators on a single GPU. The key to training it effectively is a custom data curation pipeline that constructs synthetic-real pairs emphasizing appearance harmonization, artifact correction, and lighting realism. The result is a scalable system that significantly elevates simulation fidelity in both research and production environments.
[CV-20] FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting CVPR2026
【Quick Read】: This paper addresses the significant performance degradation of current boundary-representation (B-rep) learning methods under arbitrary rotations. Existing methods rely on absolute coordinates and normals to encode global context, making them highly rotation-sensitive: experiments show that models reaching over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. The key to the solution is FoV-Net, which captures the local surface geometry of each face with Local Reference Frame (LRF) UV-grids and models global structural context with Field-of-View (FoV) grids that cast rays and record intersections with neighboring faces; lightweight CNNs extract per-face features, which a graph attention network propagates over the B-rep graph, yielding rotation-invariant learning, markedly better robustness, and reduced training-data requirements.
Link: https://arxiv.org/abs/2602.24084
Authors: Matteo Ballegeer, Dries F. Benoit
Affiliations: Ghent University; CVAMO, FlandersMake@UGent
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Manuscript accepted at CVPR 2026
Abstract:Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary SO(3) rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
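The rotation sensitivity measured in this abstract can be reproduced in miniature: a descriptor built from absolute coordinates changes under an SO(3) rotation, while one built from rotation-invariant quantities does not. Here, sorted distances to the centroid stand in for the paper's LRF/FoV construction; all names and sizes are illustrative.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random rotation matrix via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs of the orthogonal factor
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]        # ensure det = +1 (rotation, not reflection)
    return q

rng = np.random.default_rng(42)
face_points = rng.normal(size=(64, 3))   # sampled points of one hypothetical B-rep face

def absolute_descriptor(pts):
    return pts.ravel()                   # raw coordinates: rotation-sensitive

def invariant_descriptor(pts):
    d = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    return np.sort(d)                    # centroid distances: rotation-invariant

R = random_rotation(rng)
rotated = face_points @ R.T              # rotate every point

sens = np.linalg.norm(absolute_descriptor(face_points) - absolute_descriptor(rotated))
inv = np.linalg.norm(invariant_descriptor(face_points) - invariant_descriptor(rotated))
print(sens, inv)
```

The absolute descriptor moves by a large margin under the rotation, while the invariant one is unchanged up to floating-point error, which is the property FoV-Net builds its face encoding around.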
[CV-21] EvalMVX: A Unified Benchmarking for Neural 3D Reconstruction under Diverse Multiview Setups
【Quick Read】: This paper addresses the absence of a unified real-world dataset for quantitatively evaluating multiview 3D reconstruction (MVX) methods, in particular the missing comparison among RGB-based multiview stereo (MVS), multiview shape from polarization (MVSfP), and multiview photometric stereo (MVPS) under complex geometry and reflectance. The key to the solution is the EvalMVX dataset, which contains 25 objects captured from 20 views under 17 lighting conditions (including OLAT and natural illumination), 8,500 images in total, with an aligned ground-truth 3D mesh for each object. This enables a systematic quantitative comparison of 13 recent MVX methods, clarifying the working range and limitations of each class of techniques across diverse geometric details and reflectance types.
Link: https://arxiv.org/abs/2602.24065
Authors: Zaiyan Yang, Jieji Ren, Xiangyi Wang, Zonglin Li, Xu Cao, Heng Guo, Zhanyu Ma, Boxin Shi
Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; Independent Researcher; Xiong'an Aerospace Information Research Institute; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in neural surface reconstruction have significantly enhanced 3D reconstruction. However, current real-world datasets mainly focus on benchmarking multiview stereo (MVS) based on RGB inputs. Multiview photometric stereo (MVPS) and multiview shape from polarization (MVSfP), though indispensable for high-fidelity surface reconstruction and for sparse inputs, have not been quantitatively assessed together with MVS. To determine the working range of different MVX (MVS, MVSfP, and MVPS) techniques, we propose EvalMVX, a real-world dataset containing 25 objects, each captured with a polarized camera under 20 varying views and 17 light conditions including OLAT and natural illumination, leading to 8,500 images. Each object includes an aligned ground-truth 3D mesh, facilitating quantitative benchmarking of MVX methods simultaneously. Based on our EvalMVX, we evaluate 13 MVX methods published in recent years, record the best-performing methods, and identify open problems under diverse geometric details and reflectance types. We hope EvalMVX and the benchmarking results can inspire future research on multiview 3D reconstruction.
[CV-22] Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization CVPR2026
【Quick Read】: This paper addresses the insufficient quantization accuracy of existing post-training quantization (PTQ) methods for compressing vision-language models (VLMs), caused by ignoring distributional differences among important channels. Current methods rely on static identification and global compensation of sensitive or outlier channels, neglecting how the distribution and occurrence frequency of important channels vary dynamically across modalities and tokens. The key to the solution is Quant Experts (QE), a token-aware adaptive error-compensation mechanism based on a mixture of experts: important channels are split into token-independent and token-dependent groups; a shared expert compensates the global quantization error of the former with a low-rank adapter, while routed experts (multiple routed low-rank adapters) compensate the local quantization error of the latter for specific tokens, achieving more accurate quantization.
Link: https://arxiv.org/abs/2602.24059
Authors: Chenwei Jia, Baoting Li, Xuchong Zhang, Mingzhuo Wei, Bochen Lin, Hongbin Sun
Affiliations: Xi'an Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 6 figures, including appendix. Accepted at CVPR 2026
Abstract:Post-Training Quantization (PTQ) has emerged as an effective technique for alleviating the substantial computational and memory overheads of Vision-Language Models (VLMs) by compressing both weights and activations without retraining the full model. Existing PTQ methods primarily rely on static identification and global compensation of sensitive or outlier channels, yet they often overlook the distributional differences of these important channels across inputs, leading to unsatisfactory quantization. In this work, we observe that the distributions and occurrence frequencies of important channels vary significantly both across modalities and among tokens, even within the same modality. Accordingly, we propose Quant Experts (QE), a token-aware adaptive error compensation with mixture-of-experts for VLMs quantization. QE divides the important channels into token-independent and token-dependent groups. For the former, a shared expert is designed for most tokens to compensate for global quantization error using a low-rank adapter. For the latter, routed experts including multiple routed low-rank adapters are elaborated to compensate for local quantization error related to specific tokens. Extensive experiments demonstrate that QE consistently enhances task accuracy across various quantization settings and model scales, ranging from 2B to 70B parameters, while maintaining performance comparable to full-precision models.
[CV-23] Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates
【Quick Read】: This paper tackles the difficulty of high-fidelity 3D clothed-human reconstruction from monocular images and videos, in particular the open challenge of reconstructing the geometry of loose-fitting garments. The core of the solution is a unified framework that combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space; a mapping model then establishes correspondences among image pixels, UV coordinates, and 3D geometry, enabling accurate, detail-rich garment reconstruction from a single image. The framework is further extended to dynamic reconstruction via a spatio-temporal diffusion scheme with test-time guidance, and analytic projection-based constraints preserve image-aligned geometry in visible regions while enforcing temporally consistent completion of occluded areas.
Link: https://arxiv.org/abs/2602.24043
Authors: Yingxuan You, Ren Li, Corentin Dumery, Cong Cao, Hao Li, Pascal Fua
Affiliations: CVLab, École Polytechnique Fédérale de Lausanne (EPFL); Mohamed bin Zayed University of Artificial Intelligence; Southern University of Science and Technology; Pinscreen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2504.08353
Abstract:Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.
[CV-24] Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation ICLR2026
【Quick Read】: This paper addresses hallucination in multimodal large language models (MLLMs) during vision-language reasoning, where generated content deviates from the visual evidence. Existing methods either require costly training-time supervision or add inference latency; recent vision-enhancement methods reinforce visual tokens during decoding but inject all tokens indiscriminately, causing interference from background regions and distracting attention from critical cues. The key to the solution is Adaptive Visual Reinforcement (AIR), a training-free framework with two components: prototype-based token reduction condenses redundant visual tokens into a compact subset, and optimal-transport-guided (OT-guided) patch reinforcement quantifies the alignment between hidden states and patch embeddings, injecting only the most consistent patches into the feed-forward layers. AIR strengthens the model's reliance on salient visual information, effectively mitigating hallucination while preserving general capabilities and offering an efficient path toward reliable MLLMs.
Link: https://arxiv.org/abs/2602.24041
Authors: Xingyu Zhu, Kesen Zhao, Liang Yi, Shuo Wang, Zhicai Wang, Beier Zhu, Hanwang Zhang
Affiliations: University of Science and Technology of China; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2026
Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language reasoning, yet they remain vulnerable to hallucination, where generated content deviates from visual evidence. Existing mitigation strategies either require costly supervision during training or introduce additional latency at inference time. Recent vision enhancement methods attempt to address this issue by reinforcing visual tokens during decoding, but they typically inject all tokens indiscriminately, which causes interference from background regions and distracts the model from critical cues. To overcome this challenge, we propose Adaptive Visual Reinforcement (AIR), a training-free framework for MLLMs. AIR consists of two components. Prototype-based token reduction condenses the large pool of visual tokens into a compact subset to suppress redundancy. OT-guided patch reinforcement quantifies the alignment between hidden states and patch embeddings to selectively integrate the most consistent patches into feed-forward layers. As a result, AIR enhances the model’s reliance on salient visual information and effectively mitigates hallucination. Extensive experiments across representative MLLMs demonstrate that AIR substantially reduces hallucination while preserving general capabilities, establishing it as an effective solution for building reliable MLLMs.
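The prototype-based token reduction described for AIR can be sketched as a clustering step that condenses many visual tokens into a few prototypes before any further processing. Plain Lloyd's k-means is used here as a stand-in; the number of prototypes, token dimensions, and the distance metric are all illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def reduce_tokens(tokens, k, n_iters=20, seed=0):
    """Condense n visual tokens (n, d) into k prototype tokens (k, d)
    via k-means, so downstream layers see a compact, low-redundancy subset."""
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each token to its nearest prototype, then recenter.
        assign = np.linalg.norm(tokens[:, None] - centers[None], axis=-1).argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

rng = np.random.default_rng(1)
visual_tokens = rng.normal(size=(576, 32))   # e.g. a 24x24 patch grid of 32-d tokens
prototypes = reduce_tokens(visual_tokens, k=16)
print(prototypes.shape)
```

The 576-to-16 compression mirrors the motivation in the abstract: suppress redundancy first, so that later selective reinforcement operates on a small pool of representative tokens.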
[CV-25] GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models ICLR2026
【Quick Read】: This paper addresses the safe deployment of large vision-language models (LVLMs) in vision-language reasoning, in particular the inaccurate detection of existing input-side defenses in complex scenes and the instability of safety signals during generation. The key to the solution is GuardAlign, a training-free defense framework with two innovations: (1) optimal-transport-based (OT-based) safety detection, which accurately identifies malicious regions by measuring distributional distances between image patches and unsafe semantics, at no extra computational cost; and (2) cross-modal attention calibration, which adaptively reallocates attention across layers to strengthen the influence of safety prefixes, keeping safety signals consistently activated throughout generation and thus substantially improving safety while preserving model performance.
Link: https://arxiv.org/abs/2602.24027
Authors: Xingyu Zhu, Beier Zhu, Junfeng Fang, Shuo Wang, Yin Zhang, Xiang Wang, Xiangnan He
Affiliations: University of Science and Technology of China; Nanyang Technological University; National University of Singapore; Tianjin University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICLR 2026
Abstract:Large vision-language models (LVLMs) have achieved remarkable progress in vision-language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51% to 79.21%.
[CV-26] Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLM s for Video Anomaly Detection ICLR2026
【Quick Read】: This paper addresses the limited performance of current tuning-free MLLM-based video anomaly detection (VAD) methods, which directly inherit pre-training biases and cannot adapt to specific video contexts, performing poorly on subtle or ambiguous anomalies. The key to the solution is SteerVAD, an intervention framework that shifts from passively reading internal representations to actively steering and rectifying them: gradient-free representational separability analysis (RSA) first identifies the attention heads most discriminative for anomaly detection as latent anomaly experts (LAEs); a hierarchical meta-controller (HMC) then generates dynamic rectification signals conditioned on both global context and the LAE outputs, and applies targeted anisotropic scaling to the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases, thereby achieving accurate anomaly perception and improved robustness.
Link: https://arxiv.org/abs/2602.24021
Authors: Zhaolin Cai, Fan Li, Huiyu Duan, Lijun He, Guangtao Zhai
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026
Abstract:Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in a tuning-free manner to perform VAD. However, their performance is limited as they directly inherit pre-training biases and cannot adapt internal representations to specific video contexts, leading to difficulties in handling subtle or ambiguous anomalies. To address these limitations, we propose a novel intervention framework, termed SteerVAD, which advances MLLM-based VAD by shifting from passively reading to actively steering and rectifying internal representations. Our approach first leverages the gradient-free representational separability analysis (RSA) to identify top attention heads as latent anomaly experts (LAEs) which are most discriminative for VAD. Then a hierarchical meta-controller (HMC) generates dynamic rectification signals by jointly conditioning on global context and these LAE outputs. The signals execute targeted, anisotropic scaling directly upon the LAE representation manifolds, amplifying anomaly-relevant dimensions while suppressing inherent biases. Extensive experiments on mainstream benchmarks demonstrate our method achieves state-of-the-art performance among tuning-free approaches requiring only 1% of training data, establishing it as a powerful new direction for video anomaly detection. The code will be released upon the publication.
[CV-27] SR3R: Rethinking Super-Resolution 3D Reconstruction With Feed-Forward Gaussian Splatting CVPR2026
【Quick Read】: This paper addresses the low fidelity of high-frequency details, weak cross-scene generalization, and poor real-time usability in 3D super-resolution (3DSR), caused by reliance on dense low-resolution (LR) inputs and per-scene optimization; existing methods are restricted to the high-frequency priors inherited from pretrained 2D super-resolution (2DSR) models and thus struggle to model 3D-specific high-frequency geometry and appearance. The key to the solution is to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to high-resolution (HR) 3D Gaussian Splatting (3DGS) representations, so the model can learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. Concretely, the SR3R framework predicts HR 3DGS representations with a learned mapping network, while Gaussian offset learning and feature refinement stabilize reconstruction and sharpen high-frequency details, achieving performance on unseen scenes, without fine-tuning, that surpasses SOTA per-scene optimization methods.
Link: https://arxiv.org/abs/2602.24020
Authors: Xiang Feng, Xiangbo Wang, Tieshi Zhong, Chengkai Wang, Yiting Zhao, Tianxiang Xu, Zhenzhong Kuang, Feiwei Qin, Xuefei Yin, Yanming Zhu
Affiliations: Hangzhou Dianzi University; ShanghaiTech University; Griffith University; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:3D super-resolution (3DSR) aims to reconstruct high-resolution (HR) 3D scenes from low-resolution (LR) multi-view images. Existing methods rely on dense LR inputs and per-scene optimization, which restricts the high-frequency priors for constructing HR 3D Gaussian Splatting (3DGS) to those inherited from pretrained 2D super-resolution (2DSR) models. This severely limits reconstruction fidelity, cross-scene generalization, and real-time usability. We propose to reformulate 3DSR as a direct feed-forward mapping from sparse LR views to HR 3DGS representations, enabling the model to autonomously learn 3D-specific high-frequency geometry and appearance from large-scale, multi-scene data. This fundamentally changes how 3DSR acquires high-frequency knowledge and enables robust generalization to unseen scenes. Specifically, we introduce SR3R, a feed-forward framework that directly predicts HR 3DGS representations from sparse LR views via the learned mapping network. To further enhance reconstruction fidelity, we introduce Gaussian offset learning and feature refinement, which stabilize reconstruction and sharpen high-frequency details. SR3R is plug-and-play and can be paired with any feed-forward 3DGS reconstruction backbone: the backbone provides an LR 3DGS scaffold, and SR3R upscales it to an HR 3DGS. Extensive experiments across three 3D benchmarks demonstrate that SR3R surpasses state-of-the-art (SOTA) 3DSR methods and achieves strong zero-shot generalization, even outperforming SOTA per-scene optimization methods on unseen scenes.
[CV-28] Interpretable Debiasing of Vision-Language Models for Social Fairness CVPR2026
【Quick Read】: This paper addresses social bias in vision-language models (VLMs), in particular the implicit biases that can arise from their black-box reasoning. Existing debiasing methods mitigate surface-level bias signals via post-hoc learning or test-time algorithms while ignoring the model's internal dynamics. The key to the solution is DeBiasLens, an interpretable, model-agnostic debiasing framework that applies sparse autoencoders (SAEs) to multimodal encoders to localize neurons associated with social attributes: without any social-attribute labels, it uncovers neurons highly responsive to specific demographics, including underrepresented groups. By selectively deactivating the social-attribute neurons most strongly tied to bias, it effectively mitigates the socially biased behavior of VLMs while preserving their semantic knowledge.
Link: https://arxiv.org/abs/2602.24014
Authors: Na Min An, Yoonna Jang, Yusuke Hirota, Ryo Hachiuma, Isabelle Augenstein, Hyunjung Shim
Affiliations: KAIST AI; University of Copenhagen; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 30 figures, 13 tables. Accepted to CVPR 2026
Abstract:The rapid advancement of Vision-Language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools, prioritizing social fairness in emerging real-world AI systems.
[CV-29] Ordinal Diffusion Models for Color Fundus Images
【Quick Read】: This paper addresses the fact that existing conditional diffusion models for medical image generation ignore the continuity of disease progression, especially when disease stages are only available as discrete but ordered labels, as in diabetic retinopathy (DR) grading. Conventional approaches condition on disease stages as independent classes, so generated images cannot reflect the smooth evolution of the pathological process. The key to the solution is an ordinal latent diffusion model that replaces categorical labels with a scalar disease representation, explicitly modeling the ordinal structure of DR severity and enabling smooth transitions between adjacent disease stages. This design lets the model learn continuous pathological change from coarse ordinal labels, validated on the EyePACS dataset via visual realism and clinical consistency metrics.
Link: https://arxiv.org/abs/2602.24013
Authors: Gustav Schmidt, Philipp Berens, Sarah Müller
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:It has been suggested that generative image models such as diffusion models can improve performance on clinically relevant tasks by offering deep learning models supplementary training data. However, most conditional diffusion models treat disease stages as independent classes, ignoring the continuous nature of disease progression. This mismatch is problematic in medical imaging because continuous pathological processes are typically only observed through coarse, discrete but ordered labels as in ophthalmology for diabetic retinopathy (DR). We propose an ordinal latent diffusion model for generating color fundus images that explicitly incorporates the ordered structure of DR severity into the generation process. Instead of categorical conditioning, we used a scalar disease representation, enabling a smooth transition between adjacent stages. We evaluated our approach using visual realism metrics and classification-based clinical consistency analysis on the EyePACS dataset. Compared to a standard conditional diffusion model, our model reduced the Fréchet inception distance for four of the five DR stages and increased the quadratic weighted κ from 0.79 to 0.87. Furthermore, interpolation experiments showed that the model captured a continuous spectrum of disease progression learned from ordered, coarse class labels.
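The quadratic weighted κ reported above (0.79 → 0.87) is the standard ordinal agreement metric: disagreements are penalized by the squared distance between grades. A minimal reference computation, assuming the five DR stages of this dataset (the example label vectors are made up for illustration):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic weights over ordinal labels 0..n-1."""
    O = np.zeros((n_classes, n_classes))          # observed confusion counts
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    i, j = np.indices((n_classes, n_classes))
    W = (i - j) ** 2 / (n_classes - 1) ** 2       # quadratic disagreement weights
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # chance-agreement counts
    return 1.0 - (W * O).sum() / (W * E).sum()

y_true = [0, 0, 1, 2, 3, 4, 4, 2]
exact = quadratic_weighted_kappa(y_true, y_true)                    # perfect agreement
off_by_one = quadratic_weighted_kappa(y_true, [0, 1, 1, 2, 3, 4, 3, 2])
print(exact, off_by_one)
```

Because the weights grow quadratically with grade distance, a grader who is only ever off by one adjacent stage still scores close to 1, which is why this metric suits ordered DR severity better than plain accuracy.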
[CV-30] Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
【Quick Read】: This paper targets the inference-efficiency bottleneck of masked image generation models (MIGMs) caused by multiple steps of bidirectional attention, observing that current acceleration methods incur substantial approximation error because their feature-caching mechanisms have limited expressivity and ignore sampling information. The key to the solution is MIGM-Shortcut, a lightweight predictive model that combines previous continuous features with already-sampled discrete tokens and regresses the average velocity field of feature evolution, efficiently predicting future feature states. It delivers significant acceleration (e.g., over 4x on Lumina-DiMOO) while maintaining quality, pushing the efficiency-quality Pareto frontier of masked image generation.
Link: https://arxiv.org/abs/2602.23996
Authors: Kaiwen Zhu, Quansheng Zeng, Yuandong Pu, Shuo Cao, Xiaohui Li, Yi Xin, Qi Qin, Jiayang Li, Yu Qiao, Jinjin Gu, Yihao Liu
Affiliations: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and the failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while keeping lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at this https URL.
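The core idea of the abstract, regressing an average velocity of feature evolution so several steps can be skipped at once, rather than naively reusing a stale cached feature, can be sketched on a toy trajectory. The linear dynamics, dimensions, and least-squares predictor below are illustrative only; the paper uses a lightweight network that also conditions on the sampled tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 16, 10
# Toy feature trajectory: features drift with a state-dependent velocity.
A = 0.05 * rng.normal(size=(d, d))
feats = [rng.normal(size=d)]
for _ in range(steps):
    feats.append(feats[-1] + A @ feats[-1])
feats = np.stack(feats)                 # (steps+1, d)

# Fit a linear map from the current feature to the AVERAGE velocity over
# a horizon of k steps, via least squares on the observed trajectory.
k = 3
X = feats[: steps + 1 - k]
V = (feats[k:] - X) / k                 # average-velocity regression targets
W, *_ = np.linalg.lstsq(X, V, rcond=None)

# "Shortcut": jump k steps in one go using the predicted average velocity.
pred = feats[0] + k * (feats[0] @ W)
err_shortcut = np.linalg.norm(pred - feats[k])
err_cache = np.linalg.norm(feats[0] - feats[k])  # naive caching: reuse the old feature
print(err_shortcut, err_cache)
```

On these linear dynamics the learned average-velocity jump lands essentially exactly on the future feature, while the cached feature drifts away, which is the gap the abstract attributes to the limited expressivity of caching.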
[CV-31] MINT: Multimodal Imaging-to-Speech Knowledge Transfer for Early Alzheimers Screening
【Quick Read】: This paper addresses two obstacles in early Alzheimer's disease (AD) screening: neuroimaging is costly and hard to deploy at population scale, while speech-only classifiers lack biological grounding and have limited ability to separate mild cognitive impairment (MCI) from cognitively normal (CN) aging. The key to the solution is MINT (Multimodal Imaging-to-Speech Knowledge Transfer), a three-stage cross-modal framework: at training time, a frozen MRI teacher trained on structural MRI defines a compact neuroimaging embedding space, and a residual projection head aligns speech representations to this fixed imaging manifold with a combined geometric loss, grounding MRI decision boundaries in the speech feature space. At inference, no imaging equipment is needed, yet performance matches or exceeds speech-only models (AUC 0.720 on ADNI-4), and multimodal fusion further improves results (0.973 vs. 0.958 for MRI alone), demonstrating the effectiveness and reliability of guiding speech representations with imaging priors.
Link: https://arxiv.org/abs/2602.23994
Authors: Vrushank Ahire, Yogesh Kumar, Anouck Girard, M. A. Ganaie
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Alzheimer’s disease is a progressive neurodegenerative disorder in which mild cognitive impairment (MCI) marks a critical transition between aging and dementia. Neuroimaging modalities, such as structural MRI, provide biomarkers of this transition; however, their high costs and infrastructure needs limit their deployment at a population scale. Speech analysis offers a non-invasive alternative, but speech-only classifiers are developed independently of neuroimaging, leaving decision boundaries biologically ungrounded and limiting reliability on the subtle CN-versus-MCI distinction. We propose MINT (Multimodal Imaging-to-Speech Knowledge Transfer), a three-stage cross-modal framework that transfers biomarker structure from MRI into a speech encoder at training time. An MRI teacher, trained on 1,228 subjects, defines a compact neuroimaging embedding space for CN-versus-MCI classification. A residual projection head aligns speech representations to this frozen imaging manifold via a combined geometric loss, adapting speech to the learned biomarker space while preserving imaging encoder fidelity. The frozen MRI classifier, which is never exposed to speech, is applied to aligned embeddings at inference and requires no scanner. Evaluation on ADNI-4 shows aligned speech achieves performance comparable to speech-only baselines (AUC 0.720 vs 0.711) while requiring no imaging at inference, demonstrating that MRI-derived decision boundaries can ground speech representations. Multimodal fusion improves over MRI alone (0.973 vs 0.958). Ablation studies identify dropout regularization and self-supervised pretraining as critical design decisions. To our knowledge, this is the first demonstration of MRI-to-speech knowledge transfer for early Alzheimer’s screening, establishing a biologically grounded pathway for population-level cognitive triage without neuroimaging at inference.
[CV-32] Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping CVPR2026
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在计算美学领域中缺乏审美指导(Aesthetic Guidance, AG)能力的问题,即模型无法识别照片中的美学缺陷并提供可操作的拍摄建议,从而导致其在美学裁剪(aesthetic cropping)任务中表现不佳。解决方案的关键在于构建首个大规模AG数据集AesGuide(含10,748张带审美评分、分析与指导的图像),并提出两阶段框架Venus:第一阶段通过逐步复杂的审美问题训练MLLM获得AG能力,第二阶段借助思维链(Chain-of-Thought, CoT)推理激活模型的美学裁剪能力,实现从拍摄到后期的可解释、交互式美学优化。
链接: https://arxiv.org/abs/2602.23980
作者: Tianxiang Du,Hulingxiao He,Yuxin Peng
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2026
Abstract:The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) – an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and guidance. Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. Extensive experiments show that Venus substantially improves AG capability and achieves state-of-the-art (SOTA) performance in aesthetic cropping, enabling interpretable and interactive aesthetic refinement across both stages of photo creation. Code is available at this https URL.
[CV-33] MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
【速读】:该论文旨在解决当前视频生成评估方法在复杂多镜头叙事场景下的不足,现有基准仍停留在单镜头范式,缺乏对长视频连贯性和吸引力的全面衡量指标。其解决方案的关键在于提出首个综合性基准MSVBench,该基准包含分层脚本和参考图像,专为多镜头视频生成设计,并构建了一个融合大型多模态模型(LMMs)高层语义推理能力与领域专用专家模型精细感知精度的混合评估框架。通过此框架对20种视频生成方法的系统评估,揭示了当前模型主要表现为视觉插值器而非真实世界模型的本质局限,同时验证了基准的人类判断相关性高达94.4%的可靠性,并进一步展示了其作为可扩展监督信号的价值,可用于轻量级模型微调以达到接近商用模型(如Gemini-2.5-Flash)的人类对齐性能。
链接: https://arxiv.org/abs/2602.23969
作者: Haoyuan Shi,Yunxin Li,Nanhao Deng,Zhenran Xu,Xinyu Chen,Longyue Wang,Baotian Hu,Min Zhang
机构: Harbin Institute of Technology (Shenzhen); Alibaba International Digital Commerce
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The evolution of video generation toward complex, multi-shot narratives has exposed a critical deficit in current evaluation methods. Existing benchmarks remain anchored to single-shot paradigms, lacking the comprehensive story assets and cross-shot metrics required to assess long-form coherence and appeal. To bridge this gap, we introduce MSVBench, the first comprehensive benchmark featuring hierarchical scripts and reference images tailored for Multi-Shot Video generation. We propose a hybrid evaluation framework that synergizes the high-level semantic reasoning of Large Multimodal Models (LMMs) with the fine-grained perceptual rigor of domain-specific expert models. Evaluating 20 video generation methods across diverse paradigms, we find that current models–despite strong visual fidelity–primarily behave as visual interpolators rather than true world models. We further validate the reliability of our benchmark by demonstrating a state-of-the-art Spearman’s rank correlation of 94.4% with human judgments. Finally, MSVBench extends beyond evaluation by providing a scalable supervisory signal. Fine-tuning a lightweight model on its pipeline-refined reasoning traces yields human-aligned performance comparable to commercial models like Gemini-2.5-Flash.
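The 94.4% human-alignment figure above is a Spearman rank correlation. As background, here is a minimal pure-Python sketch of that metric (Pearson correlation of ranks, with average ranks for ties); this is generic reference code, not the benchmark's evaluation script:

```python
# Pure-Python Spearman rank correlation. Ties receive their average
# 1-based rank; the final value is the Pearson correlation of the ranks.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # group equal values and assign them their average rank
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A perfectly monotone agreement between model scores and human judgments yields 1.0; a perfect inversion yields -1.0.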
[CV-34] SpikeTrack: A Spike-driven Framework for Efficient Visual Tracking CVPR2026
【速读】:该论文旨在解决当前脉冲神经网络(Spiking Neural Networks, SNNs)在RGB视觉跟踪任务中难以兼顾能效与精度的问题:现有SNN跟踪框架或未完全适配脉冲驱动计算机制,或未能充分利用神经元的时空动态特性,导致效率与准确率之间存在权衡。解决方案的关键在于提出SpikeTrack,一个全脉冲驱动的高效RGB目标跟踪框架,其核心创新是采用新颖的非对称设计——通过不对称的时间步扩展和单向信息流,有效利用时空动态特性并显著减少计算量;同时引入受神经推理机制启发的记忆检索模块,通过递归查询由模板初始化的紧凑记忆来持续提取目标线索,增强目标感知能力,从而实现高精度与低能耗的统一。
链接: https://arxiv.org/abs/2602.23963
作者: Qiuyang Zhang,Jiujun Cheng,Qichao Mao,Cong Liu,Yu Fang,Yuhong Li,Mengying Ge,Shangce Gao
机构: Tongji University (同济大学); Nova University of Lisbon (里斯本新大学); Stockholm University (斯德哥尔摩大学); Shanghai University (上海大学); University of Toyama (富山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Spiking Neural Networks (SNNs) promise energy-efficient vision, but applying them to RGB visual tracking remains difficult: Existing SNN tracking frameworks either do not fully align with spike-driven computation or do not fully leverage neurons’ spatiotemporal dynamics, leading to a trade-off between efficiency and accuracy. To address this, we introduce SpikeTrack, a spike-driven framework for energy-efficient RGB object tracking. SpikeTrack employs a novel asymmetric design that uses asymmetric timestep expansion and unidirectional information flow, harnessing spatiotemporal dynamics while cutting computation. To ensure effective unidirectional information transfer between branches, we design a memory-retrieval module inspired by neural inference mechanisms. This module recurrently queries a compact memory initialized by the template to retrieve target cues and sharpen target perception over time. Extensive experiments demonstrate that SpikeTrack achieves state-of-the-art performance among SNN-based trackers and remains competitive with advanced ANN trackers. Notably, it surpasses TransT on the LaSOT dataset while consuming only 1/26 of its energy. To our knowledge, SpikeTrack is the first spike-driven framework to make RGB tracking both accurate and energy efficient. The code and models are available at this https URL.
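As background on the spike-driven computation SpikeTrack builds on, a toy leaky integrate-and-fire (LIF) neuron is sketched below; the time constant and threshold are generic textbook values, not parameters from the paper:

```python
# Toy leaky integrate-and-fire neuron: the basic unit behind
# spike-driven computation in SNNs. tau and v_th are generic
# textbook values, not SpikeTrack's parameters.
def lif(inputs, tau=2.0, v_th=1.0):
    """Run a LIF neuron over input currents; return the binary spike train."""
    v, spikes = 0.0, []
    for x in inputs:
        v = v + (x - v) / tau      # leaky integration toward the input
        if v >= v_th:              # threshold crossing: emit a spike
            spikes.append(1)
            v = 0.0                # hard reset after firing
        else:
            spikes.append(0)
    return spikes
```

Because activity is binary and sparse, downstream layers can replace multiplications with additions, which is the source of the energy savings cited above.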
[CV-35] Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像区域推理中因依赖文本化坐标或固定粒度图像块而导致的模态错位、语义碎片化以及区域选择不精确的问题。其核心解决方案是提出数值化视觉思维链(Numerical Visual Chain-of-Thought, NV-CoT),通过将MLLM的动作空间从离散词汇标记扩展至连续欧氏空间,使模型能够直接生成边界框坐标作为动作,仅需最小架构改动即可实现高精度区域定位。该方法采用高斯或拉普拉斯分布策略替代传统分类策略,并引入重参数化采样以增强随机性,从而兼容GRPO类策略优化,显著提升定位精度与最终答案准确率,同时加速训练收敛。
链接: https://arxiv.org/abs/2602.23959
作者: Kesen Zhao,Beier Zhu,Junbao Zhou,Xingyu Zhu,Zhongqi Yue,Hanwang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent multimodal large language models (MLLMs) increasingly rely on visual chain-of-thought to perform region-grounded reasoning over images. However, existing approaches ground regions via either textified coordinates-causing modality mismatch and semantic fragmentation or fixed-granularity patches that both limit precise region selection and often require non-trivial architectural changes. In this paper, we propose Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables MLLMs to reason over images using continuous numerical coordinates. NV-CoT expands the MLLM action space from discrete vocabulary tokens to a continuous Euclidean space, allowing models to directly generate bounding-box coordinates as actions with only minimal architectural modification. The framework supports both supervised fine-tuning and reinforcement learning. In particular, we replace categorical token policies with a Gaussian (or Laplace) policy over coordinates and introduce stochasticity via reparameterized sampling, making NV-CoT fully compatible with GRPO-style policy optimization. Extensive experiments on three benchmarks against eight representative visual reasoning baselines demonstrate that NV-CoT significantly improves localization precision and final answer accuracy, while also accelerating training convergence, validating the effectiveness of continuous-action visual reasoning in MLLMs. The code is available in this https URL.
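The continuous-action idea above can be sketched as a diagonal-Gaussian policy over box coordinates with reparameterized sampling. The toy mu/log_sigma values and helper names below are illustrative assumptions, not the authors' implementation:

```python
import math
import random

random.seed(0)  # reproducible sampling for this illustration

# Diagonal-Gaussian policy over box coordinates with the
# reparameterization trick, in the spirit of NV-CoT's continuous
# action space. All names and values here are illustrative.
def sample_box(mu, log_sigma):
    """Sample (x1, y1, x2, y2) as mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = [random.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]

def log_prob(box, mu, log_sigma):
    """Diagonal-Gaussian log-density summed over the four coordinates
    (the policy log-prob needed for GRPO-style updates)."""
    total = 0.0
    for x, m, ls in zip(box, mu, log_sigma):
        total += -0.5 * ((x - m) / math.exp(ls)) ** 2 - ls - 0.5 * math.log(2 * math.pi)
    return total

mu = [0.25, 0.25, 0.75, 0.75]   # predicted box corners (normalized coords)
log_sigma = [-3.0] * 4          # small exploration noise, sigma ~ 0.05
box = sample_box(mu, log_sigma)
lp = log_prob(box, mu, log_sigma)
```

The sampled box stays close to the predicted mean, and its log-probability is differentiable in mu and log_sigma, which is what makes policy-gradient optimization over coordinates possible.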
[CV-36] SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls CVPR2026
【速读】:该论文旨在解决当前文本到视频扩散模型在处理多事件提示时,因缺乏显式的时间锚定而导致生成场景混杂或坍塌、破坏叙事连贯性的问题。其核心解决方案是提出一种无需训练的框架 SwitchCraft,关键创新在于引入事件对齐查询引导(Event-Aligned Query Steering, EAQS),通过引导帧级注意力机制与相关事件提示对齐,从而提升事件与视频帧之间的语义对应关系;同时提出自适应平衡强度求解器(Auto-Balance Strength Solver, ABSS),动态调节引导强度以兼顾时间一致性与视觉保真度,实现高质量多事件视频生成。
链接: https://arxiv.org/abs/2602.23956
作者: Qianxun Xu,Chenxi Song,Yujun Cai,Chi Zhang
机构: Westlake University (西湖大学); Duke Kunshan University (昆山杜克大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Recent advances in text-to-video diffusion models have enabled high-fidelity and temporally coherent video synthesis. However, current models are predominantly optimized for single-event generation. When handling multi-event prompts, without explicit temporal grounding, such models often produce blended or collapsed scenes that break the intended narrative. To address this limitation, we present SwitchCraft, a training-free framework for multi-event video generation. Our key insight is that uniform prompt injection across time ignores the correspondence between events and frames. To this end, we introduce Event-Aligned Query Steering (EAQS), which steers frame-level attention to align with relevant event prompts. Furthermore, we propose Auto-Balance Strength Solver (ABSS), which adaptively balances steering strength to preserve temporal consistency and visual fidelity. Extensive experiments demonstrate that SwitchCraft substantially improves prompt alignment, event clarity, and scene consistency compared with existing baselines, offering a simple yet effective solution for multi-event video generation.
[CV-37] GDA-YOLO11: Amodal Instance Segmentation for Occlusion-Robust Robotic Fruit Harvesting
【速读】:该论文旨在解决机器人采摘过程中因果实遮挡(occlusion)导致的检测与定位失败问题,此类问题常引发显著的作物损失。解决方案的关键在于提出一种基于新型非模态分割(amodal segmentation)模型 GDA-YOLO11 的采摘框架,该模型通过架构改进和更新的不对称掩码损失函数提升对被遮挡果实的完整边界预测能力;在此基础上,利用欧氏距离变换估算拾取点并将其投影至三维坐标系,实现从感知到执行的闭环集成。实验表明,GDA-YOLO11 在多个指标上优于基准模型 YOLO11n,且在不同遮挡水平下均保持较高采摘成功率,首次实现了非模态实例分割在农业机器人采摘中的实用化验证。
链接: https://arxiv.org/abs/2602.23953
作者: Caner Beldek,Emre Sariyildiz,Son Lam Phung,Gursel Alici
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, journal pre-print
Abstract:Occlusion remains a critical challenge in robotic fruit harvesting, as undetected or inaccurately localised fruits often result in substantial crop losses. To mitigate this issue, we propose a harvesting framework using a new amodal segmentation model, GDA-YOLO11, which incorporates architectural improvements and an updated asymmetric mask loss. The proposed model is trained on a modified version of a public citrus dataset and evaluated on both the base dataset and occlusion-sensitive subsets with varying occlusion levels. Within the framework, full fruit masks, including invisible regions, are inferred by GDA-YOLO11, and picking points are subsequently estimated using the Euclidean distance transform. These points are then projected into 3D coordinates for robotic harvesting execution. Experiments were conducted using real citrus fruits in a controlled environment simulating occlusion scenarios. Notably, to the best of our knowledge, this study provides the first practical demonstration of amodal instance segmentation in robotic fruit harvesting. GDA-YOLO11 achieves a precision of 0.844, recall of 0.846, mAP@50 of 0.914, and mAP@50:95 of 0.636, outperforming YOLO11n by 5.1%, 1.3%, and 1.0% in precision, mAP@50, and mAP@50:95, respectively. The framework attains harvesting success rates of 92.59%, 85.18%, 48.14%, and 22.22% at zero to high occlusion levels, improving success by 3.5% under medium and high occlusion. These findings demonstrate that GDA-YOLO11 enhances occlusion-robust segmentation and streamlines perception-to-action integration, paving the way for more reliable autonomous systems in agriculture.
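The picking-point step described above (Euclidean distance transform on the full amodal mask) can be sketched as follows; the brute-force transform and the toy mask are our illustration, not the paper's pipeline, which would use an optimized EDT such as `scipy.ndimage.distance_transform_edt`:

```python
import numpy as np

# Picking-point selection sketch: take the mask pixel farthest from the
# background, i.e. the argmax of the Euclidean distance transform.
# Brute-force version for a toy mask only.
def picking_point(mask):
    """Return (row, col) of the foreground pixel with the largest
    distance to its nearest background pixel."""
    fg = np.argwhere(mask)
    bg = np.argwhere(~mask)
    # pairwise distances foreground -> background (fine for toy sizes)
    d = np.sqrt(((fg[:, None, :] - bg[None, :, :]) ** 2).sum(-1))
    nearest = d.min(axis=1)        # distance-transform value per fg pixel
    return tuple(fg[nearest.argmax()])

mask = np.zeros((7, 7), dtype=bool)
mask[1:6, 1:6] = True              # toy 5x5 "amodal fruit mask"
pt = picking_point(mask)           # deepest interior point
```

Running the transform on the amodal (full) mask rather than the visible mask is what keeps the picking point near the fruit centre even when part of the fruit is occluded.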
[CV-38] CC-VQA: Conflict- and Correlation-Aware Method for Mitigating Knowledge Conflict in Knowledge-Based Visual Question Answering CVPR2026
【速读】:该论文旨在解决知识增强型视觉问答(Knowledge-based Visual Question Answering, KB-VQA)中静态参数化知识与动态检索信息之间的冲突问题,即模型输出往往忽略检索到的上下文或与其参数化知识不一致,导致性能受限。现有方法多借鉴语言模型中的冲突缓解策略,但忽视了视觉信息在冲突识别中的关键作用,并存在冗余检索内容干扰准确判断的问题。其解决方案的核心在于提出一种无需训练的、兼顾冲突感知与相关性引导的新方法CC-VQA,包含两个关键组件:一是以视觉为中心的上下文冲突推理机制,实现对内部和外部知识语义层面的冲突分析;二是基于相关性的编码与解码机制,通过位置编码压缩低相关性陈述并利用相关性加权的冲突评分进行自适应解码,从而有效提升KB-VQA任务的准确性与鲁棒性。
链接: https://arxiv.org/abs/2602.23952
作者: Yuyang Hong,Jiaqi Gu,Yujin Lou,Lubin Fan,Qi Yang,Ying Wang,Kun Ding,Yue Wu,Shiming Xiang,Jieping Ye
机构: UCAS; Institute of Automation; Alibaba Cloud Computing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2026
Abstract:Knowledge-based visual question answering (KB-VQA) demonstrates significant potential for handling knowledge-intensive tasks. However, conflicts arise between static parametric knowledge in vision language models (VLMs) and dynamically retrieved information due to the static model knowledge from pre-training. The outputs either ignore retrieved contexts or exhibit inconsistent integration with parametric knowledge, posing substantial challenges for KB-VQA. Current knowledge conflict mitigation methods are primarily adapted from language-based approaches, focusing on context-level conflicts through engineered prompting strategies or context-aware decoding mechanisms. However, these methods neglect the critical role of visual information in conflicts and suffer from redundant retrieved contexts, which impair accurate conflict identification and effective mitigation. To address these limitations, we propose CC-VQA: a novel training-free, conflict- and correlation-aware method for KB-VQA. Our method comprises two core components: (1) Vision-Centric Contextual Conflict Reasoning, which performs visual-semantic conflict analysis across internal and external knowledge contexts; and (2) Correlation-Guided Encoding and Decoding, featuring positional encoding compression for low-correlation statements and adaptive decoding using correlation-weighted conflict scoring. Extensive evaluations on E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate that CC-VQA achieves state-of-the-art performance, yielding absolute accuracy improvements of 3.3% to 6.4% compared to existing methods. Code is available at this https URL.
[CV-39] AHAP: Reconstructing Arbitrary Humans from Arbitrary Perspectives with Geometric Priors
【速读】:该论文旨在解决从任意视角图像中重建三维人体的问题,尤其针对传统方法依赖预标定(如棋盘格或MVS算法)导致的可扩展性和适用性受限问题。其解决方案的关键在于提出AHAP框架,通过多视角几何信息的有效融合来实现无需相机标定的人体关联、重建与定位;具体包括:利用可学习的人体查询和软分配机制结合对比学习监督的跨视角身份关联模块,以及基于跨视角重投影损失引导的SMPL参数预测模块,同时借助多视角三角化消除单目方法中的深度歧义,从而提升3D人体定位精度。
链接: https://arxiv.org/abs/2602.23951
作者: Xiaozhen Qiao,Wenjia Wang,Zhiyuan Zhao,Jiacheng Sun,Ping Luo,Hongyuan Zhang,Xuelong Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing 3D humans from images captured at multiple perspectives typically requires pre-calibration, like using checkerboards or MVS algorithms, which limits scalability and applicability in diverse real-world scenarios. In this work, we present AHAP (Reconstructing Arbitrary Humans from Arbitrary Perspectives), a feed-forward framework for reconstructing arbitrary humans from arbitrary camera perspectives without requiring camera calibration. Our core lies in the effective fusion of multi-view geometry to assist human association, reconstruction and localization. Specifically, we use a Cross-View Identity Association module through learnable person queries and soft assignment, supervised by contrastive learning to resolve cross-view human identity association. A Human Head fuses cross-view features and scene context for SMPL prediction, guided by cross-view reprojection losses to enforce body pose consistency. Additionally, multi-view geometry eliminates the depth ambiguity inherent in monocular methods, providing more precise 3D human localization through multi-view triangulation. Experiments on EgoHumans and EgoExo4D demonstrate that AHAP achieves competitive performance on both world-space human reconstruction and camera pose estimation, while being 180× faster than optimization-based approaches.
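The claim that multi-view triangulation removes monocular depth ambiguity can be illustrated with a minimal two-ray midpoint triangulation; the cameras and rays below are toy values, and this is not the paper's full localization module:

```python
import numpy as np

# Midpoint triangulation of two viewing rays: with two calibrated views,
# the 3D point is pinned down where a single view leaves depth free.
def triangulate(c1, d1, c2, d2):
    """Closest-point (midpoint) intersection of rays c1 + t1*d1 and c2 + t2*d2."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    A = np.stack([d1, -d2], axis=1)          # solve t1*d1 - t2*d2 = c2 - c1
    t = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]
    p1, p2 = c1 + t[0] * d1, c2 + t[1] * d2
    return (p1 + p2) / 2                     # midpoint of the closest segment

# Two cameras on the x-axis, both observing a person at (0, 0, 5):
c1, c2 = np.array([-1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
p = triangulate(c1, np.array([1.0, 0.0, 5.0]), c2, np.array([-1.0, 0.0, 5.0]))
```

A single ray from either camera is consistent with any depth along it; the second view collapses that one-parameter family to a point, which is the geometric basis for the precise localization reported above.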
[CV-40] Micro-expression Recognition Based on Dual-branch Feature Extraction and Fusion
【速读】:该论文旨在解决微表情(micro-expression)识别中因表情持续时间短、表现细微而导致现有基于光流(optical flow)的方法性能受限的问题。其解决方案的关键在于提出一种集成并行注意力机制的双分支特征提取网络:通过残差网络缓解梯度消失与网络退化问题,利用Inception结构增强模型表征能力并抑制无关区域干扰,同时设计自适应特征融合模块以有效整合双分支特征,从而在CASME II数据集上实现74.67%的识别准确率,显著优于LBP-TOP和MSMMT等方法。
链接: https://arxiv.org/abs/2602.23950
作者: Mingjie Zhang,Bo Li,Wanting Liu,Hongyan Cui,Yue Li,Qingwen Li,Hong Li,Ge Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 4 figures, conference paper
Abstract:Micro-expressions, characterized by transience and subtlety, pose challenges to existing optical flow-based recognition methods. To address these challenges, this paper proposes a dual-branch micro-expression feature extraction network integrated with parallel attention. Key contributions include: 1) a residual network designed to alleviate gradient vanishing and network degradation; 2) an Inception network constructed to enhance model representation and suppress interference from irrelevant regions; 3) an adaptive feature fusion module developed to integrate dual-branch features. Experiments on the CASME II dataset demonstrate that the proposed method achieves 74.67% accuracy, outperforming LBP-TOP (by 11.26%), MSMMT (by 3.36%), and other comparative methods.
[CV-41] PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLs)在3D点云理解中缺乏显式几何推理能力的问题。当前方法通常将几何推理视为隐式映射过程,导致模型产生与结构细节不符的几何幻觉(geometric hallucinations),即生成看似合理但不准确的答案。解决方案的关键在于提出PointCoT框架,采用“看(Look)、思考(Think)、回答(Answer)”的范式,通过监督模型先生成基于几何信息的链式思维(Chain-of-Thought, CoT)推理路径,再输出最终答案,从而实现几何感知的显式逻辑推理。该方法结合了大规模标注数据集Point-Reason-Instruct和双流多模态架构,有效融合语义外观与几何真实性,在复杂推理任务上达到最先进性能。
链接: https://arxiv.org/abs/2602.23945
作者: Dongxu Zhang,Yiding Sun,Pengcheng Li,Yumou Liu,Hongqiang Lin,Haoran Xu,Xiaoxuan Mu,Liang Lin,Wenbiao Yan,Ning Yang,Chaowei Fang,Juanjuan Zhao,Jihua Zhu,Conghui He,Cheng Tan
机构: Xi’an Jiaotong University (西安交通大学); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Nanyang Technological University (南洋理工大学); Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Institute of Automation, CASIA (中国科学院自动化研究所); Taiyuan University of Technology (太原理工大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:While Multimodal Large Language Models (MLLMs) demonstrate proficiency in 2D scenes, extending their perceptual intelligence to 3D point cloud understanding remains a significant challenge. Current approaches focus primarily on aligning 3D features with pre-trained models. However, they typically treat geometric reasoning as an implicit mapping process. These methods bypass intermediate logical steps and consequently suffer from geometric hallucinations. They confidently generate plausible responses that fail to ground in precise structural details. To bridge this gap, we present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. We advocate for a “Look, Think, then Answer” paradigm. In this approach, the model is supervised to generate geometry-grounded rationales before predicting final answers. To facilitate this, we construct Point-Reason-Instruct, a large-scale benchmark comprising ~86k instruction-tuning samples with hierarchical CoT annotations. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth. Extensive experiments demonstrate that PointCoT achieves state-of-the-art performance on complex reasoning tasks.
[CV-42] Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)任务中代理在未见环境中进行长程推理时面临的挑战,尤其是面对模糊、粗粒度指令时的表现瓶颈。其核心问题在于如何有效利用多模态事件知识来增强代理的因果推理能力与环境理解深度。解决方案的关键在于提出一种以事件为中心的知识增强策略:首先构建了首个大规模多模态时空知识图谱 YE-KG(包含超过86k节点和83k边),通过多模态大语言模型(如LLaVa、GPT-4)从真实室内视频中提取结构化的语义-动作-效果事件,作为显式的情景记忆;其次设计了STE-VLN模型,采用粗粒度到细粒度的分层检索机制,将因果事件序列动态融合进代理的自指视角视觉观测中,从而提升对复杂指令的理解与长期路径规划能力。实验表明,该方法在REVERIE、R2R及R2R-CE等多个基准上显著优于当前最先进方法。
链接: https://arxiv.org/abs/2602.23937
作者: Haoxuan Xu,Tianfu Li,Wenbo Chen,Yi Liu,Xingxing Zuo,Yaoxian Song,Haoang Li
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Tsinghua University (清华大学); Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学); Hangzhou City University (杭州城市大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Navigation (VLN) agents often struggle with long-horizon reasoning in unseen environments, particularly when facing ambiguous, coarse-grained instructions. While recent advances use knowledge graph to enhance reasoning, the potential of multimodal event knowledge inspired by human episodic memory remains underexplored. In this work, we propose an event-centric knowledge enhancement strategy for automated process knowledge mining and feature fusion to solve coarse-grained instruction and long-horizon reasoning in VLN task. First, we construct YE-KG, the first large-scale multimodal spatiotemporal knowledge graph, with over 86k nodes and 83k edges, derived from real-world indoor videos. By leveraging multimodal large language models (i.e., LLaVa, GPT4), we extract unstructured video streams into structured semantic-action-effect events to serve as explicit episodic memory. Second, we introduce STE-VLN, which integrates the above graph into VLN models via a Coarse-to-Fine Hierarchical Retrieval mechanism. This allows agents to retrieve causal event sequences and dynamically fuse them with egocentric visual observations. Experiments on REVERIE, R2R, and R2R-CE benchmarks demonstrate the efficiency of our event-centric strategy, outperforming state-of-the-art approaches across diverse action spaces. Our data and code are available on the project website this https URL.
[CV-43] Leveraging Geometric Prior Uncertainty and Complementary Constraints for High-Fidelity Neural Indoor Surface Reconstruction ICRA2026
【速读】:该论文旨在解决神经隐式表面重建中因几何先验(geometric prior)不可靠或噪声导致的细粒度结构(如薄结构和复杂几何)恢复困难的问题。现有方法依赖优化过程中产生的隐式不确定性来过滤先验,这种方式间接且效率低下,同时在高不确定性区域施加掩码监督会进一步造成优化欠约束。解决方案的关键在于提出GPU-SDF框架,其核心创新包括:1)引入一个无需额外网络的自监督模块,显式估计几何先验不确定性;2)设计不确定性引导损失函数,动态调节先验影响而非直接丢弃,从而保留微弱但有信息量的线索;3)针对高不确定性区域,融合边缘距离场(edge distance field)以增强边界监督和多视角一致性正则化(multi-view consistency regularization)以保证几何一致性,从而显著提升细粒度细节重建质量。
链接: https://arxiv.org/abs/2602.23926
作者: Qiyu Feng,Jiwei Shan,Shing Shin Cheng,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICRA 2026
Abstract:Neural implicit surface reconstruction with signed distance function has made significant progress, but recovering fine details such as thin structures and complex geometries remains challenging due to unreliable or noisy geometric priors. Existing approaches rely on implicit uncertainty that arises during optimization to filter these priors, which is indirect and inefficient, and masking supervision in high-uncertainty regions further leads to under-constrained optimization. To address these issues, we propose GPU-SDF, a neural implicit framework for indoor surface reconstruction that leverages geometric prior uncertainty and complementary constraints. We introduce a self-supervised module that explicitly estimates prior uncertainty without auxiliary networks. Based on this estimation, we design an uncertainty-guided loss that modulates prior influence rather than discarding it, thereby retaining weak but informative cues. To address regions with high prior uncertainty, GPU-SDF further incorporates two complementary constraints: an edge distance field that strengthens boundary supervision and a multi-view consistency regularization that enforces geometric coherence. Extensive experiments confirm that GPU-SDF improves the reconstruction of fine details and serves as a plug-and-play enhancement for existing frameworks. Source code will be available at this https URL
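A minimal sketch of the "modulate rather than discard" idea for prior supervision, assuming an exponential down-weighting of high-uncertainty priors (the exact modulation function in the paper may differ):

```python
import math

# Uncertainty-modulated prior supervision: rather than masking out
# high-uncertainty priors, down-weight their residuals smoothly so weak
# but informative cues are kept. The exponential weighting and the
# constant k are illustrative assumptions, not the paper's exact form.
def weighted_prior_loss(residuals, uncertainties, k=2.0):
    """Sum of per-sample prior residuals, each scaled by exp(-k * u)."""
    return sum(math.exp(-k * u) * r for r, u in zip(residuals, uncertainties))
```

Unlike a hard mask, the weight never reaches zero, so high-uncertainty regions still receive some prior supervision instead of becoming under-constrained.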
[CV-44] The Geometry of Transfer: Unlocking Medical Vision Manifolds for Training-Free Model Ranking
【速读】:该论文旨在解决医学图像分割任务中如何高效选择最优自监督学习(Self-Supervised Learning, SSL)基础模型的问题,现有迁移性评估(Transferability Estimation, TE)方法主要面向分类任务,依赖全局统计假设,难以捕捉密集预测任务所需的拓扑复杂性。其解决方案的关键在于提出一种基于拓扑驱动的迁移性评估框架,通过三个核心组件实现:(1) 全局表示拓扑差异(Global Representation Topology Divergence, GRTD),利用最小生成树量化特征与标签之间的结构同构性;(2) 局部边界感知拓扑一致性(Local Boundary-Aware Topological Consistency, LBTC),专门评估关键解剖边界处流形的可分离性;(3) 任务自适应融合机制,根据目标任务的语义基数动态整合全局与局部指标。该方法在OpenMind大规模基准上显著优于现有最先进基线,相对加权肯德尔相关系数提升约31%,提供了一种无需微调的高效模型选择代理工具。
链接: https://arxiv.org/abs/2602.23916
作者: Jiaqi Tang,Shaoyang Zhang,Xiaoqi Wang,Jiaying Zhou,Yang Liu,Qingchao Chen
机构: Peking University (北京大学); Hohai University (河海大学); Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall correlation, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.
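The MST-based comparison behind GRTD can be illustrated on toy point sets: build minimum spanning trees of the same items in two spaces and measure how much the edge sets differ. The symmetric-difference divergence below is our simplification of the idea, not the paper's exact metric:

```python
import math

# Toy MST-based topology comparison in the spirit of GRTD: identical
# neighbourhood structure in two embeddings yields identical MSTs.
def mst_edges(points):
    """Prim's algorithm; returns the MST edge set as frozensets {i, j}."""
    n, in_tree, edges = len(points), {0}, set()
    while len(in_tree) < n:
        # cheapest edge crossing from the tree to the rest
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        in_tree.add(j)
        edges.add(frozenset((i, j)))
    return edges

def topology_divergence(points_a, points_b):
    """Fraction of MST edges that differ between the two embeddings."""
    ea, eb = mst_edges(points_a), mst_edges(points_b)
    return len(ea ^ eb) / len(ea | eb)
```

A divergence of 0 means the two spaces connect the same items the same way (structural isomorphism); values near 1 mean the neighbourhood structure is entirely rearranged.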
[CV-45] Half-Truths Break Similarity-Based Retrieval
【速读】:该论文旨在解决CLIP类双编码器模型在图像-文本匹配任务中对“半真话”(half-truths)的敏感性问题,即当文本描述中添加一个看似合理但错误的对象或关系时,模型仍可能保持甚至提升相似度得分,违背了人类对语义一致性的直觉。这种现象反映出模型缺乏对句子组成部分(如实体和关系)的细粒度监督,仅通过全句级别的对比学习难以确保每个成分的语义准确性。解决方案的关键在于提出CS-CLIP(Component-Supervised CLIP),其核心机制是将文本描述分解为实体和关系单元,为每个单元构建最小修改的干扰样本(foil),并通过微调使模型在保持原有双编码器推理结构的前提下,能够将正确单元的得分高于其对应的干扰单元,从而增强对局部语义的精确建模能力。
链接: https://arxiv.org/abs/2602.23906
作者: Bora Kargi,Arnas Uselis,Seong Joon Oh
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: this https URL
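The per-unit supervision of CS-CLIP (score each correct caption unit above its minimally edited foil) can be sketched as a hinge loss over precomputed similarity scores; the margin value and hinge form are illustrative assumptions, not the authors' training objective:

```python
# Per-unit foil supervision sketch: each caption unit's image-text score
# must beat its minimally edited foil by a margin. Scores are plain
# floats standing in for CLIP similarities.
def foil_margin_loss(unit_scores, foil_scores, margin=0.2):
    """Hinge loss penalizing units whose foils score too close or higher."""
    return sum(max(0.0, margin - (s - f))
               for s, f in zip(unit_scores, foil_scores))
```

A unit whose foil already scores well below it contributes zero loss, so training pressure concentrates on exactly the half-truth cases where an incorrect detail fails to lower the similarity.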
[CV-46] SegMate: Asymmetric Attention-Based Lightweight Architecture for Efficient Multi-Organ Segmentation
【速读】:该论文旨在解决当前医学图像分割模型虽具备优异准确性但计算资源消耗大、难以在资源受限的临床环境中部署的问题。其解决方案的关键在于提出一种高效的2.5D框架SegMate,通过精心集成非对称架构、注意力机制、多尺度特征融合、基于切片的位置条件化以及多任务优化等技术,在显著降低计算量(最高减少2.5倍GFLOPs)和显存占用(最高减少2.1倍VRAM)的同时,仍保持甚至提升分割性能(Dice分数提升约1%),并在多个数据集上展现出良好的零样本跨域泛化能力。
链接: https://arxiv.org/abs/2602.23903
作者: Andrei-Alexandru Bunea,Dan-Matei Popovici,Radu Tudor Ionescu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:State-of-the-art models for medical image segmentation achieve excellent accuracy but require substantial computational resources, limiting deployment in resource-constrained clinical settings. We present SegMate, an efficient 2.5D framework that achieves state-of-the-art accuracy, while considerably reducing computational requirements. Our efficient design is the result of meticulously integrating asymmetric architectures, attention mechanisms, multi-scale feature fusion, slice-based positional conditioning, and multi-task optimization. We demonstrate the efficiency-accuracy trade-off of our framework across three modern backbones (EfficientNetV2-M, MambaOut-Tiny, FastViT-T12). We perform experiments on three datasets: TotalSegmentator, SegTHOR and AMOS22. Compared with the vanilla models, SegMate reduces computation (GFLOPs) by up to 2.5x and memory footprint (VRAM) by up to 2.1x, while generally registering performance gains of around 1%. On TotalSegmentator, we achieve a Dice score of 93.51% with only 295MB peak GPU memory. Zero-shot cross-dataset evaluations on SegTHOR and AMOS22 demonstrate strong generalization, with Dice scores of up to 86.85% and 89.35%, respectively. We release our open-source code at this https URL.
[CV-47] ABPolicy: Asynchronous B-Spline Flow Policy for Real-Time and Smooth Robotic Manipulation
【速读】:该论文旨在解决机器人操作中策略在原始动作空间进行同步推理时所面临的三大挑战:块内抖动(intra-chunk jitter)、块间不连续性(inter-chunk discontinuities)以及停顿式执行(stop-and-go execution),这些问题会显著降低策略的平滑性和对环境变化的响应能力。其解决方案的关键在于提出ABPolicy,一种基于B样条控制点动作空间的异步流匹配策略(asynchronous flow-matching policy)。该方法通过B样条表示确保块内平滑性,引入双向动作预测与重拟合优化以保障块间连续性,并借助异步推理实现实时连续更新,从而有效降低轨迹急动度(jerk),提升运动平滑性与任务性能。
链接: https://arxiv.org/abs/2602.23901
作者: Fan Yang,Peiguang Jing,Kaihua Qu,Ningyuan Zhao,Yuting Su
机构: Tianjin University (天津大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robotic manipulation requires policies that are smooth and responsive to evolving observations. However, synchronous inference in the raw action space introduces several challenges, including intra-chunk jitter, inter-chunk discontinuities, and stop-and-go execution. These issues undermine a policy’s smoothness and its responsiveness to environmental changes. We propose ABPolicy, an asynchronous flow-matching policy that operates in a B-spline control-point action space. First, the B-spline representation ensures intra-chunk smoothness. Second, we introduce bidirectional action prediction coupled with refitting optimization to enforce inter-chunk continuity. Finally, by leveraging asynchronous inference, ABPolicy delivers real-time, continuous updates. We evaluate ABPolicy across seven tasks encompassing both static settings and dynamic settings with moving objects. Empirical results indicate that ABPolicy reduces trajectory jerk, leading to smoother motion and improved performance. Project website: this https URL.
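Why a control-point action space gives intra-chunk smoothness can be seen from a uniform cubic B-spline, which blends four neighbouring control points with C² continuity. The standard uniform cubic basis is used below; ABPolicy's chunking and refitting logic is not reproduced:

```python
# Uniform cubic B-spline evaluation from control points. Four neighbours
# are blended with the standard uniform cubic basis, giving C2-continuous
# (jerk-limited) trajectories.
def cubic_bspline(ctrl, t):
    """Evaluate a 1-D uniform cubic B-spline at t in [0, len(ctrl) - 3)."""
    i = int(t)
    u = t - i
    p0, p1, p2, p3 = ctrl[i:i + 4]
    b0 = (1 - u) ** 3 / 6
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6
    b3 = u**3 / 6
    return b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3
```

Because the basis functions sum to one and each control point influences only four spline segments, local edits to control points (as in chunk refitting) change the trajectory smoothly rather than introducing jumps.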
[CV-48] Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals
【速读】: This paper tackles diagnostic over-escalation in breast ultrasound screening (i.e., unnecessary follow-up examinations or biopsies), aiming to improve screening efficiency and reduce patient burden. The key to the solution is an experience-guided cascaded multi-agent framework (BUSD-Agent) that models screening and diagnosis as a two-stage selective decision process: in the first stage, a lightweight 'screening clinic' agent restricted to classification-model tools filters out low-risk benign or normal cases, sparing them from further diagnostic escalation; in the second stage, a richer 'diagnostic clinic' agent integrates multimodal perception and radiological description tools to make biopsy-referral decisions for higher-risk cases. The framework stores pathology-confirmed outcomes, image embeddings, model predictions, and historical actions as structured decision trajectories in a memory bank, enabling retrieval-conditioned in-context adaptation based on similar past cases. This dynamically adjusts model trust and triage thresholds without parameter updates, learning from past experience to substantially lower diagnostic escalation and biopsy referral rates while improving specificity.
Link: https://arxiv.org/abs/2602.23899
Authors: Pramit Saha, Mohammad Alsharid, Joshua Strong, J. Alison Noble
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight 'screening clinic' agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the 'diagnostic clinic' agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology-confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD-Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent's current decision policy. This enables retrieval-conditioned in-context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience-guided workflow reduces diagnostic escalation in BUSD-Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.
[CV-49] SelfOccFlow: Towards end-to-end self-supervised 3D Occupancy Flow prediction
【速读】: This paper addresses 3D occupancy flow estimation for autonomous driving, i.e., jointly modeling the spatial distribution and motion of static and dynamic objects in the environment to improve perception of dynamic scenes. Existing methods depend on expensive 3D occupancy annotations, flow labels, or pretrained optical flow models, limiting their generality and practicality. The key to the solution is a self-supervised learning framework that disentangles the scene into separate static and dynamic signed distance fields, learns motion implicitly through temporal aggregation, and introduces a strong self-supervised flow cue derived from feature cosine similarities, thereby estimating 3D occupancy flow effectively without manual annotations or external flow supervision.
Link: https://arxiv.org/abs/2602.23894
Authors: Xavier Timoneda, Markus Herb, Fabian Duerr, Daniel Goehring
Affiliations: CARIAD SE, Volkswagen Group (大众集团); Freie Universität Berlin (柏林自由大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted version. Final version is published in IEEE Robotics and Automation Letters, DOI: https://doi.org/10.1109/LRA.2026.3665447
Abstract:Estimating 3D occupancy and motion at the vehicle’s surroundings is essential for autonomous driving, enabling situational awareness in dynamic environments. Existing approaches jointly learn geometry and motion but rely on expensive 3D occupancy and flow annotations, velocity labels from bounding boxes, or pretrained optical flow models. We propose a self-supervised method for 3D occupancy flow estimation that eliminates the need for human-produced annotations or external flow supervision. Our method disentangles the scene into separate static and dynamic signed distance fields and learns motion implicitly through temporal aggregation. Additionally, we introduce a strong self-supervised flow cue derived from features’ cosine similarities. We demonstrate the efficacy of our 3D occupancy flow method on SemanticKITTI, KITTI-MOT, and nuScenes.
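A flow cue from feature cosine similarities can be pictured as block matching: for each cell of one frame's feature grid, find the best-matching cell in the next frame within a search window. The sketch below is a toy 2-D analogue of that idea (the paper's exact formulation may differ):

```python
import numpy as np

def cosine_flow_cue(feat_t, feat_t1, radius=2):
    """Derive a coarse flow cue from feature cosine similarities.

    feat_t, feat_t1: (H, W, C) feature grids from consecutive frames. For each
    cell, the best cosine match within a (2*radius+1)^2 window of the next
    frame gives a displacement vector.
    """
    H, W, _ = feat_t.shape

    def unit(v):
        return v / (np.linalg.norm(v) + 1e-8)

    flow = np.zeros((H, W, 2), dtype=int)
    for y in range(H):
        for x in range(W):
            f = unit(feat_t[y, x])
            best, best_sim = (0, 0), -2.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        sim = float(f @ unit(feat_t1[ny, nx]))
                        if sim > best_sim:
                            best_sim, best = sim, (dy, dx)
            flow[y, x] = best
    return flow

# Features shifted right by one cell should produce a (0, +1) flow cue
# everywhere except at the wrap-around column.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(6, 6, 8))
feat_t1 = np.roll(feat_t, shift=1, axis=1)  # content moves one cell right
flow = cosine_flow_cue(feat_t, feat_t1)
```

Such a cue is self-supervised: it needs no flow labels, only the features themselves.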
[CV-50] AoE: Always-on Egocentric Human Video Collection for Embodied AI
【速读】: This paper addresses the difficulty of pre-training and scaling embodied foundation models caused by the scarcity of large-scale, high-quality real-world interaction data. Existing collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making them hard to scale. The key to the solution is the Always-on Egocentric (AoE) data collection system, which uses humans themselves and their smartphones as low-cost, sustainable collection agents and adopts a cloud-edge collaborative architecture for efficient, scene-agnostic collection of real-world interaction data: a wearable neck-mounted phone holder lowers the collection barrier, while a cross-platform mobile app performs real-time on-device processing and the cloud runs automated labeling and filtering pipelines, significantly improving data quality and downstream real-world generalization.
Link: https://arxiv.org/abs/2602.23893
Authors: Bowen Yang, Zishuo Li, Yang Sun, Changtao Miao, Yifan Yang, Man Luo, Xiaotong Yan, Feng Jiang, Jinchuan Shi, Yankai Fu, Ning Chen, Junkai Zhao, Pengwei Wang, Guocai Yao, Shanghang Zhang, Hao Chen, Zhe Li, Kai Zhu
Affiliations: Ant Digital Technologies, Ant Group (蚂蚁集团); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Zhejiang University (浙江大学); Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京智源研究院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Embodied foundation models require large-scale, high-quality real-world interaction data for pre-training and scaling. However, existing data collection methods suffer from high infrastructure costs, complex hardware dependencies, and limited interaction scope, making scalable expansion challenging. In fact, humans themselves are ideal physically embodied agents. Therefore, obtaining egocentric real-world interaction data from globally distributed “human agents” offers advantages of low cost and sustainability. To this end, we propose the Always-on Egocentric (AoE) data collection system, which aims to simplify hardware dependencies by leveraging humans themselves and their smartphones, enabling low-cost, highly efficient, and scene-agnostic real-world interaction data collection to address the challenge of data scarcity. Specifically, we first employ an ergonomic neck-mounted smartphone holder to enable low-barrier, large-scale egocentric data collection through a cloud-edge collaborative architecture. Second, we develop a cross-platform mobile APP that leverages on-device compute for real-time processing, while the cloud hosts automated labeling and filtering pipelines that transform raw videos into high-quality training data. Finally, the AoE system supports distributed Ego video data collection by anyone, anytime, and anywhere. We evaluate AoE on data preprocessing quality and downstream tasks, demonstrating that high-quality egocentric data significantly boosts real-world generalization.
[CV-51] DACESR: Degradation-Aware Conditional Embedding for Real-World Image Super-Resolution
【速读】: This paper addresses the limited performance of large multimodal models on degraded images, especially the difficulty of balancing fidelity and perceptual quality in image super-resolution. The key to the solution lies in two modules: first, a Real Embedding Extractor (REE) built with a degradation selection strategy, which uses contrastive learning to extract effective semantic embeddings in the degradation space and markedly improves recognition of degraded image content; second, a Conditional Feature Modulator (CFM) that injects the high-level semantics extracted by REE into a Mamba-based network, strengthening the model's ability to restore pixel-level details and produce higher-quality visual reconstructions.
Link: https://arxiv.org/abs/2602.23890
Authors: Xiaoyan Lei, Wenlong Zhang, Biao Luo, Hui Liang, Weifeng Cao, Qiuting Lin
Affiliations: Zhengzhou University of Light Industry (郑州轻工业大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Central South University (中南大学); Machinery Technology Development Co., Ltd., China Academy of Machinery Science and Technology Group Co., Ltd. (中国机械科学研究总院集团有限公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TIP
Abstract:Multimodal large models have shown excellent ability in addressing image super-resolution in real-world scenarios by leveraging language class as condition information, yet their abilities in degraded images remain limited. In this paper, we first revisit the capabilities of the Recognize Anything Model (RAM) for degraded images by calculating text similarity. We find that directly using contrastive learning to fine-tune RAM in the degraded space is difficult to achieve acceptable results. To address this issue, we employ a degradation selection strategy to propose a Real Embedding Extractor (REE), which achieves significant recognition performance gain on degraded image content through contrastive learning. Furthermore, we use a Conditional Feature Modulator (CFM) to incorporate the high-level information of REE for a powerful Mamba-based network, which can leverage effective pixel information to restore image textures and produce visually pleasing results. Extensive experiments demonstrate that the REE can effectively help image super-resolution networks balance fidelity and perceptual quality, highlighting the great potential of Mamba in real-world applications. The source code of this work will be made publicly available at: this https URL
[CV-52] Altitude-Aware Visual Place Recognition in Top-Down View
【速读】: This paper addresses the accuracy degradation of aerial visual place recognition (VPR) under significant altitude variations, caused by changes in viewpoint and ground-feature scale. The key to the solution is an altitude-adaptive VPR method whose core idea is to estimate relative altitude by analyzing the density of ground features in images, crop the images accordingly to produce canonical query images, and then perform robust localization with a classification-based strategy. The method requires no extra hardware (e.g., barometers or Time-of-Flight sensors) and improves retrieval accuracy across altitudes using vision alone: experiments show average R@1 and R@5 gains of 29.85% and 60.20% over conventional pipelines, and clear advantages over Monocular Metric Depth Estimation (MMDE) approaches, demonstrating strong practicality and scalability.
Link: https://arxiv.org/abs/2602.23872
Authors: Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:To address the challenge of aerial visual place recognition (VPR) problem under significant altitude variations, this study proposes an altitude-adaptive VPR approach that integrates ground feature density analysis with image classification techniques. The proposed method estimates airborne platforms’ relative altitude by analyzing the density of ground features in images, then applies relative altitude-based cropping to generate canonical query images, which are subsequently used in a classification-based VPR strategy for localization. Extensive experiments across diverse terrains and altitude conditions demonstrate that the proposed approach achieves high accuracy and robustness in both altitude estimation and VPR under significant altitude changes. Compared to conventional methods relying on barometric altimeters or Time-of-Flight (ToF) sensors, this solution requires no additional hardware and offers a plug-and-play solution for downstream applications, making it suitable for small- and medium-sized airborne platforms operating in diverse environments, including rural and urban areas. Under significant altitude variations, incorporating our relative altitude estimation module into the VPR retrieval pipeline boosts average R@1 and R@5 by 29.85% and 60.20%, respectively, compared with applying VPR retrieval alone. Furthermore, compared to traditional Monocular Metric Depth Estimation (MMDE) methods, the proposed method reduces the mean error by 202.1 m, yielding average additional improvements of 31.4% in R@1 and 44% in R@5. These results demonstrate that our method establishes a robust, vision-only framework for three-dimensional visual place recognition, offering a practical and scalable solution for accurate airborne platforms localization under large altitude variations and limited sensor availability.
[CV-53] Bandwidth-adaptive Cloud-Assisted 360-Degree 3D Perception for Autonomous Vehicles
【速读】: This paper addresses the difficulty of real-time environment perception for autonomous driving in complex urban settings under limited onboard compute and strict latency constraints. The core idea is to offload part of the computation to the cloud via Vehicle-to-Everything (V2X) communication, relieving the onboard processing load. The key lies in fusing multi-camera sensor data into a global Bird's-Eye View (BEV) representation with transformer-based models and dynamically splitting the computation between vehicle and cloud according to the number of locally processed layers and the feature quantization level, while clipping and compressing feature vectors to reduce network overhead. Experiments show that the hybrid strategy cuts end-to-end latency by 72% compared with a purely onboard solution, and that an adaptive optimization algorithm which adjusts the split point and quantization level under network fluctuations improves detection accuracy by up to 20% at the same latency.
Link: https://arxiv.org/abs/2602.23871
Authors: Faisal Hawladera, Rui Meireles, Gamal Elghazaly, Ana Aguiar, Raphaël Frank
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:A key challenge for autonomous driving lies in maintaining real-time situational awareness regarding surrounding obstacles under strict latency constraints. The high processing requirements coupled with limited onboard computational resources can cause delay issues, particularly in complex urban settings. To address this, we propose leveraging Vehicle-to-Everything (V2X) communication to partially offload processing to the cloud, where compute resources are abundant, thus reducing overall latency. Our approach utilizes transformer-based models to fuse multi-camera sensor data into a comprehensive Bird’s-Eye View (BEV) representation, enabling accurate 360-degree 3D object detection. The computation is dynamically split between the vehicle and the cloud based on the number of layers processed locally and the quantization level of the features. To further reduce network load, we apply feature vector clipping and compression prior to transmission. In a real-world experimental evaluation, our hybrid strategy achieved a 72 % reduction in end-to-end latency compared to a traditional onboard solution. To adapt to fluctuating network conditions, we introduce a dynamic optimization algorithm that selects the split point and quantization level to maximize detection accuracy while satisfying real-time latency constraints. Trace-based evaluation under realistic bandwidth variability shows that this adaptive approach improves accuracy by up to 20 % over static parameterization with the same latency performance.
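The adaptive split/quantization selection described above can be sketched as a small constrained search: pick the (split layer, bit width) pair that maximizes accuracy while the predicted end-to-end latency fits the budget. All profile and accuracy numbers below are illustrative assumptions, not values from the paper:

```python
# Hypothetical profile per split layer: (onboard ms, cloud ms, feature MB at 8-bit).
PROFILE = {
    2: (15.0, 30.0, 4.0),
    4: (30.0, 20.0, 2.0),
    6: (50.0, 10.0, 1.0),
}
# Assumed accuracy by (split_layer, bits): coarser quantization and earlier
# splits lose accuracy (illustrative values only).
ACCURACY = {
    (2, 8): 0.71, (2, 4): 0.66,
    (4, 8): 0.70, (4, 4): 0.67,
    (6, 8): 0.69, (6, 4): 0.68,
}

def pick_config(bandwidth_mbps, latency_budget_ms):
    """Choose (split_layer, bits) maximizing accuracy under the latency budget."""
    best, best_acc = None, -1.0
    for layer, (t_local, t_cloud, mb_at_8bit) in PROFILE.items():
        for bits in (8, 4):
            size_mb = mb_at_8bit * bits / 8.0
            t_tx = size_mb * 8.0 / bandwidth_mbps * 1000.0  # transfer time in ms
            latency = t_local + t_tx + t_cloud
            acc = ACCURACY[(layer, bits)]
            if latency <= latency_budget_ms and acc > best_acc:
                best, best_acc = (layer, bits), acc
    return best, best_acc

# Fast link: an early-ish split with full-precision features is feasible.
fast = pick_config(bandwidth_mbps=400.0, latency_budget_ms=120.0)
# Constrained link (looser 300 ms budget): the picker shifts to a later split
# so that less data crosses the network.
slow = pick_config(bandwidth_mbps=40.0, latency_budget_ms=300.0)
```

The real system would replace the static tables with profiled measurements and re-run the picker as bandwidth estimates change.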
[CV-54] Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
【速读】: This paper addresses the performance bottleneck of vision-language models such as CLIP in remote sensing (RS) semantic segmentation, caused by inappropriate interactions within the self-attention layers. The key to the solution is twofold: first, a hierarchical masking scheme that uses masks generated by the Segment Anything Model (SAM) to constrain self-attention interactions at multiple scales; second, a model composition approach that averages the parameters of multiple RS-specific CLIP variants using a weighting scheme that evaluates representational quality with varying text prompts, achieving state-of-the-art segmentation without any additional training.
Link: https://arxiv.org/abs/2602.23869
Authors: Mohammadreza Heidarianbaei, Mareike Dorozynski, Hubert Kanyamahanga, Max Mehltretter, Franz Rottensteiner
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in the proceedings of the British Machine Vision Conference Workshops 2025
Abstract:In this paper, we propose ReSeg-CLIP, a new training-free Open-Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the problems of vision language models, such as CLIP in semantic segmentation caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme utilizing masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
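Model composition by parameter averaging can be sketched in a few lines: take same-architecture checkpoints and form a weighted average of each tensor ("model soup" style). The quality-based weighting scheme from the paper is replaced here by caller-supplied weights that are simply normalized, so treat this as an assumption-laden sketch:

```python
import numpy as np

def compose_models(state_dicts, weights):
    """Weighted-average the parameters of several same-architecture checkpoints.

    state_dicts: list of {name: array} with identical keys and shapes.
    weights: per-checkpoint scores; normalized here to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(wi * sd[key] for wi, sd in zip(w, state_dicts))
    return merged

# Two toy "CLIP variant" checkpoints with a single projection tensor each;
# variant a is (hypothetically) judged higher quality, so it gets weight 3.
a = {"proj": np.array([[1.0, 0.0], [0.0, 1.0]])}
b = {"proj": np.array([[3.0, 0.0], [0.0, 3.0]])}
merged = compose_models([a, b], weights=[3.0, 1.0])
```

Because the composition happens purely in weight space, it adds no inference cost and requires no gradient updates, which is what makes the overall method training-free.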
[CV-55] APPO: Attention-guided Perception Policy Optimization for Video Reasoning
【速读】: This paper addresses the insufficiency of perception ability in complex video reasoning, observing that performance gains depend more on fine-grained perception than on high-level reasoning. With perception ability held nearly fixed, scaling the reasoning model (from Qwen3-8B to OpenAI-o3) improves performance by only 0.7%, whereas scaling the perception model from 7B to 32B yields 1.4%, indicating that enhancing perception matters more than strengthening reasoning. The key to the solution is APPO (Attention-guided Perception Policy Optimization), which uses token-level dense rewards to optimize "intra-group perception tokens", i.e., tokens from different responses that focus on the same crucial video frame, thereby improving the model's perception ability through reasoning without expensive fine-grained annotations.
Link: https://arxiv.org/abs/2602.23823
Authors: Henghui Du, Chang Zhou, Xi Chen, Di Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Complex video reasoning, actually, relies excessively on fine-grained perception rather than on expert (e.g., Ph.D, Science)-level reasoning. Through extensive empirical observation, we have recognized the critical impact of perception. In particular, when perception ability is almost fixed, enhancing reasoning from Qwen3-8B to OpenAI-o3 yields only 0.7% performance improvement. Conversely, even minimal change in perception model scale (from 7B to 32B) boosts performance by 1.4%, indicating enhancing perception, rather than reasoning, is more critical to improve performance. Therefore, exploring how to enhance perception ability through reasoning without the need for expensive fine-grained annotation information is worthwhile. To achieve this goal, we specially propose APPO, the Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve model’s fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and models with different scales (3/7B) demonstrate APPO consistently outperforms GRPO and DAPO (0.5%~4%). We hope our work provides a promising approach to effectively enhance model’s perception abilities through reasoning in a low-cost manner, serving diverse scenarios and demands.
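One plausible reading of "intra-group perception tokens" is: across a group of sampled responses, find the frame most tokens attend to, and flag the tokens peaking on that consensus frame for a dense reward. The sketch below is an illustrative interpretation of that idea, not APPO's exact algorithm:

```python
import numpy as np

def intra_group_perception_tokens(attn_per_response):
    """Toy selection of intra-group perception tokens.

    attn_per_response: list of (num_tokens, num_frames) attention arrays,
    one per sampled response. The consensus key frame is the frame most
    often attended to; tokens peaking on it are flagged for dense reward.
    """
    peaks = [a.argmax(axis=1) for a in attn_per_response]
    all_peaks = np.concatenate(peaks)
    key_frame = int(np.bincount(all_peaks).argmax())  # consensus frame
    masks = [p == key_frame for p in peaks]           # tokens to reward
    return key_frame, masks

rng = np.random.default_rng(0)
# 3 responses, 5 tokens each, attention over 4 frames; the first three
# tokens of every response are biased toward frame 2.
attn = []
for _ in range(3):
    a = rng.random((5, 4))
    a[:3, 2] += 1.0
    attn.append(a / a.sum(axis=1, keepdims=True))
key_frame, masks = intra_group_perception_tokens(attn)
```

Rewarding only these masked tokens gives a dense, perception-focused signal that plain sequence-level rewards (as in GRPO) cannot provide.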
[CV-56] Denoising-Enhanced YOLO for Robust SAR Ship Detection
【速读】: This paper addresses the challenges of ship detection in complex synthetic aperture radar (SAR) scenes, where clutter and speckle noise cause false alarms and small targets are easily missed. The key to the solution is the CPN-YOLO framework, which achieves high-precision detection through three targeted improvements: a learnable large-kernel denoising module for input pre-processing that yields more discriminative features; a feature-extraction enhancement strategy based on the PPA attention mechanism that strengthens multi-scale modeling and sensitivity to small ships; and a Gaussian similarity loss based on the normalized Wasserstein distance (NWD) that better measures similarity under complex bounding-box distributions and improves generalization.
Link: https://arxiv.org/abs/2602.23820
Authors: Xiaojing Zhao, Shiyang Li, Zena Chu, Ying Zhang, Peinan Hao, Tianzi Yan, Jiajia Chen, Huicong Ning
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the rapid advancement of deep learning, synthetic aperture radar (SAR) imagery has become a key modality for ship detection. However, robust performance remains challenging in complex scenes, where clutter and speckle noise can induce false alarms and small targets are easily missed. To address these issues, we propose CPN-YOLO, a high-precision ship detection framework built upon YOLOv8 with three targeted improvements. First, we introduce a learnable large-kernel denoising module for input pre-processing, producing cleaner representations and more discriminative features across diverse ship types. Second, we design a feature extraction enhancement strategy based on the PPA attention mechanism to strengthen multi-scale modeling and improve sensitivity to small ships. Third, we incorporate a Gaussian similarity loss derived from the normalized Wasserstein distance (NWD) to better measure similarity under complex bounding-box distributions and improve generalization. Extensive experiments on HRSID and SSDD demonstrate the effectiveness of our method. On SSDD, CPN-YOLO surpasses the YOLOv8 baseline, achieving 97.0% precision, 95.1% recall, and 98.9% mAP, and consistently outperforms other representative deep-learning detectors in overall performance.
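Assuming the NWD term follows the standard normalized Gaussian Wasserstein distance formulation (boxes modeled as 2-D Gaussians, with a closed-form 2-Wasserstein distance between them), a minimal sketch looks like this; the normalizer C is dataset-dependent, and the value used here is illustrative:

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein distance between two axis-aligned boxes.

    Boxes are (cx, cy, w, h), modeled as Gaussians N([cx, cy], diag(w^2/4, h^2/4)).
    The squared 2-Wasserstein distance between such Gaussians reduces to the
    closed form below; exp(-sqrt(.)/C) maps it into (0, 1] like an IoU.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2.0) ** 2 + ((ha - hb) / 2.0) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

same = nwd((10, 10, 4, 4), (10, 10, 4, 4))     # identical boxes -> 1.0
shifted = nwd((10, 10, 4, 4), (13, 14, 4, 4))  # center offset (3, 4) -> exp(-5/C)
```

Unlike IoU, this stays smooth and non-zero even when two small boxes do not overlap at all, which is why it suits tiny ships in SAR imagery.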
[CV-57] Footprint-Guided Exemplar-Free Continual Histopathology Report Generation
【速读】: This paper addresses continual learning for pathology image-to-report generation as clinical settings evolve (new organs, institutions, and reporting conventions), in particular avoiding catastrophic forgetting during model updates. The key to the solution is an exemplar-free continual learning framework that stores no raw slides or patch exemplars: it builds a compact domain footprint in a frozen patch-embedding space, comprising a codebook of representative morphology tokens, slide-level co-occurrence summaries, and lightweight patch-count priors. The footprint supports generative replay by synthesizing pseudo whole-slide representations that reflect domain-specific morphological mixtures, supervised with pseudo-reports from a teacher snapshot, enabling knowledge transfer without retaining past data. To handle shifting reporting styles, domain-specific linguistic characteristics are additionally distilled into a compact style descriptor, and at inference the most compatible descriptor is identified directly from the image signal, allowing domain-agnostic deployment without explicit domain identifiers.
Link: https://arxiv.org/abs/2602.23817
Authors: Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour, Yousef Sadegheih, Priyankar Choudhary, Dorit Merhof
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.
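A domain footprint of "representative morphology tokens" plus usage statistics can be approximated with plain clustering in the frozen embedding space. The sketch below uses vanilla k-means as a stand-in (the paper's construction may differ) and a usage histogram as a stand-in for the slide-level co-occurrence summaries:

```python
import numpy as np

def build_footprint(patch_embs, k=4, iters=20, seed=0):
    """Build a tiny domain footprint from frozen patch embeddings.

    Returns a k-entry codebook of "morphology tokens" (k-means centroids)
    plus a usage histogram over the codebook.
    """
    rng = np.random.default_rng(seed)
    centroids = patch_embs[rng.choice(len(patch_embs), k, replace=False)]
    for _ in range(iters):
        # Assign each patch embedding to its nearest centroid, then update.
        d = np.linalg.norm(patch_embs[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = patch_embs[assign == j].mean(axis=0)
    hist = np.bincount(assign, minlength=k) / len(patch_embs)
    return centroids, hist

# Synthetic "domain": two well-separated morphology clusters in a 2-D space.
rng = np.random.default_rng(1)
embs = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
codebook, hist = build_footprint(embs, k=2)
```

Sampling codebook entries according to the histogram then yields pseudo slide representations for generative replay, with storage cost of only k vectors per domain rather than gigapixel slides.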
[CV-58] Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation CVPR2026
【速读】: This paper addresses the limitations of existing bimanual manipulation methods, which either rely on 2D features with limited spatial awareness or require explicit point clouds that are hard to obtain reliably in real-world settings. The core of the solution is an end-to-end policy framework built on a pre-trained 3D geometric foundation model: it fuses geometry-aware latent representations, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a corresponding future 3D latent that decodes into a dense pointmap. By explicitly predicting how the scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive accuracy from RGB observations alone, outperforming baselines in both simulation and real-robot settings.
Link: https://arxiv.org/abs/2602.23814
Authors: Chongyang Xu, Haipeng Li, Shen Cheng, Jingyu Hu, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu
Affiliations: Sichuan University (四川大学); University of Electronic Science and Technology of China (电子科技大学); Dexmal; The Chinese University of Hong Kong (香港中文大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at this https URL.
[CV-59] See Act Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent
【速读】: This paper addresses the significant performance degradation of pre-trained perception models in novel environments such as indoor scenes; conventional fine-tuning improves downstream performance but causes catastrophic forgetting and requires costly scene-specific annotations. The key to the solution is a paradigm shift, Sea^2 (See, Act, Adapt): instead of adapting the perception modules themselves, an intelligent pose-control agent adapts how they are deployed. All perception modules stay frozen, no downstream labels are needed during training, and only scalar feedback from the perception modules drives the agent toward informative viewpoints. Concretely, a vision-language model (VLM) is turned into a low-level pose controller through two-stage training: first fine-tuning on rule-based exploration trajectories, then refining the policy via unsupervised reinforcement learning whose rewards are constructed from the perception modules' outputs and confidences. This adapts off-the-shelf perception models to diverse tasks without retraining and substantially improves visual grounding, segmentation, and 3D box estimation.
Link: https://arxiv.org/abs/2602.23806
Authors: Tianci Tang, Tielong Cai, Hongwei Wang, Gaoang Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pre-trained perception models excel in generic image domains but degrade significantly in novel environments like indoor scenes. The conventional remedy is fine-tuning on downstream data which incurs catastrophic forgetting of prior knowledge and demands costly, scene-specific annotations. We propose a paradigm shift through Sea^2 (See, Act, Adapt): rather than adapting the perception modules themselves, we adapt how they are deployed through an intelligent pose-control agent. Sea^2 keeps all perception modules frozen, requiring no downstream labels during training, and uses only scalar perceptual feedback to navigate the agent toward informative viewpoints. Specifically, we transform a vision-language model (VLM) into a low-level pose controller through a two-stage training pipeline: first fine-tuning it on rule-based exploration trajectories that systematically probe indoor scenes, and then refining the policy via unsupervised reinforcement learning that constructs rewards from the perception module's outputs and confidence. Unlike prior active perception methods that couple exploration with specific models or collect data for retraining them, Sea^2 directly leverages off-the-shelf perception models for various tasks without the need for retraining. We conducted experiments on three visual perception tasks, including visual grounding, segmentation and 3D box estimation, with performance improvements of 13.54%, 15.92% and 27.68%, respectively, on the ReplicaCAD dataset.
[CV-60] EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models CVPR2026
【速读】: This paper addresses the difficulty multimodal large language models (MLLMs) have in capturing the complexity and subjectivity of human emotions in emotional reasoning tasks: supervised fine-tuning approaches generalize poorly and lack interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. The key to the solution is the Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3) framework, whose core innovations are Structured Emotional Thinking, which guides the model through step-by-step, interpretable emotional reasoning, and a Reflective Emotional Reward that lets the model re-evaluate its own reasoning based on visual-text consistency and emotional coherence, significantly improving the emotional intelligence and interpretability of MLLMs.
Link: https://arxiv.org/abs/2602.23802
Authors: Yiyang Fang, Wenke Huang, Pei Fu, Yihao Yang, Kehua Su, Zhenbo Luo, Jian Luan, Mang Ye
Affiliations: Wuhan University (武汉大学); Xiaomi Inc. (小米公司)
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual reasoning and understanding tasks but still struggle to capture the complexity and subjectivity of human emotions. Existing approaches based on supervised fine-tuning often suffer from limited generalization and poor interpretability, while reinforcement learning methods such as Group Relative Policy Optimization fail to align with the intrinsic characteristics of emotional cognition. To address these challenges, we propose Reflective Reinforcement Learning for Emotional Reasoning (EMO-R3), a framework designed to enhance the emotional reasoning ability of MLLMs. Specifically, we introduce Structured Emotional Thinking to guide the model to perform step-by-step emotional reasoning in a structured and interpretable manner, and design a Reflective Emotional Reward that enables the model to re-evaluate its reasoning based on visual-text consistency and emotional coherence. Extensive experiments demonstrate that EMO-R3 significantly improves both the interpretability and emotional intelligence of MLLMs, achieving superior performance across multiple visual emotional understanding benchmarks.
[CV-61] Fourier Angle Alignment for Oriented Object Detection in Remote Sensing CVPR2026
【速读】: This paper addresses two core bottlenecks in oriented object detection for remote sensing imagery: directional incoherence at the detector neck and task conflict at the detection head. The key to the solution is Fourier Angle Alignment (FAA), which exploits Fourier rotation equivariance to analyze angle information in features and align the principal direction to a canonical orientation. Building on this, two plug-and-play modules are proposed: FAAFusion, which aligns the principal direction of higher-level features to that of lower-level features at the detector neck before fusing them, and FAA Head, a new detection head that pre-aligns RoI features to a canonical angle and fuses them with the original features before classification and regression. Experiments on DOTA-v1.0, HRSC2016, and other datasets show substantial gains over prior work, achieving new state-of-the-art results.
Link: https://arxiv.org/abs/2602.23790
Authors: Changyu Gu, Linwei Chen, Lin Gu, Ying Fu
Affiliations: Beijing Institute of Technology (北京理工大学); University of Hong Kong (香港大学); Tohoku University (东北大学); The Hong Kong University of Science and Technology (香港科技大学); Hong Kong Generative AI Research and Development Center (香港生成式人工智能研究与发展中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:In remote sensing rotated object detection, mainstream methods suffer from two bottlenecks, directional incoherence at the detector neck and task conflict at the detecting head. Utilising Fourier rotation equivariance, we introduce Fourier Angle Alignment, which analyses angle information through the frequency spectrum and aligns the main direction to a certain orientation. Then we propose two plug-and-play modules: FAAFusion and FAA Head. FAAFusion works at the detector neck, aligning the main direction of higher-level features to the lower-level features and then fusing them. FAA Head serves as a new detection head, which pre-aligns RoI features to a canonical angle and adds them to the original features before classification and regression. Experiments on DOTA-v1.0, DOTA-v1.5 and HRSC2016 show that our method can greatly improve previous work. Particularly, our method achieves new state-of-the-art results of 78.72% mAP on DOTA-v1.0 and 72.28% mAP on DOTA-v1.5 datasets with single scale training and testing, validating the efficacy of our approach in remote sensing object detection. The code is made publicly available at this https URL .
[CV-62] Diffusion Probe: Generated Image Result Prediction Using CNN Probes
【速读】: This paper addresses the lack of an efficient early quality-assessment mechanism for text-to-image (T2I) diffusion models in multi-generation scenarios, where blind generation wastes compute. The key to the solution is the Diffusion Probe framework, which identifies a strong correlation between early cross-attention distributions in the diffusion process and final image quality, and designs a lightweight predictor that maps cross-attention statistics extracted from the initial denoising steps to an accurate prediction of final image quality. This allows generation outcomes to be judged well before synthesis completes, markedly improving generation efficiency and output quality.
Link: https://arxiv.org/abs/2602.23783
Authors: Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong
Affiliations: Alibaba Group (阿里巴巴集团); Laboratory for Statistical Monitoring and Intelligent Governance of Common Prosperity, School of Statistics and Data Science, Zhejiang Gongshang University (浙江工商大学统计与数据科学学院统计监测与共同富裕智能治理实验室); Southeast University (东南大学); College of Intelligent Robotics and Advanced Manufacturing, Fudan University (复旦大学智能机器人与先进制造学院); ByteDance Inc. (字节跳动)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image (T2I) diffusion models lack an efficient mechanism for early quality assessment, leading to costly trial-and-error in multi-generation scenarios such as prompt iteration, agent-based generation, and flow-grpo. We reveal a strong correlation between early diffusion cross-attention distributions and final image quality. Based on this finding, we introduce Diffusion Probe, a framework that leverages internal cross-attention maps as predictive signals. We design a lightweight predictor that maps statistical properties of early-stage cross-attention extracted from initial denoising steps to the final image’s overall quality. This enables accurate forecasting of image quality across diverse evaluation metrics long before full synthesis is complete. We validate Diffusion Probe across a wide range of settings. On multiple T2I models, across early denoising windows, resolutions, and quality metrics, it achieves strong correlation (PCC 0.7) and high classification performance (AUC-ROC 0.9). Its reliability translates into practical gains. By enabling early quality-aware decisions in workflows such as prompt optimization, seed selection, and accelerated RL training, the probe supports more targeted sampling and avoids computation on low-potential generations. This reduces computational overhead while improving final output quality. Diffusion Probe is model-agnostic, efficient, and broadly applicable, offering a practical solution for improving T2I generation efficiency through early quality prediction. 
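The probe idea, mapping statistics of early cross-attention maps to a quality estimate, can be sketched with hypothetical features and a toy linear probe. The feature choices (entropy, peak, spatial spread) and the weights below are illustrative assumptions, not the paper's learned predictor:

```python
import numpy as np

def attention_stats(attn_maps):
    """Summarize early-step cross-attention maps into a small feature vector.

    attn_maps: (num_tokens, H, W), non-negative. Features per map: entropy,
    peak value, and spatial spread about the attention centroid; averaged
    over tokens. These are hypothetical stand-ins for the probe's inputs.
    """
    feats = []
    for a in attn_maps:
        p = a / (a.sum() + 1e-8)
        entropy = -(p * np.log(p + 1e-12)).sum()
        peak = p.max()
        ys, xs = np.indices(p.shape)
        cy, cx = (p * ys).sum(), (p * xs).sum()
        spread = (p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum()
        feats.append((entropy, peak, spread))
    return np.mean(feats, axis=0)

def quality_score(attn_maps, w=(-0.5, 4.0, -0.05), b=1.0):
    """Map the stats to a [0, 1] quality estimate with a toy logistic probe.

    Illustrative weights: focused attention (low entropy, high peak, low
    spread) is scored as more likely to produce a good final image.
    """
    f = attention_stats(attn_maps)
    z = float(np.dot(w, f)) + b
    return 1.0 / (1.0 + np.exp(-z))

H = W = 8
focused = np.zeros((2, H, W)); focused[:, 4, 4] = 1.0  # sharp, object-centered
diffuse = np.ones((2, H, W))                           # flat, unfocused
good = quality_score(focused)
bad = quality_score(diffuse)
```

Since the stats come from the first few denoising steps only, a low score can abort a run early and re-sample a new seed or prompt at a fraction of the full-synthesis cost.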
[CV-63] OptiAgent: A Physics-Driven Agentic Framework for Automated Optical Design
【速读】:该论文旨在解决光学设计中高度非凸优化问题依赖人工经验与领域知识、且大型语言模型(LLM)在实际镜头系统设计中能力受限的问题。其关键解决方案在于构建了一个名为OptiDesignQA的综合性数据集,涵盖经典镜头系统和自动化算法生成的新颖配置,并通过全系统合成与镜片补全相结合的混合目标将领域专业知识注入LLM;同时引入基于光学词典奖励(Optical Lexicographic Reward)的Group Relative Policy Optimization Done Right(DrGRPO)策略,以结构格式、物理可行性、光调控精度及LLM启发式规则为多维奖励信号实现物理驱动的策略对齐;最终结合专用光学优化流程进行端到端微调与精调,显著提升了无光学训练背景用户设计功能性镜头系统的成功率。
链接: https://arxiv.org/abs/2602.23761
作者: Yuyu Geng,Lei Sun,Yao Gao,Xinxin Hu,Zhonghua Yi,Xiaolong Qian,Weijian Hu,Jian Bai,Kaiwei Wang
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical design is the process of configuring optical elements to precisely manipulate light for high-fidelity imaging. It is inherently a highly non-convex optimization problem that relies heavily on human heuristic expertise and domain-specific knowledge. While Large Language Models (LLMs) possess extensive optical knowledge, their capabilities in leveraging that knowledge to design lens systems remain significantly constrained. This work represents the first attempt to employ LLMs in the field of optical design. We bridge the expertise gap by enabling users without formal optical training to successfully develop functional lens systems. Concretely, we curate a comprehensive dataset, named OptiDesignQA, which encompasses both classical lens systems sourced from standard optical textbooks and novel configurations generated by automated design algorithms for training and evaluation. Furthermore, we inject domain-specific optical expertise into the LLM through a hybrid objective of full-system synthesis and lens completion. To align the model with optical principles, we employ Group Relative Policy Optimization Done Right (DrGRPO) guided by Optical Lexicographic Reward for physics-driven policy alignment. This reward system incorporates structural format rewards, physical feasibility rewards, light-manipulation accuracy, and LLM-based heuristics. Finally, our model integrates with specialized optical optimization routines for end-to-end fine-tuning and precision refinement. We benchmark our proposed method against both traditional optimization-based automated design algorithms and LLM counterparts, and experimental results show the superiority of our method.
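摘要中提到的光学词典序奖励(Optical Lexicographic Reward)按优先级组合多个奖励项:高优先级约束未通过时,低优先级奖励不再计入。下面用一个极简函数示意这种截断逻辑(各奖励项名称与数值均为假设,非论文官方实现):

```python
def lexicographic_reward(checks):
    """按字典序组合多级奖励。
    checks: [(passed, reward), ...],按优先级从高到低排列。"""
    total = 0.0
    for passed, reward in checks:
        if not passed:
            return total  # 高优先级失败,截断后续所有奖励
        total += reward
    return total

# 依次为:结构格式、物理可行性、光调控精度(名称与数值均为假设)
r_ok = lexicographic_reward([(True, 0.2), (True, 0.3), (True, 0.5)])
r_bad = lexicographic_reward([(True, 0.2), (False, 0.3), (True, 0.5)])
```

这种设计保证策略优化先满足结构与物理约束,再追求光调控精度,避免模型用高精度分数"补偿"物理上不可行的设计。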
[CV-64] Learning Accurate Segmentation Purely from Self-Supervision
【速读】:该论文旨在解决无监督图像中目标分割问题,即在无需任何人工标注的情况下实现精确的前景-背景分离。其核心挑战在于如何从原始图像中自动学习具有判别性的对象表示并生成高质量的分割掩码。解决方案的关键在于提出一个完全自监督的框架Selfment,其创新性地结合了两个核心机制:首先,利用自监督特征构建patch级亲和图,并通过NCut算法获得初始粗粒度的前景-背景划分;其次,引入迭代patch优化(Iterative Patch Optimization, IPO)策略,在特征空间中通过逐轮patch聚类逐步增强空间一致性和语义一致性,从而生成更精细的掩码作为监督信号,训练轻量级分割头以学习稳定且可迁移的对象表征。这一设计使得模型在无需人工标注或预训练分割模型的前提下,实现了对多种基准数据集的显著性能提升,并展现出强大的零样本泛化能力。
链接: https://arxiv.org/abs/2602.23759
作者: Zuyao You,Zuxuan Wu,Yu-Gang Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately segmenting objects without any manual annotations remains one of the core challenges in computer vision. In this work, we introduce Selfment, a fully self-supervised framework that segments foreground objects directly from raw images without human labels, pretrained segmentation models, or any post-processing. Selfment first constructs patch-level affinity graphs from self-supervised features and applies NCut to obtain an initial coarse foreground–background separation. We then introduce Iterative Patch Optimization (IPO), a feature-space refinement procedure that progressively enforces spatial coherence and semantic consistency through iterative patch clustering. The refined masks are subsequently used as supervisory signals to train a lightweight segmentation head with contrastive and region-consistency objectives, allowing the model to learn stable and transferable object representations. Despite its simplicity and complete absence of manual supervision, Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks. It achieves substantial improvements on F_\max over previous unsupervised saliency detection methods on ECSSD ( +4.0% ), HKUIS ( +4.6% ), and PASCAL-S ( +5.7% ). Moreover, without any additional fine-tuning, Selfment demonstrates remarkable zero-shot generalization to camouflaged object detection tasks (e.g., 0.910 S_m on CHAMELEON and 0.792 F_\beta^\omega on CAMO), outperforming all existing unsupervised approaches and even rivaling the SoTA fully supervised methods.
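Selfment 的第一步是由自监督 patch 特征构建亲和图,并用 NCut 做粗粒度前景/背景二分。下面用 numpy 给出谱二分(取归一化拉普拉斯第二小特征向量,即 Fiedler 向量)的极简示意,合成数据与中位数阈值策略均为假设,非论文官方实现:

```python
import numpy as np

def ncut_bipartition(features):
    """基于 patch 特征的谱二分。features: (N, D) 自监督 patch 特征。"""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    W = np.clip(f @ f.T, 0, None) + 1e-3   # 非负余弦亲和度;加微小值保证图连通
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt  # 对称归一化拉普拉斯
    vals, vecs = np.linalg.eigh(L_sym)
    fiedler = vecs[:, 1]                   # 第二小特征向量给出近似最小归一化割
    return fiedler > np.median(fiedler)    # 粗前景/背景划分

rng = np.random.default_rng(0)
# 两簇合成 patch 特征,模拟前景与背景
feats = np.vstack([rng.normal(0, 0.1, (20, 8)) + 1.0,
                   rng.normal(0, 0.1, (20, 8)) - 1.0])
mask = ncut_bipartition(feats)
```

论文在此粗分割之上再用迭代 patch 优化(IPO)与轻量分割头细化;此处仅示意初始二分这一步。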
[CV-65] Neural Image Space Tessellation
【速读】:该论文旨在解决高复杂度几何 tessellation(镶嵌)在实时渲染中计算开销大、难以扩展的问题,尤其是当场景规模较大时,传统基于几何的镶嵌方法会显著增加渲染负担。解决方案的关键在于提出 Neural Image-Space Tessellation (NIST),首次将镶嵌操作从预渲染的几何处理阶段迁移至屏幕空间的神经后处理阶段;其核心机制是利用几何法向量与着色法向量之间的差异作为视图相关的最小线索,通过多尺度卷积算子逐步变形图像空间轮廓,并结合隐式形变机制同步重分配外观信息,从而在保持纹理一致性和视觉保真度的同时,实现与几何复杂度无关的恒定帧成本,适用于大规模实时渲染场景。
链接: https://arxiv.org/abs/2602.23754
作者: Youyang Du(1 and 2),Junqiu Zhu(1),Zheng Zeng(3),Lu Wang(1),Lingqi Yan(2) ((1) Shandong University, (2) Mohamed bin Zayed University of Artificial Intelligence, (3) University of California, Santa Barbara)
机构: Shandong University (山东大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of California, Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Neural Image-Space Tessellation (NIST), a lightweight screen-space post-processing approach that produces the visual effect of tessellated geometry while rendering only the original low-polygon meshes. Inspired by our observation from Phong tessellation, NIST leverages the discrepancy between geometric normals and shading normals as a minimal, view-dependent cue for silhouette refinement. At its core, NIST performs multi-scale neural tessellation by progressively deforming image-space contours with convolutional operators, while jointly reassigning appearance information through an implicit warping mechanism to preserve texture coherence and visual fidelity. Experiments demonstrate that our approach produces smooth, visually coherent silhouettes comparable to geometric tessellation, while incurring a constant per-frame cost and fully decoupled from geometric complexity, making it well-suited for large-scale real-time rendering scenarios. To the best of our knowledge, our NIST is the first work to reformulate tessellation as a post-processing operation, shifting it from a pre-rendering geometry pipeline to a screen space neural post-processing stage.
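NIST 的核心线索是几何法向量与着色法向量之间的差异。下面示意如何逐像素计算二者的夹角作为轮廓细化线索(玩具数据,非论文官方实现;实际系统中该线索还需输入多尺度卷积网络):

```python
import numpy as np

def normal_discrepancy(geo_normals, shade_normals):
    """计算几何法向量与着色法向量的逐像素夹角(弧度)。
    两者形状均为 (H, W, 3)。差异越大,越可能位于需细化的轮廓附近。"""
    g = geo_normals / (np.linalg.norm(geo_normals, axis=-1, keepdims=True) + 1e-8)
    s = shade_normals / (np.linalg.norm(shade_normals, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip((g * s).sum(-1), -1.0, 1.0)
    return np.arccos(cos)

H, W = 4, 4
geo = np.zeros((H, W, 3)); geo[..., 2] = 1.0        # 低模面片的几何法向
shade = geo.copy(); shade[0, 0] = [0.0, 1.0, 0.0]   # 某像素的插值着色法向明显偏转
cue = normal_discrepancy(geo, shade)
```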
[CV-66] U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation CVPR2026
【速读】:该论文旨在解决当前智能体在实时全栈多模态交互中存在的两大核心问题:一是现有系统多局限于单模态生成,难以实现语言、语音、动作与视频等多模态的协同生成;二是跨模态对齐能力差且推理性能下降,导致交互不连贯且缺乏感知基础。解决方案的关键在于提出U-Mind系统,其核心是一个统一对齐与推理框架(Unified Alignment and Reasoning Framework),通过分段对齐策略提升跨模态同步性,并借助回放驱动学习(Rehearsal-Driven Learning)保持推理能力。此外,U-Mind采用文本优先解码流程,在内部进行链式思维规划后实现多模态时序同步生成,并结合基于姿态和语音条件的实时视频渲染机制,从而实现高保真、感知一致的交互反馈。
链接: https://arxiv.org/abs/2602.23739
作者: Xiang Deng,Feng Gao,Yong Zhang,Youxin Pang,Xu Xiaoming,Zhuoliang Kang,Xiaoming Wei,Yebin Liu
机构: Tsinghua University (清华大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026
Abstract:Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.
[CV-67] A Difference-in-Difference Approach to Detecting AI-Generated Images
【速读】:该论文旨在解决生成式 AI(Generative AI)所生成图像与真实图像日益趋近导致现有检测方法失效的问题。当前多数检测器依赖重建误差(reconstruction error)作为判别依据,但随着生成图像质量提升,此类一阶差异特征变得不敏感。解决方案的关键在于提出一种新的“差中差”(difference-in-difference)方法:不再直接使用重建误差(一阶差异),而是计算重建误差的变化量(即二阶差异),通过降低方差来提升检测准确率,从而在生成式 AI 高质量图像泛滥的背景下实现更可靠的伪造图像识别。
链接: https://arxiv.org/abs/2602.23732
作者: Xinyi Qi,Kai Ye,Chengchun Shi,Ying Yang,Hongyi Zhou,Jin Zhu
机构: Tsinghua University (清华大学); The London School of Economics and Political Science (伦敦政治经济学院); University of Birmingham (伯明翰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models are able to produce AI-generated images that are almost indistinguishable from real ones. This raises concerns about their potential misuse and poses substantial challenges for detecting them. Many existing detectors rely on reconstruction error – the difference between the input image and its reconstructed version – as the basis for distinguishing real from fake images. However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. Instead of directly using the reconstruction error (a first-order difference), we compute the difference in reconstruction error – a second-order difference – for variance reduction and improving detection accuracy. Extensive experiments demonstrate that our method achieves strong generalization performance, enabling reliable detection of AI-generated images in the era of generative AI.
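该方法的核心是以重建误差的变化量(二阶差异)替代重建误差本身(一阶差异)作为检测统计量。下面给出一个极简示意,其中的"重建"算子用一个假设的收缩算子代替真实的扩散模型反演重建:

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    """单次重建误差(一阶差异)。"""
    return float(np.mean((image - reconstruct(image)) ** 2))

def did_statistic(image, reconstruct):
    """差中差统计量:对图像重建一次后再测一次重建误差,
    取两次误差之差(二阶差异)以削减方差(示意,非论文官方实现)。"""
    e1 = reconstruction_error(image, reconstruct)
    e2 = reconstruction_error(reconstruct(image), reconstruct)
    return e1 - e2

rng = np.random.default_rng(0)
# 假设的"重建"算子:向均值收缩;真实方法中应为扩散模型的反演重建
shrink = lambda x: 0.5 * x + 0.5 * x.mean()
img = rng.random((8, 8))
stat = did_statistic(img, shrink)
```

按摘要的思路,真实图像与生成图像在二阶差异上的分布差别比一阶差异更稳定,从而在生成质量逼近真实图像时仍可区分。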
[CV-68] StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation
【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中因依赖二维图像到动作的直接映射而难以建模三维空间结构和时序世界动态的问题,从而限制了其在复杂动态环境中的空间推理能力和长程决策性能。解决方案的关键在于提出StemVLA框架,该框架通过两个核心机制实现:一是预测结构化的未来三维空间几何知识,使模型能够提前感知场景结构与物体配置;二是利用预训练的视频几何Transformer骨干网络提取历史图像帧中的隐式三维表示,并通过时间注意力模块(VideoFormer)聚合形成统一的四维时空表征(4D spatiotemporal representation),从而联合建模二维观测、预测的三维未来结构及聚合的四维时序动态,显著增强机器人对环境的综合理解能力。
链接: https://arxiv.org/abs/2602.23721
作者: Jiasong Xiao,Yutao She,Kai Li,Yuyang Sha,Ziang Cheng,Ziang Tong
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct mappings from 2D visual inputs to action sequences, without explicitly modeling the underlying 3D spatial structure or temporal world dynamics. Such representations may limit spatial reasoning and long-horizon decision-making in dynamic environments. To address this limitation, we propose StemVLA, a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D spatiotemporal representations into action prediction. First, instead of relying solely on observed images, StemVLA forecasts structured 3D future spatial-geometric world knowledge, enabling the model to anticipate upcoming scene geometry and object configurations. Second, to capture temporal consistency and motion dynamics, we feed historical image frames into a pretrained video-geometry transformer backbone to extract implicit 3D world representations, and further aggregate them across time using a temporal attention module, termed VideoFormer [20], forming a unified 4D historical spatiotemporal representation. By jointly modeling 2D observations, predicted 3D future structure, and aggregated 4D temporal dynamics, StemVLA enables more comprehensive world understanding for robot manipulation. Extensive experiments in simulation demonstrate that StemVLA significantly improves long-horizon task success and achieves state-of-the-art performance on the CALVIN ABC-D benchmark [46], achieving an average sequence length of XXX.
[CV-69] Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?
【速读】:该论文旨在解决统一多模态大语言模型(Unified Multimodal Large Language Models, U-MLLMs)在跨模态语义一致性方面的缺陷问题,即现有模型虽能在文本模态中表现出较强的推理能力,但在生成图像模态答案时无法保持与文本模态一致的语义表达。解决方案的关键在于提出VGUBench框架,通过三个诊断任务——文本生成理解(Textual Generative Understanding)、视觉生成理解(Visual Generative Understanding)和视觉渲染控制任务(Visual Rendering control task),将推理逻辑与生成保真度解耦,从而精准识别性能瓶颈。实证结果表明,U-MLLMs的失败根源并非生成质量不足,而是跨模态语义对齐机制的失效,为未来统一生成与理解模型的设计提供了关键诊断依据。
链接: https://arxiv.org/abs/2602.23711
作者: Hongbo Jiang,Jie Li,Yunhang Shen,Pingyang Dai,Xing Sun,Haoyu Cao,Liujuan Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Equal contribution by Jie Li
Abstract:Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1)Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2)Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3)a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.
[CV-70] EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
【速读】:该论文旨在解决超长第一人称视频(ultra-long egocentric videos)在长时间跨度下视频理解的难题,现有方法受限于片段化局部处理和有限的时间建模能力,难以进行跨日级的复杂时序推理。其解决方案的关键在于提出EgoGraph框架,该框架无需训练即可动态构建知识图谱,通过新颖的第一人称schema统一提取与抽象人物、物体、地点和事件等核心实体,并结构化地建模其属性与交互关系;同时设计了时间关联建模策略,能够捕捉跨实体的时间依赖性并积累多日稳定的长期记忆,从而实现对超长视频序列的深度语义理解和复杂时序推理。
链接: https://arxiv.org/abs/2602.23709
作者: Shitong Sun,Ke Han,Yukai Huang,Weitong Cai,Jifei Song
机构: Queen Mary University of London (伦敦玛丽女王大学); University of Trento (特伦托大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Ultra-long egocentric videos spanning multiple days present significant challenges for video understanding. Existing approaches still rely on fragmented local processing and limited temporal modeling, restricting their ability to reason over such extended sequences. To address these limitations, we introduce EgoGraph, a training-free and dynamic knowledge-graph construction framework that explicitly encodes long-term, cross-entity dependencies in egocentric video streams. EgoGraph employs a novel egocentric schema that unifies the extraction and abstraction of core entities, such as people, objects, locations, and events, and structurally reasons about their attributes and interactions, yielding a significantly richer and more coherent semantic representation than traditional clip-based video models. Crucially, we develop a temporal relational modeling strategy that captures temporal dependencies across entities and accumulates stable long-term memory over multiple days, enabling complex temporal reasoning. Extensive experiments on the EgoLifeQA and EgoR1-bench benchmarks demonstrate that EgoGraph achieves state-of-the-art performance on long-term video question answering, validating its effectiveness as a new paradigm for ultra-long egocentric video understanding.
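EgoGraph 以人物、物体、地点、事件等实体及其时间关联构建知识图谱,并跨多日积累长期记忆。下面用标准库给出一个极简的时序四元组图谱示意(schema、实体名与查询接口均为假设,非论文官方实现):

```python
from dataclasses import dataclass, field

@dataclass
class TemporalKG:
    """以 (主体, 关系, 客体, 天数) 四元组累积跨天记忆的极简时序知识图谱。"""
    edges: list = field(default_factory=list)

    def add(self, subj, rel, obj, day):
        self.edges.append((subj, rel, obj, day))

    def query(self, subj=None, rel=None, max_day=None):
        """按实体/关系检索,可限定时间上界以支持"截至某天"的时序推理。"""
        return [e for e in self.edges
                if (subj is None or e[0] == subj)
                and (rel is None or e[1] == rel)
                and (max_day is None or e[3] <= max_day)]

kg = TemporalKG()
kg.add("camera_wearer", "used", "laptop", day=1)
kg.add("camera_wearer", "met", "Alice", day=2)
kg.add("Alice", "located_in", "kitchen", day=2)
hits = kg.query(subj="camera_wearer", max_day=1)
```

真实系统还需从视频流中抽取实体与关系并做实体消歧;此处仅示意"结构化四元组 + 时间约束查询"如何支撑跨天问答。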
[CV-71] A Reliable Indoor Navigation System for Humans Using AR-based Technique
【速读】:该论文旨在解决室内导航系统在校园及小型区域中缺乏可靠性和易用性的问题,传统依赖静态标识或平面地图的方式存在误导性强、效率低等缺陷。解决方案的关键在于将增强现实(AR)技术与路径规划算法相结合:利用Vuforia Area Target实现环境建模,通过AI导航中的NavMesh组件进行路径规划,并采用A*算法替代Dijkstra算法以提升计算效率——在小规模搜索空间中比Dijkstra快2–3倍,同时支持实时动态路径更新,从而显著提高导航精度和用户体验,验证了AR与现有路径规划算法融合的可行性与可扩展性。
链接: https://arxiv.org/abs/2602.23706
作者: Vijay U.Rathod,Manav S.Sharma,Shambhavi Verma,Aadi Joshi,Sachin Aage,Sujal Shahane
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 6 figures, 2 tables, Presented at 7th International Conference on Advances in Science and Technology (ICAST 2024-25)
Abstract:Reliable navigation systems are not available indoors, such as in campuses and small areas. Users must depend on confusing, time-consuming static signage or floor maps. In this paper, an AR-based technique has been applied to campus and small-site navigation, where Vuforia Area Target is used for environment modeling. AI navigation’s NavMesh component is used for navigation purposes, and the A* algorithm is used within this component for shortest path calculation. Compared to Dijkstra’s algorithm, it can reach a solution about two to three times faster for smaller search spaces. In many cases, Dijkstra’s algorithm has difficulty performing well in high-complexity environments where memory usage grows and processing times increase. Compared to older approaches such as GPS, real-time processing and AR overlays can be combined to provide intuitive directions for users while dynamically updating the path in response to environmental changes. Experimental results indicate significantly improved navigation accuracy, better user experience, and greater efficiency compared to traditional methods. These results show that AR technology integrated with existing pathfinding algorithms is feasible and scalable, making it a user-friendly solution for indoor navigation. Although highly effective in limited and defined indoor spaces, further optimization of NavMesh is required for large or highly dynamic environments.
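上文称 A* 在小搜索空间中比 Dijkstra 快约二到三倍,原因是启发式函数引导搜索朝目标方向扩展。下面用标准库 heapq 给出网格 A* 寻路的极简示意(曼哈顿启发式;网格与坐标为玩具示例,并非该导航系统的实际 NavMesh 实现):

```python
import heapq

def astar(grid, start, goal):
    """网格 A* 最短路径:0 表示可通行,1 表示障碍;返回坐标序列或 None。"""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # 曼哈顿启发式(可采纳)
    open_set = [(h(start), 0, start, [start])]  # (f = g + h, g, 当前点, 路径)
    seen = set()
    while open_set:
        _, g, cur, path = heapq.heappop(open_set)
        if cur == goal:
            return path
        if cur in seen:
            continue
        seen.add(cur)
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None

floor = [[0, 0, 0],
         [1, 1, 0],
         [0, 0, 0]]
route = astar(floor, (0, 0), (2, 0))
```

取启发式恒为 0 即退化为 Dijkstra;由于曼哈顿启发式不高估剩余代价,A* 在保证最优路径的同时扩展的节点更少。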
[CV-72] Towards Source-Aware Object Swapping with Initial Noise Perturbation CVPR2026
【速读】:该论文旨在解决对象交换(object swapping)任务中面临的三大挑战:保持目标对象的保真度(object fidelity)、场景保真度(scene fidelity)以及对象与场景之间的和谐性(object-scene harmony)。现有方法通常依赖于每个对象的微调(per-object fine-tuning)或额外的成对数据,而这些数据往往仅在不同场景中呈现相同对象,导致模型过度依赖背景线索而非学习跨对象对齐能力。解决方案的关键在于提出一种自监督且源感知(source-aware)的框架 SourceSwap,其核心创新是通过在初始噪声空间中进行频率分离扰动(frequency-separated perturbation)来合成高质量伪配对样本,从而在不依赖视频、多视角图像或额外图像的情况下,保留姿态、粗略形状和场景布局的同时改变外观;此外,该方法采用带全源条件控制的双 U-Net 结构和无噪声参考编码器,实现直接的对象间对齐、零样本推理(zero-shot inference)及轻量级迭代优化,显著提升了对象交换的质量与泛化能力。
链接: https://arxiv.org/abs/2602.23697
作者: Jiahui Zhan,Xianbing Sun,Xiangnan Zhu,Yikun Ji,Ruitong Liu,Liqing Zhang,Jianfu Zhang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by CVPR 2026 Findings
Abstract:Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.
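SourceSwap 的关键是在初始噪声空间做频率分离扰动:保留低频以维持姿态、粗形状与布局,替换高频以改变外观。下面用 FFT 给出该思路的极简示意(截止频率与混合强度均为假设,非论文官方实现):

```python
import numpy as np

def frequency_separated_perturb(noise, cutoff=0.25, strength=1.0, seed=0):
    """仅扰动初始噪声的高频分量:低频保留,高频与新采样噪声混合。"""
    rng = np.random.default_rng(seed)
    F = np.fft.fftshift(np.fft.fft2(noise))
    h, w = noise.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    low = np.sqrt((yy / h) ** 2 + (xx / w) ** 2) <= cutoff   # 低频掩码(归一化频率半径)
    F_new = np.fft.fftshift(np.fft.fft2(rng.standard_normal(noise.shape)))
    F_mix = np.where(low, F, (1 - strength) * F + strength * F_new)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F_mix)))

z = np.random.default_rng(1).standard_normal((16, 16))
z_pert = frequency_separated_perturb(z)
```

由于低频系数(含直流分量)被完整保留,扰动后的噪声与原噪声共享粗尺度结构,从而生成"同姿态、异外观"的伪配对样本。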
[CV-73] Any Model Any Place Any Time: Get Remote Sensing Foundation Model Embeddings On Demand
【速读】:该论文旨在解决遥感领域基础模型(foundation models)在实际应用中因模型发布格式、平台接口和输入数据规范不一致而导致的嵌入表示(embeddings)获取、使用与基准测试成本高昂的问题。解决方案的关键在于提出 rs-embed,一个统一的 Python 库,采用以感兴趣区域(region of interest, ROI)为中心的接口设计,用户仅需一行代码即可从任意支持的模型中获取指定地理位置和时间范围内的嵌入表示,并支持高效批量处理以实现大规模嵌入生成与评估。
链接: https://arxiv.org/abs/2602.23678
作者: Dingqi Ye,Daniel Kiv,Wei Hu,Jimeng Shi,Shaowen Wang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The remote sensing community is witnessing a rapid growth of foundation models, which provide powerful embeddings for a wide range of downstream tasks. However, practical adoption and fair comparison remain challenging due to substantial heterogeneity in model release formats, platforms and interfaces, and input data specifications. These inconsistencies significantly increase the cost of obtaining, using, and benchmarking embeddings across models. To address this issue, we propose rs-embed, a Python library that offers a unified, region-of-interest (ROI)-centric interface: with a single line of code, users can retrieve embeddings from any supported model for any location and any time range. The library also provides efficient batch processing to enable large-scale embedding generation and evaluation. The code is available at: this https URL
[CV-74] Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation
【速读】:该论文旨在解决细粒度杂草-作物分割模型在异质农业环境中泛化能力不足的问题,其核心挑战在于现有深度学习模型依赖数据集特定的视觉特征,难以适应不同作物类型、杂草种类、生长阶段及传感条件下的实际应用场景。解决方案的关键在于提出视觉-语言杂草分割(Vision-Language Weed Segmentation, VL-WS)框架,通过将像素级分割任务与语义对齐的域不变表示相结合,利用冻结的对比语言-图像预训练(CLIP)嵌入与任务特定的空间特征融合,并借助特征逐通道线性调制(FiLM)层根据自然语言描述动态调节特征通道,从而实现文本引导的通道级特征优化同时保持精细的空间定位能力。该方法在包含近距离地面图像和高空无人机图像的统一语料库上训练,显著提升了跨域泛化性能与标签效率。
链接: https://arxiv.org/abs/2602.23677
作者: Nazia Hossain,Xintong Jiang,Yu Tian,Philippe Seguin,O. Grant Clark,Shangpeng Sun
机构: McGill University (麦吉尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-grained crop-weed segmentation is essential for enabling targeted herbicide application in precision agriculture. However, existing deep learning models struggle to generalize across heterogeneous agricultural environments due to reliance on dataset-specific visual features. We propose Vision-Language Weed Segmentation (VL-WS), a novel framework that addresses this limitation by grounding pixel-level segmentation in semantically aligned, domain-invariant representations. Our architecture employs a dual-encoder design, where frozen Contrastive Language-Image Pretraining (CLIP) embeddings and task-specific spatial features are fused and modulated via Feature-wise Linear Modulation (FiLM) layers conditioned on natural language captions. This design enables image level textual descriptions to guide channel-wise feature refinement while preserving fine-grained spatial localization. Unlike prior works restricted to training and evaluation on single-source datasets, VL-WS is trained on a unified corpus that includes close-range ground imagery (robotic platforms) and high-altitude UAV imagery, covering diverse crop types, weed species, growth stages, and sensing conditions. Experimental results across four benchmark datasets demonstrate the effectiveness of our framework, with VL-WS achieving a mean Dice score of 91.64% and outperforming the CNN baseline by 4.98%. The largest gains occur on the most challenging weed class, where VL-WS attains 80.45% Dice score compared to 65.03% for the best baseline, representing a 15.42% improvement. VL-WS further maintains stable weed segmentation performance under limited target-domain supervision, indicating improved generalization and data efficiency. These findings highlight the potential of vision-language alignment to enable scalable, label-efficient segmentation models deployable across diverse real-world agricultural domains.
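VL-WS 用 FiLM 层按图像描述的文本嵌入对视觉特征做逐通道仿射调制。FiLM 本身的计算可示意如下(矩阵、维度与数值均为随机假设,仅说明"文本生成 gamma/beta → 逐通道调制"的机制):

```python
import numpy as np

def film(features, text_emb, W_gamma, W_beta):
    """FiLM 逐通道线性调制。
    features: (C, H, W) 视觉特征;text_emb: (D,) 文本嵌入;
    W_gamma / W_beta: (C, D) 将文本嵌入映射为每通道缩放与偏移。"""
    gamma = W_gamma @ text_emb              # (C,) 每通道缩放
    beta = W_beta @ text_emb                # (C,) 每通道偏移
    return gamma[:, None, None] * features + beta[:, None, None]

C, D, H, W = 4, 6, 8, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
t = rng.standard_normal(D)
out = film(x, t, rng.standard_normal((C, D)) * 0.1, rng.standard_normal((C, D)) * 0.1)
```

调制只作用于通道维而不触碰空间维,这正是摘要所述"文本引导通道级特征优化、同时保留精细空间定位"的原因。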
[CV-75] Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering
【速读】:该论文旨在解决医学影像报告自动生成中因模型对历史影像的错误比较而引发的“先验对比幻觉”(prior-comparison hallucination)问题,即模型在生成当前影像报告时错误地引入了与当前图像无关的历史病灶描述。其解决方案的关键在于提出了一种无需训练、仅在推理阶段生效的控制框架——语义解耦潜在空间引导(Semantically Decoupled Latent Steering, SDLS),该方法通过大语言模型(LLM)驱动的语义分解结合QR正交化操作构建一个语义无关的干预向量,从而精准定位并抑制“历史对比”维度,避免传统主成分分析(PCA)方向中存在的临床语义混杂问题,确保干预仅作用于目标轴向,显著提升报告准确性与真实性。
链接: https://arxiv.org/abs/2602.23676
作者: Ao Li,Rui Liu,Mingjie Li,Sheng Liu,Lei Wang,Xiaodan Liang,Lina Yao,Xiaojun Chang,Lei Xing
机构: University of New South Wales (新南威尔士大学); Australian Artificial Intelligence Institute (澳大利亚人工智能研究院); University of Technology Sydney (悉尼科技大学); Stanford University (斯坦福大学); School of Computing and Information Technology of University of Wollongong Australia (澳大利亚伍伦贡大学计算机与信息技术学院); Sun Yat-sen University (中山大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5 figures
Abstract:Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by QR -based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the ``historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR, and zero-shot transfer evaluation on CheXpert Plus and IU-Xray, demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.
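SDLS 的关键步骤是用 QR 正交化把原始干预向量中落在语义子空间内的分量滤除,只保留目标轴向。该投影可示意如下(向量与语义方向均为随机假设,非论文官方实现):

```python
import numpy as np

def semantic_free_steering(raw_direction, semantic_dirs):
    """从原始干预向量中滤除语义子空间分量。
    raw_direction: (D,);semantic_dirs: (K, D) 需排除的语义方向。"""
    Q, _ = np.linalg.qr(semantic_dirs.T)       # 语义子空间的正交基 (D, K)
    projection = Q @ (Q.T @ raw_direction)     # 落在语义子空间内的分量
    steer = raw_direction - projection         # 与语义子空间正交的残差
    return steer / (np.linalg.norm(steer) + 1e-8)

rng = np.random.default_rng(0)
v = rng.standard_normal(16)        # 假设:PCA 得到的原始"历史对比"方向
S = rng.standard_normal((3, 16))   # 假设:LLM 语义分解得到的需滤除方向
v_clean = semantic_free_steering(v, S)
```

正交化后 v_clean 与所有语义方向的内积为零,因而推理时沿该向量干预激活不会同时移动被纠缠的临床语义。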
[CV-76] ProtoDCS: Towards Robust and Efficient Open-Set Test-Time Adaptation for Vision-Language Models
【速读】:该论文旨在解决大规模视觉-语言模型(Vision-Language Models, VLMs)在真实场景部署中因分布偏移(distribution shifts)导致性能下降的问题,尤其聚焦于开放集测试时适应(Open-Set Test-Time Adaptation, OSTTA)场景下的挑战:即如何在测试流同时包含协变量偏移的已知类(covariate-shifted in-distribution, csID)和未知类(covariate-shifted out-of-distribution, csOOD)数据时,准确区分并安全适应csID样本,同时避免对csOOD样本产生错误分类。现有方法依赖硬阈值分离与熵最小化策略,存在误判模糊样本、过度自信预测及计算开销大的问题。其解决方案的关键在于提出原型驱动的双检分离机制(Prototype-based Double-Check Separation, ProtoDCS),通过概率性高斯混合模型(GMM)验证替代脆弱的阈值判断,并引入基于证据的自适应策略——利用不确定性感知损失函数和高效的原型级更新机制,在提升已知类准确率的同时显著增强OOD检测能力,且大幅降低计算复杂度。
链接: https://arxiv.org/abs/2602.23653
作者: Wei Luo,Yangfan Ou,Jin Deng,Zeshuai Deng,Xiquan Yan,Zhiquan Wen,Mingkui Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, under review
Abstract:Large-scale Vision-Language Models (VLMs) exhibit strong zero-shot recognition, yet their real-world deployment is challenged by distribution shifts. While Test-Time Adaptation (TTA) can mitigate this, existing VLM-based TTA methods operate under a closed-set assumption, failing in open-set scenarios where test streams contain both covariate-shifted in-distribution (csID) and out-of-distribution (csOOD) data. This leads to a critical difficulty: the model must discriminate unknown csOOD samples to avoid interference while simultaneously adapting to known csID classes for accuracy. Current open-set TTA (OSTTA) methods rely on hard thresholds for separation and entropy minimization for adaptation. These strategies are brittle, often misclassifying ambiguous csOOD samples and inducing overconfident predictions, and their parameter-update mechanism is computationally prohibitive for VLMs. To address these limitations, we propose Prototype-based Double-Check Separation (ProtoDCS), a robust framework for OSTTA that effectively separates csID and csOOD samples, enabling safe and efficient adaptation of VLMs to csID data. Our main contributions are: (1) a novel double-check separation mechanism employing probabilistic Gaussian Mixture Model (GMM) verification to replace brittle thresholding; and (2) an evidence-driven adaptation strategy utilizing uncertainty-aware loss and efficient prototype-level updates, mitigating overconfidence and reducing computational overhead. Extensive experiments on CIFAR-10/100-C and Tiny-ImageNet-C demonstrate that ProtoDCS achieves state-of-the-art performance, significantly boosting both known-class accuracy and OOD detection metrics. Code will be available at this https URL.
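ProtoDCS 的双检分离先用原型相似度初筛,再用概率模型复核,两关均过才按 csID 适应。下面以各向同性单高斯近似论文中的 GMM 复核,给出一个极简示意(阈值与玩具数据均为假设,非论文官方实现):

```python
import numpy as np

def double_check(x, prototypes, tau_cos=0.5, tau_ll=-1.0):
    """原型双检分离(示意):
    第一检:与最近原型的余弦相似度须超过 tau_cos;
    第二检:该类各向同性高斯对数似然须超过 tau_ll(近似 GMM 复核)。"""
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    xn = x / (np.linalg.norm(x) + 1e-8)
    cos = p @ xn
    k = int(np.argmax(cos))
    if cos[k] < tau_cos:
        return "csOOD", k
    diff = x - prototypes[k]
    log_lik = -0.5 * float(diff @ diff)   # 单位协方差高斯,略去常数项
    return ("csID", k) if log_lik > tau_ll else ("csOOD", k)

protos = np.eye(3) * 2.0                  # 3 个已知类的原型(玩具数据)
label_id, cls = double_check(np.array([1.9, 0.1, 0.0]), protos)
label_ood, _ = double_check(np.array([1.0, 1.0, 1.0]), protos)
```

第二检的意义在于:方向上与某原型相近、但幅度/分布明显偏离的模糊样本,会被概率复核拦下,避免单一硬阈值的误放行。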
[CV-77] 3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection
【速读】:该论文旨在解决将视觉语言模型(Vision-Language Models, VLMs)应用于多器官三维磁共振成像(3D MRI)时面临的两大核心挑战:一是模态特异性视觉-语言对齐问题,二是跨模态特征融合问题。解决方案的关键在于提出MedMAP框架,该框架包含两个阶段:第一阶段为模态感知的视觉-语言对齐阶段,通过模态感知编码器隐式建模联合模态分布,增强视觉与文本表征之间的对齐;第二阶段为微调阶段,在保持文本编码器冻结的前提下,仅微调预训练的视觉编码器以适配下游多器官异常检测任务。该方法在自建数据集MedMoM-MRI3D上验证了其有效性,显著优于现有VLMs在3D MRI多器官异常检测中的性能表现。
链接: https://arxiv.org/abs/2602.23652
作者: Haowen Zhu,Ning Yin,Xiaogen Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) show strong potential for complex diagnostic tasks in medical imaging. However, applying VLMs to multi-organ medical imaging introduces two principal challenges: (1) modality-specific vision-language alignment and (2) cross-modal feature fusion. In this work, we propose MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language representation learning in 3D MRI. MedMAP comprises a modality-aware vision-language alignment stage and a fine-tuning stage for multi-organ abnormality detection. During the pre-training stage, the modality-aware encoders implicitly capture the joint modality distribution and improve alignment between visual and textual representations. We then fine-tune the pre-trained vision encoders (while keeping the text encoder frozen) for downstream tasks. To this end, we curated MedMoM-MRI3D, comprising 7,392 3D MRI volume-report pairs spanning twelve MRI modalities and nine abnormalities tailored for various 3D medical analysis tasks. Extensive experiments on MedMoM-MRI3D demonstrate that MedMAP significantly outperforms existing VLMs in 3D MRI-based multi-organ abnormality detection. Our code is available at this https URL.
[CV-78] BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds CVPR2026
[Quick Read]: This paper tackles structured 3D building reconstruction from point clouds with diverse, often sparse or noisy distributions (such as those from airborne LiDAR and Structure-from-Motion), aiming to recover artist-style building abstractions in a highly underconstrained setting. The key is the Loosely Cascaded Diffusion Transformer (Loca-DiT) at the core of the BuildAnyPoint framework, which works in two stages: a conditional latent diffusion model first recovers the underlying point distribution from noisy or sparse inputs, and a decoder-only transformer then performs conditional autoregressive mesh generation on the recovered points. Coupling explicit 3D generative priors with point-cloud conditioning in this way yields substantial gains in reconstruction quality and surface accuracy.
Link: https://arxiv.org/abs/2602.23645
Authors: Tongyan Hua, Haoran Gong, Yuan Liu, Di Wang, Ying-Cong Chen, Wufan Zhao
Affiliations: HKUST(GZ); Xi'an Jiaotong University; HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:We introduce BuildAnyPoint, a novel generative framework for structured 3D building reconstruction from point clouds with diverse distributions, such as those captured by airborne LiDAR and Structure-from-Motion. To recover artist-created building abstraction in this highly underconstrained setting, we capitalize on the role of explicit 3D generative priors in autoregressive mesh generation. Specifically, we design a Loosely Cascaded Diffusion Transformer (Loca-DiT) that initially recovers the underlying distribution from noisy or sparse points, followed by autoregressively encapsulating them into compact meshes. We first formulate distribution recovery as a conditional generation task by training latent diffusion models conditioned on input point clouds, and then tailor a decoder-only transformer for conditional autoregressive mesh generation based on the recovered point clouds. Our method delivers substantial qualitative and quantitative improvements over prior building abstraction methods. Furthermore, the effectiveness of our approach is evidenced by the strong performance of its recovered point clouds on building point cloud completion benchmarks, which exhibit improved surface accuracy and distribution uniformity.
[CV-79] DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
[Quick Read]: This paper addresses the underexplored small-scale object editing ability of instruction-based image editing models (IIEMs), in particular their weakness in precise local editing and detail refinement. The key contribution is DeepLookEditBench (DLEBench), the first benchmark dedicated to evaluating small-scale object editing: it contains 1,889 samples across seven instruction types, with target objects occupying only 1%-10% of the image area and covering complex scenarios such as partial occlusion and multi-object editing. The authors also design an evaluation protocol with refined scoring rubrics and a dual-mode evaluation framework (Tool-driven and Oracle-guided modes) that reduces the misalignment between LMM-as-a-Judge and human judgments, enabling more objective and reliable assessment.
Link: https://arxiv.org/abs/2602.23622
Authors: Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
[CV-80] Egocentric Visibility-Aware Human Pose Estimation
[Quick Read]: This paper targets the accuracy degradation in egocentric human pose estimation (HPE) caused by keypoint occlusion. Existing egocentric HPE datasets lack visibility annotations, and existing methods treat visible and invisible keypoints indiscriminately during estimation, which compromises the prediction of visible keypoints. The key contributions are: (1) Eva-3M, a large-scale egocentric dataset with visibility annotations (over 3.0M frames, 435K of them labeled with keypoint visibility), plus visibility annotations added to the existing EMHI dataset to further support visibility-aware research; and (2) EvaPose, a method that explicitly incorporates visibility information as a prior during training and inference, significantly improving egocentric pose estimation accuracy.
Link: https://arxiv.org/abs/2602.23618
Authors: Peng Dai, Yu Zhang, Yiqiang Feng, Zhen Fan, Yang Zhang
Affiliations: ByteDance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference on Computer Vision and Pattern Recognition 2026
Abstract:Egocentric human pose estimation (HPE) using a head-mounted device is crucial for various VR and AR applications, but it faces significant challenges due to keypoint invisibility. Nevertheless, none of the existing egocentric HPE datasets provide keypoint visibility annotations, and the existing methods often overlook the invisibility problem, treating visible and invisible keypoints indiscriminately during estimation. As a result, their capacity to accurately predict visible keypoints is compromised. In this paper, we first present Eva-3M, a large-scale egocentric visibility-aware HPE dataset comprising over 3.0M frames, with 435K of them annotated with keypoint visibility labels. Additionally, we augment the existing EMHI dataset with keypoint visibility annotations to further facilitate the research in this direction. Furthermore, we propose EvaPose, a novel egocentric visibility-aware HPE method that explicitly incorporates visibility information to enhance pose estimation accuracy. Extensive experiments validate the significant value of ground-truth visibility labels in egocentric HPE settings, and demonstrate that our EvaPose achieves state-of-the-art performance in both Eva-3M and EMHI datasets.
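As a rough illustration of how visibility labels can be used during training, a masked keypoint loss can down-weight invisible joints so that they no longer dominate the objective. The L2 error and the `invis_weight` value below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def visibility_aware_loss(pred, gt, vis, invis_weight=0.1):
    """Per-keypoint L2 loss where invisible keypoints are down-weighted.

    pred, gt : (K, 2) predicted / ground-truth 2-D keypoints
    vis      : (K,) boolean visibility labels
    The exact weighting used by EvaPose is not specified here; this only
    illustrates treating visibility as a prior in the loss.
    """
    err = np.linalg.norm(pred - gt, axis=-1)   # (K,) per-joint error
    w = np.where(vis, 1.0, invis_weight)       # down-weight occluded joints
    return float((w * err).sum() / w.sum())

gt = np.zeros((4, 2))
pred = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 0.0], [0.0, 1.0]])
vis = np.array([True, True, False, False])
loss = visibility_aware_loss(pred, gt, vis)
```

With the two large-error joints marked invisible, the weighted loss is dominated by the well-predicted visible joints rather than the occluded ones.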
[CV-81] Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning
[Quick Read]: This paper tackles the redundancy and inefficiency that large multimodal models (LMMs) face with high-resolution visual inputs, where the number of image tokens grows quadratically with resolution, and asks how to strengthen key-region grounding without relying on human-annotated visual supervision. The key is HART (High-resolution Annotation-free Reasoning Technique), a closed-loop framework whose core is a post-training strategy, Advantage Preference Group Relative Policy Optimization (AP-GRPO), that encourages the model to self-verify the accuracy of its key-region localization. This improves grounding without external annotations while providing explainable reasoning pathways and efficient optimization.
Link: https://arxiv.org/abs/2602.23615
Authors: Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao
Affiliations: Nanjing University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model’s grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines. When applied to post-train Qwen2.5-VL-7B, HART even surpasses larger-scale models such as Qwen2.5-VL-72B and LLaVA-OneVision-72B on high-resolution, vision-centric benchmarks.
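AP-GRPO builds on GRPO-style group-relative advantages, which normalize each sampled response's reward against its own group instead of a learned value network. A minimal sketch of that standard normalization (without the paper's preference term, which is not specified here):

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-8):
    """Group-relative advantage as used in GRPO-style post-training:
    each sampled response's reward is normalized by the mean and std
    of its own group, so no value network is needed.
    (AP-GRPO adds an advantage-preference component on top, not
    reproduced here.)"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# four rollouts for one prompt: two correct (reward 1), two wrong (reward 0)
adv = group_relative_advantage([1.0, 0.0, 1.0, 0.0])
```

Responses better than their group mean get positive advantage and are reinforced; worse-than-average ones are pushed down, and the advantages sum to zero within the group.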
[CV-82] Extended Reality (XR): The Next Frontier in Education
[Quick Read]: This paper examines the challenges facing widespread adoption of extended reality (XR) in education, including high implementation costs, technical complexity, and ethical concerns around student privacy and biometric data protection. The key lies in balancing innovation with accessibility and ethics: complying with regulations such as GDPR and FERPA and establishing sound cybersecurity frameworks to safeguard sensitive learner data, thereby supporting the sustainable development of XR-based education.
Link: https://arxiv.org/abs/2602.23601
Authors: Shadeeb Hossain
Affiliations: Shadeeb Engineering Lab
Subjects: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Extended Reality (XR), encompassing Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR), is revolutionizing education by creating immersive, interactive learning environments. This article explores the potential of XR to enhance student engagement, experiential learning, and skill development while addressing the challenges of widespread adoption. High implementation costs, technical complexities, and ethical concerns especially regarding student privacy and biometric data protection still possess significant barriers to integration. The discussion also highlights regulatory compliance with GDPR and FERPA and the importance of cybersecurity frameworks to safeguard sensitive learner data. Ultimately, the article provides insights into balancing innovation with accessibility and ethics in the evolution of XR based education
[CV-83] Incremental dimension reduction for efficient and accurate visual anomaly detection
[Quick Read]: This paper addresses the compute and memory bottleneck that high feature dimensionality imposes on visual anomaly detection over large image collections. The key is an incremental dimension reduction algorithm that processes feature vectors in batches: at each batch it updates a running truncated singular value decomposition (TSVD) of all visited vectors and reduces the batch with its own singular values and vectors to keep memory overhead low; after all batches are processed, the batch-wise reduced representations are re-transformed into the space spanned by the global singular vectors. This yields efficient feature compression at near-original accuracy and accelerates the training of state-of-the-art anomaly detection models.
Link: https://arxiv.org/abs/2602.23595
Authors: Teng-Yok Lee
Affiliations: Mitsubishi Electric Corporation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While nowadays visual anomaly detection algorithms use deep neural networks to extract salient features from images, the high dimensionality of extracted features makes it difficult to apply those algorithms to large data with 1000s of images. To address this issue, we present an incremental dimension reduction algorithm to reduce the extracted features. While our algorithm essentially computes truncated singular value decomposition of these features, other than processing all vectors at once, our algorithm groups the vectors into batches. At each batch, our algorithm updates the truncated singular values and vectors that represent all visited vectors, and reduces each batch by its own singular values and vectors so they can be stored in the memory with low overhead. After processing all batches, we re-transform these batch-wise singular vectors to the space spanned by the singular vectors of all features. We show that our algorithm can accelerate the training of state-of-the-art anomaly detection algorithm with close accuracy.
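The batch-wise TSVD update the abstract describes can be sketched as follows. This is one standard formulation (re-truncating the previous summary `S @ Vt` stacked on each new batch); the paper's exact update rule may differ:

```python
import numpy as np

def incremental_tsvd(batches, k):
    """Maintain a rank-k truncated SVD over a stream of row batches.

    After each batch, S @ Vt summarizes all rows seen so far: the
    previous summary is stacked on top of the new batch and the SVD is
    re-truncated to rank k. Memory stays O(k * d) instead of O(n * d).
    """
    SVt = None
    for X in batches:
        stacked = X if SVt is None else np.vstack([SVt, X])
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        s, vt = s[:k], vt[:k]
        SVt = s[:, None] * vt   # rank-k summary of everything so far
    return s, vt                # top-k singular values / right vectors

rng = np.random.default_rng(0)
# exactly rank-3 data in 20 dimensions: the rank-3 summary is lossless
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20))
batches = np.array_split(X, 6)
s_inc, vt_inc = incremental_tsvd(batches, k=3)
s_full = np.linalg.svd(X, full_matrices=False)[1][:3]
```

For data whose true rank is at most `k`, the streaming result matches the one-shot SVD up to numerical error; for higher-rank data the truncation introduces a controlled approximation, which is the accuracy trade-off the abstract refers to.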
[CV-84] Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models
[Quick Read]: This paper targets the weakness of multimodal models such as CLIP in domains where small visual differences carry large semantic significance, notably diagram understanding, owing to their limited sensitivity to fine-grained structural variations. The key is a new training paradigm that introduces pseudo contrastive samples produced by a diagram renderer, which synthesizes structurally different diagrams from randomly picked text elements, so the model learns precise and semantically consistent diagram structures without any modification of the original data. On a flowchart benchmark, the method substantially outperforms standard CLIP and hard-negative CLIP training on both image-text matching and visual question answering, underscoring the value of domain-specific training strategies for diagram comprehension.
Link: https://arxiv.org/abs/2602.23589
Authors: Hiroshi Sasaki
Affiliations: The Japan Research Institute, Limited
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures
Abstract:Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.
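One plausible way to fold renderer-generated pseudo samples into a CLIP-style objective is to append them as extra negatives in the InfoNCE candidate pool. The text-to-image direction, the temperature value, and this exact formulation are assumptions for illustration, not the paper's stated objective:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_to_image_nce(txt, img, pseudo_img, tau=0.07):
    """Text->image InfoNCE where renderer-generated pseudo diagrams are
    appended as extra negatives for every caption.
    txt, img   : (B, D) paired embeddings; row i of txt matches row i of img
    pseudo_img : (M, D) embeddings of synthetic structural variants
    """
    txt, img, pseudo_img = l2norm(txt), l2norm(img), l2norm(pseudo_img)
    cand = np.vstack([img, pseudo_img])           # (B+M, D) candidate pool
    logits = txt @ cand.T / tau                   # (B, B+M) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(txt))                     # correct image i is column i
    return float(-logp[idx, idx].mean())

rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 16))
img = txt + 0.05 * rng.normal(size=(4, 16))       # matched pairs lie close
pseudo = rng.normal(size=(8, 16))                 # structural negatives
loss_with = text_to_image_nce(txt, img, pseudo)
loss_without = text_to_image_nce(txt, img, np.empty((0, 16)))
```

Adding pseudo negatives strictly enlarges the softmax denominator, so the model is pressed to separate the true diagram from its structural variants rather than only from other batch images.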
[CV-85] Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
[Quick Read]: This paper asks whether cross-modal alignment between unimodal foundation models (vision and language) can be achieved without computationally intensive end-to-end fine-tuning, which is resource-hungry and can perturb pretrained representations. The key is HDFLIM, which projects the embeddings of fully frozen vision and language models into a shared hyperdimensional space and uses lightweight symbolic operations (binding, bundling, and similarity-based retrieval) to build associative cross-modal representations in a single pass over the data, with no parameter updates. Caption generation then emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization, achieving performance comparable to end-to-end vision-language training while producing captions that are more semantically grounded than zero-shot baselines.
Link: https://arxiv.org/abs/2602.23588
Authors: Abhishek Dalvi, Vasant Honavar
Affiliations: The Pennsylvania State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages lightweight symbolic operations – binding, bundling, and similarity-based retrieval to construct associative cross-modal representations in a single pass over the data. Caption generation emerges from high-dimensional memory retrieval rather than iterative gradient-based optimization. We show that HDFLIM achieves performance comparable to end-to-end vision-language training methods and produces captions that are more semantically grounded than zero-shot baselines. By decoupling alignment from parameter tuning, our results suggest that semantic mapping across foundation models can be realized through symbolic operations on hyperdimensional encodings of the respective embeddings. More broadly, this work points toward an alternative paradigm for foundation model alignment in which frozen models are integrated through structured representational mappings rather than through large-scale retraining. The codebase for our implementation can be found at this https URL.
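The three symbolic operations the abstract names (binding, bundling, similarity-based retrieval) are standard hyperdimensional-computing primitives. A toy associative memory below shows how retrieval works with bipolar hypervectors; the concept names and the image-key scheme are invented for the sketch and have no connection to the paper's actual encoders:

```python
import numpy as np

D = 10_000                                  # hyperdimensional width
rng = np.random.default_rng(0)

def rand_hv():
    """Random bipolar hypervector in {-1, +1}^D."""
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    """Elementwise multiply: associates two hypervectors (self-inverse)."""
    return a * b

def bundle(vs):
    """Elementwise majority sign: superposes a set of hypervectors."""
    return np.sign(np.sum(vs, axis=0))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# toy cross-modal memory: one image key bound to two concept vectors
concepts = {name: rand_hv() for name in ["dog", "cat", "car"]}
image_key = rand_hv()
memory = bundle([bind(image_key, concepts["dog"]),
                 bind(image_key, concepts["cat"])])

# retrieval: unbind with the image key, then the nearest concept wins
query = bind(memory, image_key)             # binding is its own inverse
scores = {n: cosine(query, v) for n, v in concepts.items()}
best = max(scores, key=scores.get)
```

Because random hypervectors are nearly orthogonal in high dimensions, the bundled concepts ("dog", "cat") come back with high similarity while the unrelated concept ("car") stays near zero, which is the single-pass associative lookup the framework relies on instead of gradient descent.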
[CV-86] CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Birds-Eye-View Semantic Segmentation CVPR2026
[Quick Read]: This paper addresses the depth ambiguity and occlusion that plague the transformation from perspective-view (PV) to bird's-eye-view (BEV) space in autonomous driving, which limit existing view transformation (VT) models for BEV semantic segmentation. The key is CycleBEV, a regularization framework built on cycle consistency: an inverse view transformation (IVT) network maps BEV segmentation maps back to PV space, and the discrepancy between this mapping and the PV segmentation serves as a loss that regularizes VT training, enabling VT models to capture richer semantic and geometric information from the input PV images. Two further ideas extend cycle consistency into geometric and representation spaces to better exploit the IVT network. On the nuScenes dataset, the method consistently improves multiple VT models for BEV semantic segmentation without increasing inference complexity, since the IVT network is used only during training.
Link: https://arxiv.org/abs/2602.23575
Authors: Jeongbin Hong, Dooseop Choi, Taeg-Hyun An, Kyounghwan An, Kyoung-Wook Min
Affiliations: Electronics and Telecommunications Research Institute (ETRI); University of Science and Technology (UST)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2026
Abstract:Transforming image features from perspective view (PV) space to bird’s-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements – with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively – without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at this https URL.
[CV-87] Evidential Neural Radiance Fields
[Quick Read]: This paper addresses the lack of uncertainty estimation in neural radiance fields (NeRFs), which limits their trustworthy deployment in safety-critical settings. Existing methods struggle to quantify both data-driven aleatoric uncertainty and model-level epistemic uncertainty, and most either compromise rendering quality or incur significant computational overhead. The key is Evidential Neural Radiance Fields, which integrates probabilistic modeling seamlessly into the NeRF rendering process and quantifies both types of uncertainty directly from a single forward pass, preserving high rendering quality while delivering efficient and accurate uncertainty estimates.
Link: https://arxiv.org/abs/2602.23574
Authors: Ruxiao Duan, Alex Wong
Affiliations: Yale University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Understanding sources of uncertainty is fundamental to trustworthy three-dimensional scene modeling. While recent advances in neural radiance fields (NeRFs) achieve impressive accuracy in scene reconstruction and novel view synthesis, the lack of uncertainty estimation significantly limits their deployment in safety-critical settings. Existing uncertainty quantification methods for NeRFs fail to capture both aleatoric and epistemic uncertainty. Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. We compare multiple uncertainty quantification methods on three standardized benchmarks, where our approach demonstrates state-of-the-art scene reconstruction fidelity and uncertainty estimation quality.
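Deep evidential regression typically has the network predict Normal-Inverse-Gamma (NIG) parameters, from which both uncertainty types fall out in closed form in one forward pass. The formulas below follow the standard NIG decomposition; how this paper wires evidential outputs into volume rendering is model-specific and not shown:

```python
def nig_uncertainties(gamma, v, alpha, beta):
    """Decompose uncertainty from Normal-Inverse-Gamma evidential
    parameters (deep-evidential-regression style):
      aleatoric = E[sigma^2]  = beta / (alpha - 1)
      epistemic = Var[mu]     = beta / (v * (alpha - 1))
    gamma is the predicted mean (unused for the uncertainties);
    requires alpha > 1. Parameter values below are illustrative."""
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (v * (alpha - 1.0))
    return aleatoric, epistemic

# few observations (small evidence v) -> large epistemic uncertainty
alea_lo, epis_lo = nig_uncertainties(gamma=0.5, v=0.5, alpha=2.0, beta=0.1)
# much evidence -> epistemic uncertainty shrinks, aleatoric does not
alea_hi, epis_hi = nig_uncertainties(gamma=0.5, v=50.0, alpha=2.0, beta=0.1)
```

Note the asymmetry: gathering more evidence drives the epistemic term toward zero while the aleatoric term, which reflects noise inherent in the data, stays put.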
[CV-88] No Calibration No Depth No Problem: Cross-Sensor View Synthesis with 3D Consistency CVPR2026
[Quick Read]: This paper tackles a practical but widely overlooked prerequisite of cross-sensor view synthesis: obtaining aligned RGB-X data (RGB paired with heterogeneous sensors such as depth or infrared), which most prior RGB-X work simply assumes exists despite the huge engineering cost of sensor calibration. The key is a match-densify-consolidate framework: RGB-X image matching followed by guided point densification, with confidence-aware densification and self-matching filtering to improve view synthesis quality, and final consolidation in 3D Gaussian Splatting (3DGS). The method requires no 3D priors for the X sensor and assumes only nearly no-cost COLMAP for the RGB reconstruction, substantially reducing multi-sensor calibration effort and offering a scalable path to large-scale real-world RGB-X data collection and learning.
Link: https://arxiv.org/abs/2602.23559
Authors: Cho-Ying Wu, Zixun Huang, Xinyu Huang, Liu Ren
Affiliations: Bosch Research North America; Bosch Center for Artificial Intelligence (BCAI)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Main Conference. Project page: this https URL
Abstract:We present the first study of cross-sensor view synthesis across different modalities. We examine a practical, fundamental, yet widely overlooked problem: getting aligned RGB-X data, where most RGB-X prior work assumes such pairs exist and focuses on modality fusion, but it empirically requires huge engineering effort in calibration. We propose a match-densify-consolidate method. First, we perform RGB-X image matching followed by guided point densification. Using the proposed confidence-aware densification and self-matching filtering, we attain better view synthesis and later consolidate them in 3D Gaussian Splatting (3DGS). Our method uses no 3D priors for X-sensor and only assumes nearly no-cost COLMAP for RGB. We aim to remove the cumbersome calibration for various RGB-X sensors and advance the popularity of cross-sensor learning by a scalable solution that breaks through the bottleneck in large-scale real-world RGB-X data collection.
[CV-89] LE-NeuS: Latency-Efficient Neuro-Symbolic Video Understanding via Adaptive Temporal Verification
[Quick Read]: This paper addresses the prohibitive latency of neuro-symbolic approaches to long-form video question answering (LVQA): formal verification makes existing methods up to 90x slower than base vision-language model (VLM) prompting, rendering them impractical for latency-sensitive edge deployments. The key insight is that the dominant bottleneck is sequential, dense proposition detection across frames during automaton construction, which the proposed LE-NeuS attacks with two optimizations: (1) CLIP-guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. A theoretical analysis bounds latency as a function of video length, proposition complexity, and sampling density; on the LongVideoBench and Video-MME benchmarks, the method cuts the latency gap from 90x to roughly 10x while retaining 10% accuracy gains on temporally complex queries.
Link: https://arxiv.org/abs/2602.23553
Authors: Shawn Liang, Sahil Shah, Chengwei Zhou, SP Sharan, Harsh Goel, Arnab Sanyal, Sandeep Chinchali, Gourav Datta
Affiliations: Case Western Reserve University; The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review
Abstract:Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential and dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining 10% accuracy gains on temporally complex queries.
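The redundancy-skipping half of the sampler can be sketched as a greedy similarity filter over per-frame embeddings. The threshold value and the keep-first/keep-last rule are illustrative simplifications of the paper's two-stage scheme, not its exact algorithm:

```python
import numpy as np

def adaptive_sample(frame_embs, sim_thresh=0.95):
    """Greedy redundancy skipping: keep a frame only if its cosine
    similarity to the last *kept* frame drops below sim_thresh.
    frame_embs: (T, D) per-frame embeddings (e.g. from a CLIP encoder).
    The first and last frames are always kept so temporal boundaries
    survive. Returns indices of kept frames."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    kept = [0]
    for t in range(1, len(e)):
        if e[t] @ e[kept[-1]] < sim_thresh:   # visually new content
            kept.append(t)
    if kept[-1] != len(e) - 1:
        kept.append(len(e) - 1)
    return kept

# a toy "video": 10 near-identical frames, a scene cut, then 10 more
rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
frames = np.array([a + 0.01 * rng.normal(size=64) for _ in range(10)] +
                  [b + 0.01 * rng.normal(size=64) for _ in range(10)])
kept = adaptive_sample(frames)
```

On this toy stream, only the first frame, the scene-cut frame, and the final frame survive, so downstream proposition detection runs on 3 frames instead of 20.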
[CV-90] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
[Quick Read]: This paper addresses the limited scale and diversity of existing video scene graph datasets and the difficulty of modeling spatio-temporal relations among object trajectories in dynamic scenes. The key contributions are: (1) Synthetic Visual Genome 2 (SVG2), a large-scale, highly diverse panoptic video scene graph dataset built with a fully automated pipeline combining multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference; and (2) TRaSER, a model whose trajectory-aligned token arrangement, object-trajectory resampler, and temporal-window resampler convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass, preserving local motion while maintaining global object context. TRaSER markedly improves relation, attribute, and object prediction, and its explicit spatio-temporal scene graphs also boost video question answering when used as an intermediate representation.
Link: https://arxiv.org/abs/2602.23543
Authors: Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna
Affiliations: Allen Institute for AI; University of Washington; Woven by Toyota; Microsoft
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER’s generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL’s generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
[CV-91] V-MORALS: Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space
[Quick Read]: This paper targets three limitations of conventional reachability analysis for robotic systems: reliance on known system dynamics or large datasets to estimate accurate models, high computational cost, and the assumption of full state information. The key is V-MORALS (Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space), which learns a low-dimensional latent space from image-based trajectories alone and constructs well-defined Morse graphs in that space, so that regions of attraction (ROA) can be computed without full state knowledge. By learning the latent representation directly from visual sensor data and applying topological tools, the method delivers efficient and robust reachability analysis in perception-limited, real-world settings.
Link: https://arxiv.org/abs/2602.23524
Authors: Faiz Aladin, Ashwin Balasubramanian, Lars Lindemann, Daniel Seita
Affiliations: University of Southern California; ETH Zürich
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Reachability analysis has become increasingly important in robotics to distinguish safe from unsafe states. Unfortunately, existing reachability and safety analysis methods often fall short, as they typically require known system dynamics or large datasets to estimate accurate system models, are computationally expensive, and assume full state information. A recent method, called MORALS, aims to address these shortcomings by using topological tools to estimate Regions of Attraction (ROA) in a low-dimensional latent space. However, MORALS still relies on full state knowledge and has not been studied when only sensor measurements are available. This paper presents Visual Morse Graph-Aided Estimation of Regions of Attraction in a Learned Latent Space (V-MORALS). V-MORALS takes in a dataset of image-based trajectories of a system under a given controller, and learns a latent space for reachability analysis. Using this learned latent space, our method is able to generate well-defined Morse Graphs, from which we can compute ROAs for various systems and controllers. V-MORALS provides capabilities similar to the original MORALS architecture without relying on state knowledge, and using only high-level sensor data. Our project website is at: this https URL.
[CV-92] All in One: Unifying Deepfake Detection Tampering Localization and Source Tracing with a Robust Landmark-Identity Watermark CVPR2026
[Quick Read]: This paper addresses the fragmentation of deepfake detection, tampering localization, and source tracing, which existing proactive forensics methods treat as independent tasks without a unified framework. The key is a unified proactive forensics framework built around a 152-dimensional structured landmark-identity watermark (LIDMark), which structurally interweaves facial landmarks with a unique source identifier to realize all three functions at once: a Factorized-Head Decoder (FHD) robustly recovers the embedded landmarks and the source identifier through separate regression and classification heads, even under severe distortion or tampering. The regression head drives an "intrinsic-extrinsic" consistency check for detection and localization, while the classification head decodes the source identifier for tracing, yielding a robust, imperceptible, all-in-one solution for detecting, localizing, and tracing deepfake content.
Link: https://arxiv.org/abs/2602.23523
Authors: Junjiang Wu, Liejun Wang, Zhiqing Guo
Affiliations: Xinjiang University; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026
Abstract:With the rapid advancement of deepfake technology, malicious face manipulations pose a significant threat to personal privacy and social security. However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. Our core framework adopts an innovative 152-dimensional landmark-identity watermark termed LIDMark, which structurally interweaves facial landmarks with a unique source identifier. To robustly extract the LIDMark, we design a novel Factorized-Head Decoder (FHD). Its architecture factorizes the shared backbone features into two specialized heads (i.e., regression and classification), robustly reconstructing the embedded landmarks and identifier, respectively, even when subjected to severe distortion or tampering. This design realizes an “all-in-one” trifunctional forensic solution: the regression head underlies an “intrinsic-extrinsic” consistency check for detection and localization, while the classification head robustly decodes the source identifier for tracing. Extensive experiments show that the proposed LIDMark framework provides a unified, robust, and imperceptible solution for the detection, localization, and tracing of deepfake content. The code is available at this https URL.
[CV-93] Modelling and Simulation of Neuromorphic Datasets for Anomaly Detection in Computer Vision
[Quick Read]: This paper addresses the scarcity of dynamic vision sensor (DVS) data, a limitation that severely constrains neuromorphic computer vision research, as community datasets typically offer few samples or narrow scenario coverage. The key is ANTShapes (Anomalous Neuromorphic Tool for Shapes), a novel dataset simulation framework built in the Unity engine that populates configurable abstract 3D scenes with objects whose behaviour attributes (such as motion and rotation) are randomly generated through a statistical process following central limit theorem principles, with anomalously acting objects labelled automatically. By adjusting only a small number of parameters, users can export datasets of arbitrary size with accompanying label and frame data, supporting bespoke data needs for object recognition, localisation, and anomaly detection.
Link: https://arxiv.org/abs/2602.23514
Authors: Mike Middleton, Teymoor Ali, Hakan Kayan, Basabdatta Sen Bhattacharya, Charith Perera, Oliver Rhodes, Elena Gheorghiu, Mark Vousden, Martin A. Trefzer
Affiliations: University of York; University of Stirling; Cardiff University; University of Manchester; University of Southampton
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: draft paper
Abstract:Limitations on the availability of Dynamic Vision Sensors (DVS) present a fundamental challenge to researchers of neuromorphic computer vision applications. In response, datasets have been created by the research community, but often contain a limited number of samples or scenarios. To address the lack of a comprehensive simulator of neuromorphic vision datasets, we introduce the Anomalous Neuromorphic Tool for Shapes (ANTShapes), a novel dataset simulation framework. Built in the Unity engine, ANTShapes simulates abstract, configurable 3D scenes populated by objects displaying randomly-generated behaviours describing attributes such as motion and rotation. The sampling of object behaviours, and the labelling of anomalously-acting objects, is a statistical process following central limit theorem principles. Datasets containing an arbitrary number of samples can be created and exported from ANTShapes, along with accompanying label and frame data, through the adjustment of a limited number of parameters within the software. ANTShapes addresses the limitations of data availability to researchers of event-based computer vision by allowing for the simulation of bespoke datasets to suit purposes including object recognition and localisation alongside anomaly detection.
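The statistical labelling of anomalous behaviours can be mimicked with Gaussian attribute sampling plus a z-score cutoff, in the spirit of the central-limit-theorem sampling the abstract describes. The attribute count, threshold, and labelling rule below are invented for the sketch and are not ANTShapes' actual parameters:

```python
import numpy as np

def sample_behaviours(n_objects, n_attrs=3, z_thresh=2.5, seed=0):
    """Sample per-object behaviour attributes (e.g. speed, rotation
    rate) from Gaussians and label as anomalous any object whose
    attribute deviates more than z_thresh standard deviations from
    the population mean."""
    rng = np.random.default_rng(seed)
    attrs = rng.normal(0.0, 1.0, size=(n_objects, n_attrs))
    z = np.abs(attrs - attrs.mean(axis=0)) / attrs.std(axis=0)
    anomalous = (z > z_thresh).any(axis=1)
    return attrs, anomalous

attrs, anomalous = sample_behaviours(5000)
rate = anomalous.mean()   # expected around a few percent for z_thresh=2.5
```

Raising `z_thresh` makes anomalies rarer and more extreme, which is the kind of single-parameter control over dataset difficulty that a simulator like this enables.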
[CV-94] DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation
【Quick Read】: This paper addresses the problem that current generative AI struggles to match human aesthetic judgment in graphic layout generation. Existing preference datasets and reward models built for text-to-image generation do not transfer to layout evaluation, because layout quality depends heavily on the spatial arrangement of identical elements rather than on content itself. The key to the solution is a large-scale, high-quality dataset of human-annotated preference pairs, DesignSense-10k (10,235 layout comparisons), curated through a five-stage pipeline (semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement) that ensures sensible and diverse comparison pairs. On this dataset the authors train DesignSense, a vision-language model specialised for layout preferences, which substantially outperforms existing open-source and proprietary baselines (up to a 54.6% improvement in Macro F1) and further delivers practical gains in downstream layout generation (about a 3% higher win rate in RL-based training and a 3.6% improvement from inference-time sampling).
Link: https://arxiv.org/abs/2602.23438
Authors: Varun Gopal,Rishabh Jain,Aradhya Mathur,Nikitha SR,Sohan Patnaik,Sudhir Yarram,Mayur Hemani,Balaji Krishnamurthy,Mausoom Sarkar
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures
Abstract:Graphic layouts serve as an important and engaging medium for visual communication across different channels. While recent layout generation models have demonstrated impressive capabilities, they frequently fail to align with nuanced human aesthetic judgment. Existing preference datasets and reward models trained on text-to-image generation do not generalize to layout evaluation, where the spatial arrangement of identical elements determines quality. To address this critical gap, we introduce DesignSense-10k, a large-scale dataset of 10,235 human-annotated preference pairs for graphic layout evaluation. We propose a five-stage curation pipeline that generates visually coherent layout transformations across diverse aspect ratios, using semantic grouping, layout prediction, filtering, clustering, and VLM-based refinement to produce high-quality comparison pairs. Human preferences are annotated using a 4-class scheme (left, right, both good, both bad) to capture subjective ambiguity. Leveraging this dataset, we train DesignSense, a vision-language model-based classifier that substantially outperforms existing open-source and proprietary models across comprehensive evaluation metrics (54.6% improvement in Macro F1 over the strongest proprietary baseline). Our analysis shows that frontier VLMs remain unreliable overall and fail catastrophically on the full four-class task, underscoring the need for specialized, preference-aware models. Beyond the dataset, our reward model DesignSense yields tangible downstream gains in layout generation. Using our judge during RL based training improves generator win rate by about 3%, while inference-time scaling, which involves generating multiple candidates and selecting the best one, provides a 3.6% improvement. These results highlight the practical impact of specialized, layout-aware preference modeling on real-world layout generation quality.
[CV-95] Demystifying Action Space Design for Robotic Manipulation Policies
【Quick Read】: This paper addresses the lack of a systematic understanding of action-space design in imitation-learning-based robotic manipulation policies, where current practice relies largely on ad-hoc heuristics or legacy designs, leaving policy design philosophies poorly understood. The key to the solution is a large-scale empirical study (13,000+ real-world rollouts and 500+ trained models, evaluated across four scenarios) that systematically analyses the structured impact of the action space along temporal and spatial axes. It shows that properly designing policies to predict delta actions consistently improves performance, while joint-space and task-space parameterizations offer complementary strengths in control stability and generalization, respectively.
Link: https://arxiv.org/abs/2602.23408
Authors: Yuchun Feng,Jinliang Zheng,Zhihao Wang,Dongxiu Liu,Jianxiong Li,Jiangmiao Pang,Tai Wang,Xianyuan Zhan
Institutions: Unknown
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:The specification of the action space plays a pivotal role in imitation-based robotic manipulation policy learning, fundamentally shaping the optimization landscape of policy learning. While recent advances have focused heavily on scaling training data and model capacity, the choice of action space remains guided by ad-hoc heuristics or legacy designs, leading to an ambiguous understanding of robotic policy design philosophies. To address this ambiguity, we conducted a large-scale and systematic empirical study, confirming that the action space does have significant and complex impacts on robotic policy learning. We dissect the action design space along temporal and spatial axes, facilitating a structured analysis of how these choices govern both policy learnability and control stability. Based on 13,000+ real-world rollouts on a bimanual robot and evaluation on 500+ trained models over four scenarios, we examine the trade-offs between absolute vs. delta representations, and joint-space vs. task-space parameterizations. Our large-scale results suggest that properly designing the policy to predict delta actions consistently improves performance, while joint-space and task-space representations offer complementary strengths, favoring control stability and generalization, respectively.
[CV-96] Leveraging large multimodal models for audio-video deepfake detection: a pilot study ICASSP2026
【Quick Read】: This paper addresses the poor cross-domain generalization of current audio-visual deepfake detection (AVD) models, which are typically small and task-specific: they perform well on curated test sets but struggle to adapt to diverse real-world scenarios. The key to the solution is AV-LMMDetect, a supervised fine-tuned (SFT) multimodal detection framework built on a large language model architecture that casts AVD as a prompted binary classification task ("Is this video real or fake?"). A two-stage training strategy achieves efficient alignment: lightweight LoRA (Low-Rank Adaptation) first aligns the audio and visual streams, followed by full fine-tuning of the entire audio-visual encoder, substantially improving detection performance and generalization in complex scenarios.
Link: https://arxiv.org/abs/2602.23393
Authors: Songjun Cao(1),Yuqi Li(1 and 2),Yunpeng Luo(1),Jianjun Yin(2),Long Ma(1) ((1) Tencent YouTu Lab, China, (2) Fudan University, China)
Institutions: Unknown
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, ICASSP 2026
Abstract:Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification - “Is this video real or fake?”. Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by audio-visual encoder full fine-tuning. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on Mavos-DD datasets.
[CV-97] Extending 2D foundational DINOv3 representations to 3D segmentation of neonatal brain MR images
【Quick Read】: This paper addresses precise volumetric delineation of hippocampal structures in pre-term and term infants, where the core challenge is exploiting pretrained 2D vision foundation models to accurately segment brain structures with complex 3D spatial organization; conventional 2D feature extraction struggles to capture 3D anatomical consistency. The key to the solution is a volumetric segmentation strategy based on a structured window disassembly-reassembly mechanism: the whole MRI volume is decomposed into non-overlapping 3D sub-cubes, each processed by a separate decoding arm built on frozen high-fidelity 2D features, with predictions reassembled and aligned pixel-wise through a dense-prediction head in the final stage. This design keeps the decoder memory footprint constant while forcing predictions to respect anatomically consistent geometry, recovering full 3D anatomical structure from frozen 2D foundation representations.
Link: https://arxiv.org/abs/2602.23962
Authors: Annayah Usman,Behraj Khan,Tahir Qasim Syed
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Precise volumetric delineation of hippocampal structures is essential for quantifying neurodevelopmental trajectories in pre-term and term infants, where subtle morphological variations may carry prognostic significance. While foundation encoders trained on large-scale visual data offer discriminative representations, their 2D formulation is a limitation with respect to the 3D organization of brain anatomy. We propose a volumetric segmentation strategy that reconciles this tension through a structured window-based disassembly-reassembly mechanism: the global MRI volume is decomposed into non-overlapping 3D windows or sub-cubes, each processed via a separate decoding arm built upon frozen high-fidelity features, and subsequently reassembled prior to ground-truth correspondence using a dense-prediction head. This architecture preserves a constant decoder memory footprint while forcing predictions to lie within an anatomically consistent geometry. Evaluated on the ALBERT dataset for hippocampal segmentation, the proposed approach achieves a Dice score of 0.65 for a single 3D window. The method demonstrates that volumetric anatomical structure can be recovered from frozen 2D foundation representations through structured compositional decoding, and offers a principled and generalizable extension of foundation models to 3D medical applications.
[CV-98] Clinically-aligned ischemic stroke segmentation and ASPECTS scoring on NCCT imaging using a slice-gated loss on foundation representations
【Quick Read】: This paper addresses rapid and accurate infarct assessment on non-contrast CT (NCCT) for acute ischemic stroke (AIS) patients, specifically automated segmentation for ASPECTS (Alberta Stroke Program Early CT Score) scoring. Existing deep learning methods mostly perform pixel-wise segmentation and fail to model the coupled anatomical reasoning between the basal ganglia (BG) and supraganglionic (SG) levels used in clinical practice, yielding segmentations inconsistent with clinical interpretation. The key to the solution is a clinically aligned framework that combines a frozen DINOv3 backbone for foundation representations with a lightweight decoder, and introduces a Territory-Aware Gated Loss (TAGL) that enforces BG-SG consistency during training, improving anatomical plausibility and clinical alignment. The method significantly outperforms conventional CNN and foundation-model baselines on both the AISD and a proprietary ASPECTS dataset, without adding inference-time complexity.
Link: https://arxiv.org/abs/2602.23961
Authors: Hiba Azeem,Behraj Khan,Tahir Qasim Syed
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Rapid infarct assessment on non-contrast CT (NCCT) is essential for acute ischemic stroke management. Most deep learning methods perform pixel-wise segmentation without modeling the structured anatomical reasoning underlying ASPECTS scoring, where basal ganglia (BG) and supraganglionic (SG) levels are clinically interpreted in a coupled manner. We propose a clinically aligned framework that combines a frozen DINOv3 backbone with a lightweight decoder and introduce a Territory-Aware Gated Loss (TAGL) to enforce BG-SG consistency during training. This anatomically informed supervision adds no inference-time complexity. Our method achieves a Dice score of 0.6385 on AISD, outperforming prior CNN and foundation-model baselines. On a proprietary ASPECTS dataset, TAGL improves mean Dice from 0.698 to 0.767. These results demonstrate that integrating foundation representations with structured clinical priors improves NCCT stroke segmentation and ASPECTS delineation.
[CV-99] Polarization Uncertainty-Guided Diffusion Model for Color Polarization Image Demosaicking AAAI2026
【Quick Read】: This paper addresses reconstruction errors in polarization characteristics for color polarization demosaicking (CPDM), caused by the difficulty of predicting numerous missing pixels and the scarcity of high-quality training data, which particularly degrades recovery of the degree of polarization (DOP) and angle of polarization (AOP). The key to the solution is introducing the image diffusion prior from text-to-image (T2I) models to compensate for the limited representational capacity of existing network-based methods under restricted data distributions; in addition, polarization uncertainty is explicitly modelled during reconstruction and used to guide the diffusion model toward optimising high-error regions, achieving recovery of polarization characteristics with both high fidelity and strong visual perception.
Link: https://arxiv.org/abs/2602.23847
Authors: Chenggong Li,Yidong Luo,Junchao Zhang,Degui Yang
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026
Abstract:Color polarization demosaicking (CPDM) aims to reconstruct full-resolution polarization images of four directions from the color-polarization filter array (CPFA) raw image. Due to the challenge of predicting numerous missing pixels and the scarcity of high-quality training data, existing network-based methods, despite effectively recovering scene intensity information, still exhibit significant errors in reconstructing polarization characteristics (degree of polarization, DOP, and angle of polarization, AOP). To address this problem, we introduce the image diffusion prior from text-to-image (T2I) models to overcome the performance bottleneck of network-based methods, with the additional diffusion prior compensating for limited representational capacity caused by restricted data distribution. To effectively leverage the diffusion prior, we explicitly model the polarization uncertainty during reconstruction and use uncertainty to guide the diffusion model in recovering high error regions. Extensive experiments demonstrate that the proposed method accurately recovers scene polarization characteristics with both high fidelity and strong visual perception.
[CV-100] Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning
【Quick Read】: This paper addresses key challenges in DICOM series classification, including heterogeneous slice content, variable series length, and missing, incomplete, or inconsistent DICOM metadata, all of which undermine the accuracy and reliability of large-scale medical image analysis. The key to the solution is an end-to-end multimodal framework with three core designs: (i) image content and metadata are encoded with modality-aware modules and fused via bi-directional cross-modal attention; (ii) a sparse, missingness-aware metadata encoder is built on learnable feature dictionaries and value-conditioned modulation, requiring no imputation; and (iii) variability in series length and image dimensions is handled by a 2.5D visual encoder with attention over equidistantly sampled slices. Evaluations on real clinical datasets show the method outperforms image-only, metadata-only, and conventional 2D/3D multimodal baselines, demonstrating that explicitly modelling metadata sparsity and cross-modal interactions substantially improves the robustness of DICOM series classification.
Link: https://arxiv.org/abs/2602.23833
Authors: Tuan Truong,Melanie Dohmen,Sara Lorio,Matthias Lenga
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image only, metadata-only and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.
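The missingness-aware metadata encoder of point (ii) can be illustrated with a minimal sketch: a dictionary of per-field embeddings, value-conditioned gating, and pooling over only the fields that are present. All names, dimensions, and the gating function are hypothetical; constants stand in for the paper's learned weights.

```python
import math

# Hypothetical learnable dictionary: one embedding vector per metadata field.
DIM = 4
field_dict = {f: [0.1 * (f + 1)] * DIM for f in range(10)}  # stands in for learned weights

def encode_metadata(present):
    """present: list of (field_id, normalised_value) pairs for fields that exist.
    Missing fields are simply absent -- no imputation is performed."""
    if not present:
        return [0.0] * DIM
    tokens = []
    for field_id, value in present:
        gate = 1.0 / (1.0 + math.exp(-value))       # value-conditioned modulation
        tokens.append([gate * w for w in field_dict[field_id]])
    # Mean-pool the present tokens into a single metadata representation.
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(DIM)]

rep = encode_metadata([(0, 0.5), (3, 1.2)])
print(len(rep))  # 4
```

Because absent fields never enter the token list, a series with sparse metadata produces a valid representation without any imputed values, which is the design property the abstract emphasises.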
[CV-101] BiM-GeoAttn-Net: Linear-Time Depth Modeling with Geometry-Aware Attention for 3D Aortic Dissection CTA Segmentation
【Quick Read】: This paper addresses the accuracy of 3D lumen segmentation for aortic dissection (AD) in CT angiography (CTA), particularly the poor inter-slice coherence caused by insufficient long-range context modelling and the weak structural discrimination under low-contrast conditions. The key to the solution is the BiM-GeoAttn-Net framework, which couples linear-time depth-wise state-space modelling with geometry-aware vessel refinement: a Bidirectional Depth Mamba (BiM) module efficiently captures cross-slice dependencies, while a Geometry-Aware Vessel Attention (GeoAttn) module uses orientation-sensitive anisotropic filtering to enhance tubular structures and sharpen ambiguous boundaries, achieving accurate and computationally efficient 3D AD segmentation.
Link: https://arxiv.org/abs/2602.23803
Authors: Yuan Zhang,Lei Liu,Jialin Zhang,Ya-Nan Zhang,Ling Wang,Nan Mu
Institutions: Sichuan Normal University; Zhejiang University; Ant Group; Guizhou University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate segmentation of aortic dissection (AD) lumens in CT angiography (CTA) is essential for quantitative morphological assessment and clinical decision-making. However, reliable 3D delineation remains challenging due to limited long-range context modeling, which compromises inter-slice coherence, and insufficient structural discrimination under low-contrast conditions. To address these limitations, we propose BiM-GeoAttn-Net, a lightweight framework that integrates linear-time depth-wise state-space modeling with geometry-aware vessel refinement. Our approach features a Bidirectional Depth Mamba (BiM) module that efficiently captures cross-slice dependencies and a Geometry-Aware Vessel Attention (GeoAttn) module that employs orientation-sensitive anisotropic filtering to refine tubular structures and sharpen ambiguous boundaries. Extensive experiments on a multi-source AD CTA dataset demonstrate that BiM-GeoAttn-Net achieves a Dice score of 93.35% and an HD95 of 12.36 mm, outperforming representative CNN-, Transformer-, and SSM-based baselines in overlap metrics while maintaining competitive boundary accuracy. These results suggest that coupling linear-time depth modeling with geometry-aware refinement provides an effective, computationally efficient solution for robust 3D AD segmentation.
[CV-102] FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy CVPR2026
【Quick Read】: This paper addresses the accuracy of focus quality assessment (FQA) in fluorescence microscopy, where the core challenge is that different fluorescent stains exhibit markedly heterogeneous and abrupt focus shifts due to their optical properties, while existing datasets and models ignore this stain dependence and treat FQA as a stain-agnostic task. The key to the solution is the new formulation of stain-aware FQA, supported by FluoMix, the first stain-aware FQA dataset covering multiple tissues, stains, and focus variations, and FluoCLIP, a two-stage vision-language framework: the first stage learns general stain representations via CLIP alignment (stain grounding), and the second stage uses stain-specific rank prompts for ordinal focus prediction (stain-guided ranking), substantially improving generalization under diverse staining conditions.
Link: https://arxiv.org/abs/2602.23791
Authors: Hyejin Park,Jiwon Yoon,Sumin Park,Suree Kim,Sinae Jang,Eunsoo Lee,Dongmin Kang,Dongbo Min
Institutions: Ewha Womans University; Daesang
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026 (preview), Project Page: this https URL
Abstract:Accurate focus quality assessment (FQA) in fluorescence microscopy remains challenging, as the stain-dependent optical properties of fluorescent dyes cause abrupt and heterogeneous focus shifts. However, existing datasets and models overlook this variability, treating focus quality as a stain-agnostic problem. In this work, we formulate the task of stain-aware FQA, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. Through quantitative analysis of existing datasets (FocusPath, BBBC006) and our newly curated FluoMix, we demonstrate that focus-rank relationships vary substantially across stains, underscoring the need for stain-aware modeling in fluorescence microscopy. To support this new formulation, we propose FluoMix, the first dataset for stain-aware FQA that encompasses multiple tissues, fluorescent stains, and focus variations. Building on this dataset, we propose FluoCLIP, a two-stage vision-language framework that leverages CLIP’s alignment capability to interpret focus quality in the context of biological staining. In the stain-grounding phase, FluoCLIP learns general stain representations by aligning textual stain tokens with visual features, while in the stain-guided ranking phase, it optimizes stain-specific rank prompts for ordinal focus prediction. Together, our formulation, dataset, and framework establish the first foundation for stain-aware FQA, and FluoCLIP achieves strong generalization across diverse fluorescence microscopy conditions.
[CV-103] Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models
【Quick Read】: This paper addresses the severe performance degradation of medical vessel segmentation methods under data scarcity and domain shift, since in clinical practice it is impractical to collect large annotated datasets for every new scanner or imaging protocol. The key to the solution is adapting a pretrained vision foundation model (DINOv3) with three core components: a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding that effectively bridges the gap between 2D pretraining and 3D medical modalities, enabling accurate capture of continuous vascular structures from very few samples (e.g., 5 training cases). Experiments show the method substantially outperforms current mainstream approaches (such as nnU-Net, SwinUNETR, and UNETR) on both in-domain (TopCoW) and out-of-distribution (Lausanne) datasets, with particularly strong robustness and generalization in extreme few-shot and cross-domain settings.
Link: https://arxiv.org/abs/2602.23782
Authors: Kirato Yoshihara,Yohei Sugawara,Yuta Tokuoka,Lihang Hong
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 2 tables
Abstract:State-of-the-art vessel segmentation methods typically require large-scale annotated datasets and suffer from severe performance degradation under domain shifts. In clinical practice, however, acquiring extensive annotations for every new scanner or protocol is unfeasible. To address this, we propose a novel framework leveraging a pre-trained Vision Foundation Model (DINOv3) adapted for volumetric vessel segmentation. We introduce a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to effectively bridge the gap between 2D pre-training and 3D medical modalities, enabling the model to capture continuous vascular structures from limited data. We validated our method on the TopCoW (in-domain) and Lausanne (out-of-distribution) datasets. In the extreme few-shot regime with 5 training samples, our method achieved a Dice score of 43.42%, marking a 30% relative improvement over the state-of-the-art nnU-Net (33.41%) and outperforming other Transformer-based baselines, such as SwinUNETR and UNETR, by up to 45%. Furthermore, in the out-of-distribution setting, our model demonstrated superior robustness, achieving a 50% relative improvement over nnU-Net (21.37% vs. 14.22%), which suffered from severe domain overfitting. Ablation studies confirmed that our 3D adaptation mechanism and multi-scale aggregation strategy are critical for vascular continuity and robustness. Our results suggest foundation models offer a viable cold-start solution, improving clinical reliability under data scarcity or domain shifts.
[CV-104] VideoPulse: Neonatal heart rate and peripheral capillary oxygen saturation (SpO2) estimation from contact free video
【Quick Read】: This paper addresses non-invasive, contact-free vital sign monitoring for neonates in intensive care, where conventional methods rely on adhesive skin probes that risk skin irritation and infection. The core solution is the VideoPulse dataset and an end-to-end deep learning pipeline that estimates heart rate (HR) and peripheral capillary oxygen saturation (SpO2) from facial video. Its key elements are artifact-aware supervision using denoised pulse oximeter signals, 3D CNN backbones, and label distribution smoothing with weighted regression for SpO2 prediction, yielding stable outputs over 2-second windows while demonstrating cross-dataset transferability and clinical practicality.
Link: https://arxiv.org/abs/2602.23771
Authors: Deependra Dewagiri,Kamesh Anuradha,Pabadhi Liyanage,Helitha Kulatunga,Pamuditha Somarathne,Udaya S. K. P. Miriya Thanthrige,Nishani Lucas,Anusha Withana,Joshua P. Kulasingham
Institutions: University of Moratuwa, Sri Lanka; The University of Sydney, Australia; University of Colombo, Sri Lanka
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 5 tables. Preprint. Intended for submission to an IEEE Journal
Abstract:Remote photoplethysmography (rPPG) enables contact free monitoring of vital signs and is especially valuable for neonates, since conventional methods often require sustained skin contact with adhesive probes that can irritate fragile skin and increase infection control burden. We present VideoPulse, a neonatal dataset and an end to end pipeline that estimates neonatal heart rate and peripheral capillary oxygen saturation (SpO2) from facial video. VideoPulse contains 157 recordings totaling 2.6 hours from 52 neonates with diverse face orientations. Our pipeline performs face alignment and artifact aware supervision using denoised pulse oximeter signals, then applies 3D CNN backbones for heart rate and SpO2 regression with label distribution smoothing and weighted regression for SpO2. Predictions are produced in 2 second windows. On the NBHR neonatal dataset, we obtain heart rate MAE 2.97 bpm using 2 second windows (2.80 bpm at 6 second windows) and SpO2 MAE 1.69 percent. Under cross dataset evaluation, the NBHR trained heart rate model attains 5.34 bpm MAE on VideoPulse, and fine tuning an NBHR pretrained SpO2 model on VideoPulse yields MAE 1.68 percent. These results indicate that short unaligned neonatal video segments can support accurate heart rate and SpO2 estimation, enabling low cost non invasive monitoring in neonatal intensive care.
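The 2-second prediction windows mentioned in the abstract amount to simple non-overlapping windowing of the frame stream, with one HR/SpO2 estimate per window; a minimal sketch (function and parameter names are illustrative, not the paper's API):

```python
def window_signal(frames, fps, window_s=2.0):
    """Split a frame sequence into non-overlapping windows of window_s seconds;
    each window would feed one HR/SpO2 prediction."""
    step = int(fps * window_s)
    return [frames[i:i + step] for i in range(0, len(frames) - step + 1, step)]

frames = list(range(300))              # e.g. 10 s of video at 30 fps
windows = window_signal(frames, fps=30)
print(len(windows), len(windows[0]))   # 5 60
```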
[CV-105] Unsupervised Causal Prototypical Networks for De-biased Interpretable Dermoscopy Diagnosis
【Quick Read】: This paper addresses the limited clinical trust in deep learning for dermoscopy image analysis caused by its black-box nature, together with shortcut learning induced by unavoidable selection bias in clinical data, whereby models erroneously encode environmental confounders as predictive prototypes and thus produce misleading visual evidence. The key to the solution is CausalProto, an unsupervised causal prototypical network grounded in a structural causal model: an information-bottleneck-constrained encoder achieves unsupervised orthogonal disentanglement of pathological features from environmental confounders; the decoupled representations are mapped into independent prototypical spaces; and the learned spurious dictionary is used to perform backdoor adjustment via do-calculus, turning complex causal interventions into efficient expectation pooling that marginalizes environmental noise, thereby improving interpretability and the purity of visual evidence without sacrificing diagnostic performance.
Link: https://arxiv.org/abs/2602.23752
Authors: Junhao Jia,Yueyi Wu,Huangwei Chen,Haodong Jing,Haishuai Wang,Jiajun Bu,Lei Wu
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite the success of deep learning in dermoscopy image analysis, its inherent black-box nature hinders clinical trust, motivating the use of prototypical networks for case-based visual transparency. However, inevitable selection bias in clinical data often drives these models toward shortcut learning, where environmental confounders are erroneously encoded as predictive prototypes, generating spurious visual evidence that misleads medical decision-making. To mitigate these confounding effects, we propose CausalProto, an Unsupervised Causal Prototypical Network that fundamentally purifies the visual evidence chain. Framed within a Structural Causal Model, we employ an Information Bottleneck-constrained encoder to enforce strict unsupervised orthogonal disentanglement between pathological features and environmental confounders. By mapping these decoupled representations into independent prototypical spaces, we leverage the learned spurious dictionary to perform backdoor adjustment via do-calculus, transforming complex causal interventions into efficient expectation pooling to marginalize environmental noise. Extensive experiments on multiple dermoscopy datasets demonstrate that CausalProto achieves superior diagnostic performance and consistently outperforms standard black box models, while simultaneously providing transparent and high purity visual interpretability without suffering from the traditional accuracy compromise.
[CV-106] Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis
【Quick Read】: This paper addresses the limitations of whole-slide image (WSI) cancer prognostication, where conventional attention-based multiple instance learning (MIL) ignores spatial organization and graph-based MIL relies on static handcrafted graphs. The key to the solution is the Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN), which enforces a hierarchy with spatial locality constraints: local cellular-level dynamic graphs aggregate spatially proximate patches within each region of interest (ROI), while a global slide-level dynamic graph integrates ROI features into a WSI-level representation. In addition, multi-scale fusion at the ROI level combines coarse contextual features with fine-grained structures from local patch-graph aggregation, more accurately capturing the spatial hierarchy and multi-scale information within WSIs.
Link: https://arxiv.org/abs/2602.23557
Authors: Bin Xu,Yufei Zhou,Boling Song,Jingwen Sun,Yang Bian,Cheng Lu,Ye Wu,Jianfei Tu,Xiangxue Wang
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 1 figure, 2 tables, ISBI 2026
Abstract:We propose a Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) that models multi-scale interactions and spatially hierarchical relationships within whole-slide images (WSIs) for cancer prognostication. Unlike conventional attention-based MIL, which ignores spatial organization, or graph-based MIL, which relies on static handcrafted graphs, HMKGN enforces a hierarchical structure with spatial locality constraints, wherein local cellular-level dynamic graphs aggregate spatially proximate patches within each region of interest (ROI) and a global slide-level dynamic graph integrates ROI-level features into WSI-level representations. Moreover, multi-scale integration at the ROI level combines coarse contextual features from broader views with fine-grained structural representations from local patch-graph aggregation. We evaluate HMKGN on four TCGA cohorts (KIRC, LGG, PAAD, and STAD; N=513, 487, 138, and 370) for survival prediction. It consistently outperforms existing MIL-based models, yielding improved concordance indices (10.85% better) and statistically significant stratification of patient survival risk (log-rank p < 0.05).
[CV-107] Automated Dose-Based Anatomic Region Classification of Radiotherapy Treatment for Big Data Applications
【Quick Read】: This paper addresses the standardization challenge in curating large-scale radiotherapy planning databases (100,000+ patients), where inconsistent anatomic-site labelling, especially the reliance on non-uniform plan labels or target nomenclature across institutions, undermines reliability. The key to the solution is automated software that infers anatomic regions directly from dose-volume overlap with deep-learning segmentations, eliminating any dependence on metadata: deep learning segments 118 structures (organs, glands, and bones), the 85% and 50% isodose lines are converted to structures to compute organ-specific dose-overlap metrics, and each radiotherapy plan is then assigned ranked anatomic-region labels. Validated on 100 clinical plans, the algorithm achieves 95% Top-1 accuracy, showing it can reliably identify the primary treatment site and is well suited to "big data" radiotherapy research.
Link: https://arxiv.org/abs/2602.23536
Authors: Justin Hink,Yasin Abdulkadir,Jack Neylon,James Lamb
Institutions: Unknown
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 3 figures, 2 tables, 1 supplemental table; references arXiv:2411.08876
Abstract:Curation is a significant barrier to using ‘big data’ radiotherapy planning databases of 100,000+ patients. Anatomic site stratification is essential for downstream analyses, but current methods rely on inconsistent plan labels or target nomenclature, which is unreliable for multi-institutional data. We developed software to automate labeling by inferring anatomic regions directly from dose-volume overlap with deep-learning segmentations, eliminating metadata reliance. The software processes DICOM files in bulk, utilizing deep learning to segment 118 structures (organs, glands, and bones) categorized into six regions: Cranial, Head and Neck, Pelvis, Abdomen, Thorax, Extremity. The 85% and 50% isodose lines are converted to structures to compute organ-specific dose-overlap metrics. Plans are assigned ranked regional labels based on these intersections. The algorithm was refined using 109 expert-labeled cases and validated on 100 consecutive clinical plans. On the 100-plan test dataset, the algorithm achieved 91% Exact Accuracy (matching all expert labels and order), 94% Top-2 Accuracy (matching the top two expert regions regardless of order), and 95% Top-1 Accuracy (matching the primary expert label). The automated workflow demonstrated high accuracy and robustness. The 95% Top-1 Accuracy is particularly significant, as it enables reliable querying of plans based on the primary treatment site. Detailed analysis of the few mismatched cases showed most were treated areas at the border between anatomic regions and were ambiguous between these two regions in a common-sense interpretation. This algorithm provides a scalable, standardized solution for curating the large, multi-institutional datasets required for ‘big data’ in radiotherapy and provides an important complement to text-based approaches.
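The core labelling idea, ranking anatomic regions by how much of each region's segmented volume falls inside an isodose structure, can be sketched with toy voxel sets (the overlap metric and values here are simplified assumptions, not the validated algorithm):

```python
# Toy voxel sets stand in for binary masks; the real pipeline would use
# deep-learning segmentations and the 85%/50% isodose structures from DICOM.
isodose_85 = {(x, 0, 0) for x in range(40)}
region_masks = {
    "Thorax":  {(x, 0, 0) for x in range(0, 30)},
    "Abdomen": {(x, 0, 0) for x in range(25, 45)},
    "Pelvis":  {(x, 0, 0) for x in range(100, 120)},
}

def rank_regions(isodose, masks):
    """Rank anatomic regions by the fraction of each region's mask covered
    by the isodose structure (a simplified stand-in for the paper's metric)."""
    scores = {r: len(isodose & m) / len(m) for r, m in masks.items()}
    return sorted(((s, r) for r, s in scores.items() if s > 0), reverse=True)

ranked = rank_regions(isodose_85, region_masks)
print([r for _, r in ranked])  # ['Thorax', 'Abdomen']
```

Regions with zero overlap (here "Pelvis") drop out, and the top-ranked region plays the role of the primary treatment-site label that the paper's Top-1 accuracy evaluates.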
[CV-108] Few-Shot Continual Learning for 3D Brain MRI with Frozen Foundation Models
【Quick Read】: This paper addresses the difficulty of adapting foundation models pretrained on large-scale 3D medical imaging to multiple downstream tasks under continual learning with few labelled samples, where the core challenge is avoiding catastrophic forgetting, especially without replay of previous task data. The key to the solution is combining a frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules: each new task trains only a dedicated LoRA adapter and its output head while the backbone remains unchanged, achieving zero forgetting (BWT = 0) with only 0.1% trainable parameters per task, and markedly improving resource efficiency and balanced performance across tasks.
Link: https://arxiv.org/abs/2602.23533
Authors: Chi-Sheng Chen,Xinyu Zhang,Guan-Ying Chen,Qiuzhe Xie,Fan Zhang,En-Jui Kuo
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Foundation models pretrained on large-scale 3D medical imaging data face challenges when adapted to multiple downstream tasks under continual learning with limited labeled data. We address few-shot continual learning for 3D brain MRI by combining a frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules. Tasks arrive sequentially – tumor segmentation (BraTS) and brain age estimation (IXI) – with no replay of previous task data. Each task receives a dedicated LoRA adapter; only the adapter and task-specific head are trained while the backbone remains frozen, thereby eliminating catastrophic forgetting by design (BWT=0). In continual learning, sequential full fine-tuning suffers severe forgetting (T1 Dice drops from 0.80 to 0.16 after T2), while sequential linear probing achieves strong T1 (Dice 0.79) but fails on T2 (MAE 1.45). Our LoRA approach achieves the best balanced performance across both tasks: T1 Dice 0.62 ± 0.07, T2 MAE 0.16 ± 0.05, with zero forgetting and 0.1% trainable parameters per task, though with noted systematic age underestimation in T2 (Wilcoxon p < 0.001). Frozen foundation models with task-specific LoRA adapters thus offer a practical solution when both tasks must be maintained under few-shot continual learning.
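The adapter mechanism is easy to illustrate: the effective weight for a task is the frozen base weight plus that task's low-rank product B @ A, so training a new adapter cannot disturb any other task. A toy sketch with hypothetical 2x2 weights and rank-1 adapters (pure Python lists stand in for tensors):

```python
# Frozen base weight W plus a per-task low-rank update B @ A (rank 1).
# Only each task's (B, A) pair is trained; W never changes, so a new task
# cannot overwrite an earlier one -- forgetting is zero by construction.
W = [[1.0, 0.0], [0.0, 1.0]]                         # frozen backbone weight
adapters = {
    "segmentation": ([[0.5], [0.0]], [[0.1, 0.2]]),  # B (2x1), A (1x2)
    "age":          ([[0.0], [0.5]], [[0.3, 0.0]]),
}

def effective_weight(task):
    """W + B @ A for the requested task; dimensions here are toy-sized."""
    B, A = adapters[task]
    rank = len(A)
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))] for i in range(len(W))]

w_seg = effective_weight("segmentation")
print(w_seg)
```

At inference, the appropriate adapter is selected per task; storing one small (B, A) pair per task is what keeps the per-task trainable-parameter count near 0.1%.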
[CV-109] SegReg: Latent Space Regularization for Improved Medical Image Segmentation
【Quick Read】: This paper addresses the problem that medical image segmentation models trained only with voxel-wise losses leave their latent feature representations unconstrained, limiting generalisation. The key to the solution is the SegReg framework, which applies latent-space regularisation to the feature maps of U-Net models to encourage structured embeddings and improve feature consistency, while remaining fully compatible with standard segmentation losses; without extra parameters or memory mechanisms, it improves cross-domain generalisation and continual-learning performance.
Link: https://arxiv.org/abs/2602.23509
Authors: Puru Vaish,Amin Ranem,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 2 tables, under review
Abstract:Medical image segmentation models are typically optimised with voxel-wise losses that constrain predictions only in the output space. This leaves latent feature representations largely unconstrained, potentially limiting generalisation. We propose SegReg, a latent-space regularisation framework that operates on feature maps of U-Net models to encourage structured embeddings while remaining fully compatible with standard segmentation losses. Integrated with the nnU-Net framework, we evaluate SegReg on prostate, cardiac, and hippocampus segmentation and demonstrate consistent improvements in domain generalisation. Furthermore, we show that explicit latent regularisation improves continual learning by reducing task drift and enhancing forward transfer across sequential tasks without adding memory or any extra parameters. These results highlight latent-space regularisation as a practical approach for building more generalisable and continual-learning-ready models.
[CV-110] SGDC: Structurally-Guided Dynamic Convolution for Medical Image Segmentation
【速读】[Quick Read]: This paper targets the loss of high-frequency spatial detail caused by generating dynamic convolution kernels via average pooling in mainstream medical image segmentation methods, which leads to over-smoothed predictions and degraded fidelity of fine clinical structures. The key to the solution is a Structure-Guided Dynamic Convolution (SGDC) mechanism: an explicitly supervised structure-extraction branch provides high-fidelity boundary information that guides the generation of dynamic kernels and gating signals, enabling spatially precise feature modulation. By replacing conventional context aggregation with pixel-wise structural guidance, the design avoids the information loss introduced by average pooling and markedly improves boundary accuracy (reducing HD95 by 2.05) and intersection-over-union (IoU gains of 0.99%-1.49%).
Link: https://arxiv.org/abs/2602.23496
Authors: Bo Shi,Wei-ping Zhu,M.N.S. Swamy
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Spatially variant dynamic convolution provides a principled approach of integrating spatial adaptivity into deep neural networks. However, mainstream designs in medical segmentation commonly generate dynamic kernels through average pooling, which implicitly collapses high-frequency spatial details into a coarse, spatially-compressed representation, leading to over-smoothed predictions that degrade the fidelity of fine-grained clinical structures. To address this limitation, we propose a novel Structure-Guided Dynamic Convolution (SGDC) mechanism, which leverages an explicitly supervised structure-extraction branch to guide the generation of dynamic kernels and gating signals for structure-aware feature modulation. Specifically, the high-fidelity boundary information from this auxiliary branch is fused with semantic features to enable spatially-precise feature modulation. By replacing context aggregation with pixel-wise structural guidance, the proposed design effectively prevents the information loss introduced by average pooling. Experimental results show that SGDC achieves state-of-the-art performance on ISIC 2016, PH2, ISIC 2018, and CoNIC datasets, delivering superior boundary fidelity by reducing the Hausdorff Distance (HD95) by 2.05, and providing consistent IoU gains of 0.99%-1.49% over pooling-based baselines. Moreover, the mechanism exhibits strong potential for extension to other fine-grained, structure-sensitive vision tasks, such as small-object detection, offering a principled solution for preserving structural integrity in medical image analysis. To facilitate reproducibility and encourage further research, the implementation code for both our SGE and SGDC modules has been publicly released at this https URL.
[CV-111] Multiprojective Geometry of Compatible Triples of Fundamental and Essential Matrices
【速读】[Quick Read]: This paper characterizes the algebraic structure of compatible fundamental matrix triples in geometric computer vision, specifically by computing their multidegree and multihomogeneous vanishing ideal, thereby fully describing the algebraic constraints on such triples. Previously known constraint sets in the literature are all incomplete (they do not generate the vanishing ideal) and often make restrictive assumptions about how a matrix triple is scaled. The key breakthrough is a simple set of quartic constraints that vanish on compatible fundamental matrix triples; in the essential-matrix setting, these quartics together with previously known constraints locally cut out the variety of compatible essential matrix triples, substantially improving on the existing theory.
Link: https://arxiv.org/abs/2602.23450
Authors: Timothy Duff,Viktor Korotynskiy,Anton Leykin,Tomas Pajdla
Affiliations: Unknown
Subjects: Commutative Algebra (math.AC); Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
Comments: 17 pages, 2 figures
Abstract:We characterize the variety of compatible fundamental matrix triples by computing its multidegree and multihomogeneous vanishing ideal. This answers the first interesting case of a question recently posed by Bråtelund and Rydell. Our result improves upon previously discovered sets of algebraic constraints in the geometric computer vision literature, which are all incomplete (as they do not generate the vanishing ideal) and sometimes make restrictive assumptions about how a matrix triple should be scaled. Our discussion touches more broadly on generalized compatibility varieties, whose multihomogeneous vanishing ideals are much less well understood. One of our key new discoveries is a simple set of quartic constraints vanishing on compatible fundamental matrix triples. These quartics are also significant in the setting of essential matrices: together with some previously known constraints, we show that they locally cut out the variety of compatible essential matrix triples.
[CV-112] SALIENT: Frequency-Aware Paired Diffusion for Controllable Long-Tail CT Detection
【速读】[Quick Read]: This paper tackles the extreme class imbalance and very low target-to-volume ratios of rare-lesion detection in whole-body CT, which cause precision to collapse even when AUROC is high. Conventional pixel-space diffusion models are computationally expensive, and existing mask-conditioned generation lacks attribute-level controllability and paired supervision. The key to the solution is SALIENT, a mask-conditioned wavelet-domain diffusion framework that performs structured diffusion over discrete wavelet coefficients, explicitly separating low-frequency brightness from high-frequency structural detail. Learnable frequency-aware objectives disentangle target and background attributes (structure, contrast, edge fidelity), enabling interpretable and stable optimization. Combined with a 3D variational autoencoder that generates diverse lesion masks and a semi-supervised teacher that provides slice-level pseudo-labels, the framework markedly improves generative realism and downstream detection, yielding substantial AUPRC gains in low-prevalence settings.
Link: https://arxiv.org/abs/2602.23447
Authors: Yifan Li,Mehrdad Salimitari,Taiyu Zhang,Guang Li,David Dreizin
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 figures
Abstract:Detection of rare lesions in whole-body CT is fundamentally limited by extreme class imbalance and low target-to-volume ratios, producing precision collapse despite high AUROC. Synthetic augmentation with diffusion models offers promise, yet pixel-space diffusion is computationally expensive, and existing mask-conditioned approaches lack controllable attribute-level regulation and paired supervision for accountable training. We introduce SALIENT, a mask-conditioned wavelet-domain diffusion framework that synthesizes paired lesion-masking volumes for controllable CT augmentation under long-tail regimes. Instead of denoising in pixel space, SALIENT performs structured diffusion over discrete wavelet coefficients, explicitly separating low-frequency brightness from high-frequency structural detail. Learnable frequency-aware objectives disentangle target and background attributes (structure, contrast, edge fidelity), enabling interpretable and stable optimization. A 3D VAE generates diverse volumetric lesion masks, and a semi-supervised teacher produces paired slice-level pseudo-labels for downstream mask-guided detection. SALIENT improves generative realism, as reflected by higher MS-SSIM (0.63 to 0.83) and lower FID (118.4 to 46.5). In a separate downstream evaluation, SALIENT-augmented training improves long-tail detection performance, yielding disproportionate AUPRC gains across low prevalences and target-to-volume ratios. Optimal synthetic ratios shift from 2x to 4x as labeled seed size decreases, indicating a seed-dependent augmentation regime under low-label conditions. SALIENT demonstrates that frequency-aware diffusion enables controllable, computationally efficient precision rescue in long-tail CT detection.
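The low-/high-frequency split that SALIENT diffuses over can be illustrated with a single-level 1-D Haar transform. This is a toy sketch only, not the paper's 3-D implementation; the signal values are invented for illustration.

```python
import numpy as np

def haar_1d(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    s = np.asarray(signal, dtype=float).reshape(-1, 2)
    approx = (s[:, 0] + s[:, 1]) / np.sqrt(2)   # low-frequency "brightness"
    detail = (s[:, 0] - s[:, 1]) / np.sqrt(2)   # high-frequency "structure"
    return approx, detail

def haar_1d_inverse(approx, detail):
    """Exact inverse of the one-level Haar transform."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

x = np.array([4.0, 4.0, 4.0, 4.0, 0.0, 8.0, 0.0, 8.0])
a, d = haar_1d(x)
# Flat region -> zero detail; oscillating region -> large detail.
assert np.allclose(d[:2], 0)
# The transform is perfectly invertible, so nothing is lost by diffusing
# in coefficient space instead of pixel space.
assert np.allclose(haar_1d_inverse(a, d), x)
```

Diffusing over `approx` and `detail` separately is what lets a model like SALIENT regulate brightness and structural detail with different objectives.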
[CV-113] Analytical Expression for Spherically Symmetric Photoacoustic Sources: A Unified General Solution (Theoretical Analysis and Derivation)
【速读】[Quick Read]: This paper aims to accurately model the spatiotemporal acoustic pressure field generated in photoacoustic imaging (PAI) by sources with spherically symmetric initial pressure distributions. Conventional approaches are often restricted to specific geometries or distribution forms, limiting generality and the accuracy of system design and signal analysis. The key to the solution is deriving, from the fundamental photoacoustic wave equation, a unified analytical solution applicable to arbitrary spherically symmetric initial pressure distributions, together with explicit expressions for several typical cases (uniform spherical, Gaussian, exponential, and power-law distributions) and far-field approximations. This framework improves both generality and accuracy, provides efficient and reliable mathematical tools for simulating and optimizing photoacoustic imaging systems, and the accompanying code for ultrafast forward simulation has been open-sourced.
Link: https://arxiv.org/abs/2602.23375
Authors: Shuang Li,Yibing Wang,Yu Zhang,Changhui Li
Affiliations: Peking University
Subjects: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Here we present a comprehensive derivation of the analytical expression for the spatiotemporal acoustic pressure generated by photoacoustic sources with spherically symmetric initial pressure distributions. Starting from the fundamental photoacoustic wave equation, we derive a unified analytical solution applicable to arbitrary spherically symmetric initial distributions. Specific expressions are provided for several common distributions including uniform spherical sources, Gaussian distributions, exponential distributions, and power-law distributions. Far-field approximations are also discussed. The derived expressions provide valuable tools for photoacoustic imaging system design and signal analysis. We provide codes for ultrafast forward simulation using the general analytical spherically symmetric model; the implementation is available in the GitHub repository: this https URL.
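For context, the best-known special case of such spherically symmetric solutions is the uniform sphere. The classic textbook result (an N-shaped wave, derived e.g. in Diebold's work on photoacoustic sources; quoted here from standard literature rather than from this paper) for an observation point at distance r outside a sphere of radius R with initial pressure p_0 and sound speed v_s is:

```latex
% N-shaped pressure wave outside a uniform sphere (classic result):
p(r,t) =
\begin{cases}
\dfrac{p_0}{2}\,\dfrac{r - v_s t}{r}, & |r - v_s t| \le R, \\[4pt]
0, & \text{otherwise.}
\end{cases}
```

The paper's unified solution reduces to expressions of this form when the general spherically symmetric distribution is specialized to a uniform sphere.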
人工智能 (Artificial Intelligence)
[AI-0] CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
【速读】[Quick Read]: This paper addresses the gap that, despite strong performance on general programming tasks, large language models (LLMs) still lag behind compiler-based systems at generating high-performance CUDA kernels. Prior approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops; neither fundamentally improves a model's intrinsic CUDA optimization ability, so performance gains are limited. The key to the solution is CUDA Agent, a large-scale agentic reinforcement learning system with three core components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling (providing reliable reward signals), and reinforcement learning algorithmic techniques that enable stable training. This lets the model accumulate CUDA kernel optimization experience through autonomous exploration and iterative feedback, ultimately surpassing the strongest existing models on the KernelBench benchmark, with an advantage of about 40% on the hardest Level-3 setting.
Link: https://arxiv.org/abs/2602.24286
Authors: Weinan Dai,Hanlin Wu,Qiying Yu,Huan-ang Gao,Jiahao Li,Chengquan Jiang,Weiqiang Lou,Yufan Song,Hongli Yu,Jiaze Chen,Wei-Ying Ma,Ya-Qin Zhang,Jingjing Liu,Mingxuan Wang,Xin Liu,Hao Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as this http URL for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model’s intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering 100%, 100%, and 92% faster rate over this http URL on KernelBench Level-1, Level-2, and Level-3 splits, outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
[AI-1] Memory Caching: RNNs with Growing Memory
【速读】[Quick Read]: This paper addresses the weak recall of recurrent neural networks (RNNs) in long-sequence modeling caused by their fixed-size memory, which leaves them behind Transformers on recall-intensive tasks. The core solution is Memory Caching (MC), which dynamically expands effective memory capacity by caching checkpoints of the RNN's hidden states, emulating the Transformer's memory that grows with sequence length (O(L^2) complexity) while retaining linear O(L) complexity. Through four variants, including gated aggregation and sparse selective mechanisms, MC effectively enhances both linear and deep memory modules, markedly improving RNN performance on language modeling and long-context understanding, and approaching Transformer accuracy on in-context recall tasks, narrowing the gap with the strongest models.
Link: https://arxiv.org/abs/2602.24281
Authors: Ali Behrouz,Zeman Li,Yuan Deng,Peilin Zhong,Meisam Razaviyayn,Vahab Mirrokni
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Transformers have been established as the de-facto backbones for most recent advances in sequence modeling, mainly due to their growing memory capacity that scales with the context length. While plausible for retrieval tasks, it causes quadratic complexity and so has motivated recent studies to explore viable subquadratic recurrent alternatives. Despite showing promising preliminary results in diverse domains, such recurrent architectures underperform Transformers in recall-intensive tasks, often attributed to their fixed-size memory. In this paper, we introduce Memory Caching (MC), a simple yet effective technique that enhances recurrent models by caching checkpoints of their memory states (a.k.a. hidden states). Memory Caching allows the effective memory capacity of RNNs to grow with sequence length, offering a flexible trade-off that interpolates between the fixed memory (i.e., O(L) complexity) of RNNs and the growing memory (i.e., O(L^2) complexity) of Transformers. We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules. Our experimental results on language modeling and long-context understanding tasks show that MC enhances the performance of recurrent models, supporting its effectiveness. The results of in-context recall tasks indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and perform better than state-of-the-art recurrent models.
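The core idea, snapshotting hidden states at intervals and reading over the growing cache, can be sketched as follows. This is an illustrative toy, not the paper's architecture: the recurrent cell, cache interval, and attention read-out below are all invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, cache_every = 8, 4

def rnn_step(h, x):
    """A toy recurrent update standing in for any RNN cell."""
    return np.tanh(0.5 * h + 0.5 * x)

cache = []        # checkpoints of past hidden states (grows with length)
h = np.zeros(d)
for t, x in enumerate(rng.normal(size=(16, d))):
    h = rnn_step(h, x)
    if (t + 1) % cache_every == 0:
        cache.append(h.copy())   # snapshot the memory state

# Read-out: softmax attention over the cached states plus the live state,
# giving memory that grows with L/cache_every rather than staying fixed.
states = np.stack(cache + [h])
q = rng.normal(size=d)
scores = states @ q
weights = np.exp(scores - scores.max())
weights /= weights.sum()
readout = weights @ states

print(len(cache))        # 4 checkpoints for a length-16 sequence
assert readout.shape == (d,)
```

Varying `cache_every` is the interpolation knob the abstract mentions: `cache_every = ∞` recovers a plain fixed-memory RNN, while `cache_every = 1` approaches the per-token growing memory of a Transformer.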
[AI-2] A Minimal Agent for Automated Theorem Proving
【速读】:该论文旨在解决当前基于AI的定理证明系统(AI-based theorem prover)架构多样性导致难以进行系统性比较的问题。为实现这一目标,作者提出了一种极简的智能体基线(minimal agentic baseline),其关键在于统一整合了当前最先进系统共有的核心功能:迭代式证明精炼(iterative proof refinement)、库搜索(library search)和上下文管理(context management)。该设计通过简化架构实现了与主流方法相当甚至更优的性能,同时显著提升了样本效率和成本效益,从而为未来研究提供了一个可复现、易访问的基准参考。
链接: https://arxiv.org/abs/2602.24273
作者: Borja Requena Pozo,Austin Letson,Krystian Nowakowski,Izan Beltran Ferreiro,Leopoldo Sarra
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
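The iterative-refinement core of such an agent is essentially a propose-check-feedback loop. The sketch below is a hypothetical stand-in, not the paper's implementation: `propose` and `check_proof` are toy stubs (a real system would call an LLM and a proof checker such as a Lean kernel).

```python
def minimal_prover_agent(goal, check_proof, propose, max_iters=5):
    """Iterative refinement: propose a proof, verify it, and feed the
    verifier's error back into the context for the next attempt."""
    context = []                      # running context-management buffer
    for _ in range(max_iters):
        attempt = propose(goal, context)
        ok, feedback = check_proof(goal, attempt)
        if ok:
            return attempt
        context.append(feedback)      # refine on the verifier's error
    return None                       # budget exhausted

# Toy stand-ins: the "prover" must produce the string "qed"; each failed
# check lets it extend its attempt by one character.
def propose(goal, context):
    return "qed"[: len(context) + 1]

def check_proof(goal, attempt):
    return (attempt == "qed", f"incomplete after {len(attempt)} chars")

assert minimal_prover_agent("lemma", check_proof, propose) == "qed"
```

The same loop with multiple independent single-shot calls (emptying `context` each time) is the baseline the paper's iterative approach is compared against.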
[AI-3] Efficient Discovery of Approximate Causal Abstractions via Neural Mechanism Sparsification
【速读】[Quick Read]: This paper addresses the problem of extracting an interpretable causal abstraction from a pretrained neural network, i.e., finding a simpler, high-level Structural Causal Model (SCM) that remains faithful to the network's behavior under interventions. Traditional methods rely on brute-force interchange interventions or retraining, which is inefficient. The key innovation is to view structured pruning as a search over the space of approximate causal abstractions and to model the trained network as a deterministic SCM, from which an Interventional Risk objective is derived. Its second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighboring units; under a uniform-curvature assumption the score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting efficient procedure extracts sparse, intervention-faithful causal abstractions directly from pretrained models, validated via interchange interventions.
Link: https://arxiv.org/abs/2602.24266
Authors: Amir Asiaee
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Neural networks are hypothesized to implement interpretable causal mechanisms, yet verifying this requires finding a causal abstraction – a simpler, high-level Structural Causal Model (SCM) faithful to the network under interventions. Discovering such abstractions is hard: it typically demands brute-force interchange interventions or retraining. We reframe the problem by viewing structured pruning as a search over approximate abstractions. Treating a trained network as a deterministic SCM, we derive an Interventional Risk objective whose second-order expansion yields closed-form criteria for replacing units with constants or folding them into neighbors. Under uniform curvature, our score reduces to activation variance, recovering variance-based pruning as a special case while clarifying when it fails. The resulting procedure efficiently extracts sparse, intervention-faithful abstractions from pretrained networks, which we validate via interchange interventions.
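The variance-based special case mentioned in the abstract is easy to sketch: score each unit by its activation variance and replace near-constant units by their means (a constant intervention). This is an illustrative sketch under the uniform-curvature assumption, not the paper's full Interventional Risk procedure; the activations and threshold are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations of 5 hidden units over 1000 inputs; unit 2 is nearly constant.
acts = rng.normal(size=(1000, 5)) * np.array([1.0, 0.8, 0.01, 1.2, 0.5])

# Under uniform curvature the interventional-risk score reduces to the
# activation variance: low-variance units are cheap to replace by constants.
scores = acts.var(axis=0)
prune_mask = scores < 0.01          # threshold chosen for illustration

# Replacing a pruned unit by its mean activation = a constant intervention,
# yielding a sparser high-level model of the same computation.
constants = acts.mean(axis=0)
abstracted = np.where(prune_mask, constants, acts)

assert prune_mask.tolist() == [False, False, True, False, False]
assert abstracted.shape == acts.shape
```

The paper's contribution is precisely that this variance score is only valid under uniform curvature; the general criterion weights each unit's variance by local curvature information instead.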
[AI-4] FaultXformer: A Transformer-Encoder Based Fault Classification and Location Identification model in PMU-Integrated Active Electrical Distribution System
【速读】[Quick Read]: This paper addresses the accuracy of fault detection and localization in electrical distribution systems, a challenge amplified by the widespread integration of distributed energy resources (DERs), which significantly increases grid operating complexity. The key to the solution is FaultXformer, a Transformer encoder-based model with a dual-stage pipeline for high-fidelity representation learning: stage one extracts rich temporal features from time-series current data for initial fault-type identification and precise localization across multiple nodes; stage two further distinguishes fault types and pinpoints the fault location. The approach clearly outperforms conventional deep learning baselines (CNN, RNN, and LSTM), achieving 98.76% fault-type classification accuracy and 98.92% fault-location accuracy on an IEEE 13-node test feeder dataset, demonstrating its effectiveness under high DER penetration.
Link: https://arxiv.org/abs/2602.24254
Authors: Kriti Thakur,Alivelu Manga Parimi,Mayukha Pal
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Accurate fault detection and localization in electrical distribution systems is crucial, especially with the increasing integration of distributed energy resources (DERs), which inject greater variability and complexity into grid operations. In this study, FaultXformer is proposed, a Transformer encoder-based architecture developed for automatic fault analysis using real-time current data obtained from phasor measurement units (PMUs). The approach utilizes time-series current data to initially extract rich temporal information in stage 1, which is crucial for identifying the fault type and precisely determining its location across multiple nodes. In stage 2, these extracted features are processed to differentiate among distinct fault types and identify the respective fault location within the distribution system. Thus, this dual-stage transformer encoder pipeline enables high-fidelity representation learning, considerably boosting the performance of the work. The model was validated on a dataset generated from the IEEE 13-node test feeder, simulated with 20 separate fault locations and several DER integration scenarios, utilizing current measurements from four strategically located PMUs. To demonstrate robust performance evaluation, stratified 10-fold cross-validation is performed. FaultXformer achieved average accuracies of 98.76% in fault type classification and 98.92% in fault location identification across cross-validation, consistently surpassing the conventional deep learning baselines, namely the convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM), by 1.70%, 34.95%, and 2.04% in classification accuracy and by 10.82%, 40.89%, and 6.27% in location accuracy, respectively. These results demonstrate the efficacy of the proposed model with significant DER penetration.
[AI-5] SafeGen-LLM: Enhancing Safety Generalization in Task Planning for Robotic Systems
【速读】[Quick Read]: This paper addresses the challenge of safety-critical task planning in robotic systems: classical planners scale poorly, reinforcement learning (RL)-based methods generalize weakly, and base large language models (LLMs) cannot guarantee safety. The key to the solution is SafeGen-LLM, a safety-generalizable large language model built on two core contributions: a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints, and a two-stage post-training framework that first applies supervised fine-tuning (SFT) to learn planning syntax and semantics, then uses fine-grained reward machines derived from formal verification to guide Group Relative Policy Optimization (GRPO), combining safety alignment with curriculum learning. This markedly improves the safety and generalization of task plans across multiple domains and input formats.
Link: https://arxiv.org/abs/2602.24235
Authors: Jialiang Fan,Weizhe Xu,Mengyu Liu,Oleg Sokolsky,Insup Lee,Fangxin Kong
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures
Abstract:Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named SafeGen-LLM. SafeGen-LLM can not only enhance the safety satisfaction of task plans but also generalize well to novel safety properties in various domains. We first construct a multi-domain Planning Domain Definition Language 3 (PDDL3) benchmark with explicit safety constraints. Then, we introduce a two-stage post-training framework: Supervised Fine-Tuning (SFT) on a constraint-compliant planning dataset to learn planning syntax and semantics, and Group Relative Policy Optimization (GRPO) guided by fine-grained reward machines derived from formal verification to enforce safety alignment and by curriculum learning to better handle complex tasks. Extensive experiments show that SafeGen-LLM achieves strong safety generalization and outperforms frontier proprietary baselines across multi-domain planning tasks and multiple input formats (e.g., PDDLs and natural language).
[AI-6] An Efficient Unsupervised Federated Learning Approach for Anomaly Detection in Heterogeneous IoT Networks
【速读】[Quick Read]: This paper addresses how heterogeneous data in Internet of Things (IoT) environments degrades model performance and complicates privacy preservation in federated learning (FL) for anomaly detection. The core difficulty is that differences in device capabilities, data formats, and communication constraints create feature heterogeneity that hinders local training and global model optimization. The key to the solution is an efficient unsupervised federated learning framework that leverages two complementary IoT datasets, one focused on anomaly detection and the other on device identification, extracting shared features while preserving dataset-specific representations to improve anomaly detection accuracy. Explainable AI (XAI) techniques such as SHAP are further employed to quantify the influence of key features on local decisions, improving transparency and interpretability. Experiments show the method significantly outperforms conventional federated learning strategies for anomaly detection in decentralized IoT environments.
Link: https://arxiv.org/abs/2602.24209
Authors: Mohsen Tajgardan,Atena Shiranzaei,Mahdi Rabbani,Reza Khoshkangini,Mahtab Jamali
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Federated learning (FL) is an effective paradigm for distributed environments such as the Internet of Things (IoT), where data from diverse devices with varying functionalities remains localized while contributing to a shared global model. By eliminating the need to transmit raw data, FL inherently preserves privacy. However, the heterogeneous nature of IoT data, stemming from differences in device capabilities, data formats, and communication constraints, poses significant challenges to maintaining both global model performance and privacy. In the context of IoT-based anomaly detection, unsupervised FL offers a promising means to identify abnormal behavior without centralized data aggregation. Nevertheless, feature heterogeneity across devices complicates model training and optimization, hindering effective implementation. In this study we propose an efficient unsupervised FL framework that enhances anomaly detection by leveraging shared features from two distinct IoT datasets: one focused on anomaly detection and the other on device identification, while preserving dataset-specific features. To improve transparency and interpretability, we employ explainable AI techniques, such as SHAP, to identify key features influencing local model decisions. Experiments conducted on real-world IoT datasets demonstrate that the proposed method significantly outperforms conventional FL approaches in anomaly detection accuracy. This work underscores the potential of using shared features from complementary datasets to optimize unsupervised federated learning and achieve superior anomaly detection results in decentralized IoT environments.
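The aggregation step that frameworks like this build on is the standard FedAvg-style weighted average of client parameters: only model updates leave the devices, never raw data. A minimal sketch, assuming three hypothetical IoT clients with made-up parameter vectors and sample counts (not the paper's method):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of client model parameters; raw data stays local,
    only these parameter vectors are communicated to the server."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Three IoT devices with heterogeneous amounts of local data.
w = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
n = [100, 300, 600]

global_w = fed_avg(w, n)
assert np.allclose(global_w, [0.7, 0.9])
```

The paper's contribution sits on top of this loop: it decides *which* features are shared across the heterogeneous clients before such aggregation is meaningful, and uses SHAP to explain the resulting local decisions.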
[AI-7] Resilient Strategies for Stochastic Systems: How Much Does It Take to Break a Winning Strategy? AAMAS2026
【速读】[Quick Read]: This paper addresses how to design decision strategies that are robust under uncertainty, particularly against disturbances that can flip an agent's decisions, for example when an actuator malfunction in the environment prevents the intended action from being executed. The key to the solution is introducing the concept of resilience in the stochastic setting and presenting a comprehensive set of fundamental problems for Markov decision processes (MDPs) with reachability and safety objectives, which extends smoothly to stochastic games. To account for stochasticity, the paper provides several ways of aggregating the disturbances that may occur, for instance in expectation or in the worst case, and uses quantitative measures such as the frequency of occurrence to reason about infinitely many disturbances, enabling a formal characterization and optimization of strategy resilience.
Link: https://arxiv.org/abs/2602.24191
Authors: Kush Grover,Markel Zubia,Debraj Chakraborty,Muqsit Azeem,Nils Jansen,Jan Kretinsky
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear in Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25-29, 2026
Abstract:We study the problem of resilient strategies in the presence of uncertainty. Resilient strategies enable an agent to make decisions that are robust against disturbances. In particular, we are interested in those disturbances that are able to flip a decision made by the agent. Such a disturbance may, for instance, occur when the intended action of the agent cannot be executed due to a malfunction of an actuator in the environment. In this work, we introduce the concept of resilience in the stochastic setting and present a comprehensive set of fundamental problems. Specifically, we discuss such problems for Markov decision processes with reachability and safety objectives, which also smoothly extend to stochastic games. To account for the stochastic setting, we provide various ways of aggregating the amounts of disturbances that may have occurred, for instance, in expectation or in the worst case. Moreover, to reason about infinite disturbances, we use quantitative measures, like their frequency of occurrence.
[AI-8] Learning Flexible Job Shop Scheduling under Limited Buffers and Material Kitting Constraints
【速读】[Quick Read]: This paper addresses practical constraints that are often ignored or idealized in studies of the flexible job shop scheduling problem (FJSP), namely limited buffers and material kitting, which have a notable impact on production efficiency. To overcome the limited capacity of existing deep reinforcement learning (DRL) for modeling complex dependencies and long-term constraints, a heterogeneous graph network is introduced into the DRL framework to model the global state: efficient message passing among machines, operations, and buffers focuses the policy on avoiding decisions that cause frequent pallet changes in long-sequence scheduling, thereby improving buffer utilization and decision quality. Experiments on synthetic and real production-line datasets show the method outperforms traditional heuristics and advanced DRL methods on makespan and pallet changes, while achieving a good balance between solution quality and computational cost.
Link: https://arxiv.org/abs/2602.24180
Authors: Shishun Zhang,Juzhan Xu,Yidan Fan,Chenyang Zhu,Ruizhen Hu,Yongjun Wang,Kai Xu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures, conference
Abstract:The Flexible Job Shop Scheduling Problem (FJSP) originates from real production lines, while some practical constraints are often ignored or idealized in current FJSP studies, among which the limited buffer problem has a particular impact on production efficiency. To this end, we study an extended problem that is closer to practical scenarios–the Flexible Job Shop Scheduling Problem with Limited Buffers and Material Kitting. In recent years, deep reinforcement learning (DRL) has demonstrated considerable potential in scheduling tasks. However, its capacity for state modeling remains limited when handling complex dependencies and long-term constraints. To address this, we leverage a heterogeneous graph network within the DRL framework to model the global state. By constructing efficient message passing among machines, operations, and buffers, the network focuses on avoiding decisions that may cause frequent pallet changes during long-sequence scheduling, thereby helping improve buffer utilization and overall decision quality. Experimental results on both synthetic and real production line datasets show that the proposed method outperforms traditional heuristics and advanced DRL methods in terms of makespan and pallet changes, and also achieves a good balance between solution quality and computational cost. Furthermore, a supplementary video is provided to showcase a simulation system that effectively visualizes the progression of the production line.
[AI-9] LemmaBench: A Live Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
【速读】[Quick Read]: This paper addresses the lack of dynamic, genuinely research-level benchmarks for evaluating the mathematical capabilities of large language models (LLMs): existing benchmarks rely on static contest- or textbook-style problems that do not reflect the complexity and frontier of actual mathematical research. The key to the solution is an updatable automatic benchmarking pipeline that extracts lemmas from arXiv papers and rewrites them into self-contained statements by making all assumptions and definitions explicit, yielding a regularly refreshed problem set drawn directly from human mathematical research. This keeps the evaluation current and realistic, while past instances can be used for training without compromising the fairness of future evaluations.
Link: https://arxiv.org/abs/2602.24173
Authors: Antoine Peyronnet,Fabian Gloeckle,Amaury Hayat
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures, 5 tables
Abstract:We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics. This consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. It results in a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15% accuracy in theorem proving (pass@1) depending on the model, showing that there is currently a large margin of progression for LLMs to reach human-level proving capabilities in a research context.
[AI-10] Artificial Agency Program: Curiosity compression and communication in agents
【速读】[Quick Read]: This paper addresses the lack of autonomy and environmental adaptability in current AI systems, in particular how to build agents that keep learning through intrinsic motivation and interact effectively under physical and computational resource constraints. The key to the solution is the Artificial Agency Program (AAP), which treats AI as reality-embedded, resource-bounded agents driven by curiosity-as-learning-progress, and unifies multiple strands of theory through selective information bottlenecks: predictive compression, intrinsic motivation, empowerment and control, interface quality (unification), and language/self-communication. The result is a falsifiable experimental framework in which an agent allocates a limited budget among observation, action, and deliberation, aiming to increase sensing, understanding, and actuation capability while reducing friction at the interface between people, tools, and environments.
Link: https://arxiv.org/abs/2602.24100
Authors: Richard Csaky
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This is a working draft. Feedback and criticism are most welcome
Abstract:This paper presents the Artificial Agency Program (AAP), a position and research agenda for building AI systems as reality embedded, resource-bounded agents whose development is driven by curiosity-as-learning-progress under physical and computational constraints. The central thesis is that AI is most useful when treated as part of an extended human–tool system that increases sensing, understanding, and actuation capability while reducing friction at the interface between people, tools, and environments. The agenda unifies predictive compression, intrinsic motivation, empowerment and control, interface quality (unification), and language/self-communication as selective information bottlenecks. We formulate these ideas as a falsifiable program with explicit costs, staged experiments, and a concrete multimodal tokenized testbed in which an agent allocates limited budget among observation, action, and deliberation. The aim is to provide a conceptual and experimental framework that connects intrinsic motivation, information theory, thermodynamics, bounded rationality, and modern reasoning systems.
[AI-11] Bi-level RL-Heuristic Optimization for Real-world Winter Road Maintenance
【速读】[Quick Read]: This paper addresses large-scale route optimization for winter road maintenance, where existing practice depends on manual decision-making and struggles with complex road networks, leading to unbalanced resource allocation, long travel times, and high carbon emissions. The key to the solution is a scalable bi-level optimization framework: at the upper level, a reinforcement learning (RL) agent partitions the road network into manageable clusters and optimally allocates resources from multiple depots; at the lower level, a multi-objective vehicle routing problem (VRP) is solved within each cluster to minimize the maximum vehicle travel time and total carbon emissions. The method explicitly incorporates vehicle-specific characteristics, depot capacities, and road segment constraints, and is validated on real UK strategic road networks (e.g., M25, M6, and A1), achieving balanced workloads, travel times below the targeted two-hour threshold, lower emissions, and substantial cost savings.
Link: https://arxiv.org/abs/2602.24097
Authors: Yue Xie,Zizhen Xu,William Beazley,Fumiya Iida
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Winter road maintenance is critical for ensuring public safety and reducing environmental impacts, yet existing methods struggle to manage large-scale routing problems effectively and mostly rely on human decisions. This study presents a novel, scalable bi-level optimization framework, validated on real operational data on UK strategic road networks (M25, M6, A1), including interconnected local road networks in surrounding areas for vehicle traversing, as part of the highway operator’s efforts to solve existing planning challenges. At the upper level, a reinforcement learning (RL) agent strategically partitions the road network into manageable clusters and optimally allocates resources from multiple depots. At the lower level, a multi-objective vehicle routing problem (VRP) is solved within each cluster, minimizing the maximum vehicle travel time and total carbon emissions. Unlike existing approaches, our method handles large-scale, real-world networks efficiently, explicitly incorporating vehicle-specific constraints, depot capacities, and road segment requirements. Results demonstrate significant improvements, including balanced workloads, reduced maximum travel times below the targeted two-hour threshold, lower emissions, and substantial cost savings. This study illustrates how advanced AI-driven bi-level optimization can directly enhance operational decision-making in real-world transportation and logistics.
[AI-12] Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning
【速读】:该论文旨在解决稀疏奖励强化学习中因固定内在奖励缩放系数导致探索效率低、训练不稳定的问题。传统方法依赖人工调参的标量系数,难以在不同任务间保持性能一致性;而本文提出的ACWI(Adaptive Correlation Weighted Intrinsic)框架通过在线学习状态相关的缩放系数来实现自适应内在奖励调节。其核心创新在于引入一个轻量级Beta网络,基于编码器结构从智能体状态直接预测内在奖励权重,并采用基于相关性的优化目标,使加权后的内在奖励与未来外在回报的折扣和对齐,从而在不增加显著计算开销的前提下,提升样本效率和训练稳定性。
链接: https://arxiv.org/abs/2602.24081
作者: Viet Bac Nguyen,Phuong Thai Nguyen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose ACWI (Adaptive Correlation Weighted Intrinsic), an adaptive intrinsic reward scaling framework designed to dynamically balance intrinsic and extrinsic rewards for improved exploration in sparse reward reinforcement learning. Unlike conventional approaches that rely on manually tuned scalar coefficients, which often result in unstable or suboptimal performance across tasks, ACWI learns a state-dependent scaling coefficient online. Specifically, ACWI introduces a lightweight Beta Network that predicts the intrinsic reward weight directly from the agent state through an encoder-based architecture. The scaling mechanism is optimized using a correlation-based objective that encourages alignment between the weighted intrinsic rewards and discounted future extrinsic returns. This formulation enables task-adaptive exploration incentives while preserving computational efficiency and training stability. We evaluate ACWI on a suite of sparse reward environments in MiniGrid. Experimental results demonstrate that ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines, achieving superior performance with minimal computational overhead.
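为便于理解摘要中"相关性目标"的含义,下面给出一个最小化示意:计算逐状态加权后的内在奖励与未来外在折扣回报之间的皮尔逊相关系数。函数名、超参数与实现细节均为本文假设,并非论文原始代码:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted sum of future extrinsic rewards at each timestep."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def correlation_objective(beta, intrinsic, extrinsic, gamma=0.99):
    """Pearson correlation between the state-weighted intrinsic rewards
    (beta[t] * r_int[t]) and the discounted extrinsic returns; a Beta
    network would be trained to maximize this quantity."""
    weighted = beta * intrinsic
    target = discounted_returns(extrinsic, gamma)
    w = weighted - weighted.mean()
    g = target - target.mean()
    denom = np.sqrt((w ** 2).sum() * (g ** 2).sum()) + 1e-8
    return float((w * g).sum() / denom)
```

实际训练中,Beta 网络的输出 beta 会随状态变化,智能体以形如 r_ext + beta * r_int 的总奖励进行学习,并以最大化上述相关性为辅助目标更新 beta。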
[AI-13] Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction ICLR2026
【速读】:该论文旨在解决当前语音到语音(Speech-to-Speech, S2S)对话系统在人类相似性(human-likeness)方面评估不足的问题,即现有系统是否能像人类一样自然、真实地进行对话尚无明确答案。为应对这一挑战,作者首次对9个最先进的S2S系统进行了图灵测试,收集了2,968条人类判断结果,发现所有被测系统均未通过测试,揭示了显著的人类相似性差距。解决方案的关键在于:首先构建了一个包含18个细粒度维度的人类相似性分类体系,并对收集的对话进行众包标注;其次识别出瓶颈并非语义理解能力不足,而是源于副语言特征(paralinguistic features)、情感表达力(emotional expressivity)和对话人格一致性(conversational persona)等非语义层面;最后提出一种可解释模型,基于上述细粒度评分实现高精度且透明的人类与机器对话区分,从而为自动化的S2S系统人类相似性评估提供了新工具。
链接: https://arxiv.org/abs/2602.24080
作者: Xiang Li,Jiabao Gao,Sipei Lin,Xuan Zhou,Chi Zhang,Bo Cheng,Jiale Han,Benyou Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted by ICLR 2026 Conference
Abstract:The pursuit of human-like conversational agents has long been guided by the Turing test. For modern speech-to-speech (S2S) systems, a critical yet unanswered question is whether they can converse like humans. To tackle this, we conduct the first Turing test for S2S systems, collecting 2,968 human judgments on dialogues between 9 state-of-the-art S2S systems and 28 human participants. Our results deliver a clear finding: no existing evaluated S2S system passes the test, revealing a significant gap in human-likeness. To diagnose this failure, we develop a fine-grained taxonomy of 18 human-likeness dimensions and crowd-annotate our collected dialogues accordingly. Our analysis shows that the bottleneck is not semantic understanding but stems from paralinguistic features, emotional expressivity, and conversational persona. Furthermore, we find that off-the-shelf AI models perform unreliably as Turing test judges. In response, we propose an interpretable model that leverages the fine-grained human-likeness ratings and delivers accurate and transparent human-vs-machine discrimination, offering a powerful tool for automatic human-likeness evaluation. Our work establishes the first human-likeness evaluation for S2S systems and moves beyond binary outcomes to enable detailed diagnostic insights, paving the way for human-like improvements in conversational AI systems.
[AI-14] CIRCLE: A Framework for Evaluating AI from a Real-World Lens
【速读】:该论文旨在解决模型中心性能指标与人工智能(AI)在实际部署中所产生的真实结果之间的“现实差距”问题,即现有评估体系难以提供决策者所需关于AI技术在真实用户变异性及约束条件下行为的系统性证据。解决方案的关键在于提出CIRCLE框架——一个六阶段生命周期方法论,其核心是通过形式化将栈外利益相关者的关切转化为可测量信号,从而将情境敏感的定性洞察与可扩展的定量指标相连接;该框架整合了实地测试、红队演练和纵向研究等方法,形成协调一致的流程,产出既具跨场景可比性又保有本地情境敏感性的系统性知识,使治理依据转向基于实际下游影响而非理论能力。
链接: https://arxiv.org/abs/2602.24055
作者: Reva Schwartz,Carina Westling,Morgan Briggs,Marzieh Fadaee,Isar Nejadgholi,Matthew Holmes,Fariza Rashid,Maya Carlyle,Afaf Taïk,Kyra Wilson,Peter Douglas,Theodora Skeadas,Gabriella Waters,Rumman Chowdhury,Thiago Lacerda
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted at Intelligent Systems Conference (IntelliSys) 2026
Abstract:This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI’s materialized outcomes in deployment. While existing frameworks like MLOps focus on system stability and benchmarks measure abstract capabilities, decision-makers outside the AI stack lack systematic evidence about the behavior of AI technologies under real-world user variability and constraints. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This can enable governance based on materialized downstream effects rather than theoretical capabilities.
[AI-15] Portfolio Reinforcement Learning with Scenario-Context Rollout
【速读】:该论文旨在解决市场 regime shifts 引发的分布偏移(distribution shift)对投资组合再平衡策略性能的负面影响,尤其是在压力事件下,传统方法难以生成合理的多变量收益情景。其解决方案的关键在于提出宏观条件下的场景-上下文滚动(macro-conditioned scenario-context rollout, SCR),通过引入反事实下一状态来修正时序差分学习中的奖励—转移不一致问题,从而稳定强化学习(Reinforcement Learning, RL)评论家(critic)训练过程,并实现更优的偏差-方差权衡。该方法在31个不同的美国股票和ETF投资组合中显著提升夏普比率(最高达76%)并降低最大回撤(最高达53%)。
链接: https://arxiv.org/abs/2602.24037
作者: Vanya Priscillia Bendatu,Yao Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Market regime shifts induce distribution shifts that can degrade the performance of portfolio rebalancing policies. We propose macro-conditioned scenario-context rollout (SCR) that generates plausible next-day multivariate return scenarios under stress events. However, doing so faces new challenges, as history will never tell what would have happened differently. As a result, incorporating scenario-based rewards from rollouts introduces a reward–transition mismatch in temporal-difference learning, destabilizing RL critic training. We analyze this inconsistency and show it leads to a mixed evaluation target. Guided by this analysis, we construct a counterfactual next state using the rollout-implied continuations and augment the critic agent’s bootstrap target. Doing so stabilizes the learning and provides a viable bias-variance tradeoff. In out-of-sample evaluations across 31 distinct universes of U.S. equity and ETF portfolios, our method improves Sharpe ratio by up to 76% and reduces maximum drawdown by up to 53% compared with classic and RL-based portfolio rebalancing baselines.
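摘要中"用反事实下一状态增广 critic 的 bootstrap 目标"可以用如下示意代码说明:将基于历史转移的标准 TD 目标与基于情景滚动的反事实目标按权重混合。其中混合权重 lam 及各变量名均为本文假设,并非论文记号:

```python
def augmented_td_target(r, v_next, v_cf, scenario_reward, gamma=0.99, lam=0.5):
    """Blend the historical bootstrap target with a counterfactual one
    built from a rollout-implied next state, mitigating the
    reward-transition mismatch described in the abstract."""
    target_hist = r + gamma * v_next            # standard TD(0) target
    target_cf = scenario_reward + gamma * v_cf  # counterfactual continuation
    return (1 - lam) * target_hist + lam * target_cf
```

lam=0 时退化为普通 TD 学习,lam=1 时完全使用情景滚动目标;论文的分析正是围绕如何在二者之间取得偏差-方差权衡展开。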
[AI-16] Foundation World Models for Agents that Learn Verify and Adapt Reliably Beyond Static Environments AAMAS2026
【速读】:该论文旨在解决当前自主代理在开放世界中面临的核心挑战:如何在动态变化的环境中实现高效学习、可靠行动与行为自适应,而传统方法通常假设任务和环境固定且缺乏新颖性,限制了世界模型对策略演化支持的能力。解决方案的关键在于提出“基础世界模型”(foundation world models)这一新范式,其核心是构建持久且可组合的表示体系,统一强化学习、反应式/程序合成与抽象机制,并围绕四大组件展开:(i) 从规范中学习可优化的奖励模型;(ii) 将形式化验证嵌入学习全过程以保障可靠性;(iii) 在线抽象校准以量化预测可信度;(iv) 基于验证器引导的运行时合成与世界模型生成。这些机制共同使代理能够合成可验证程序、从少量交互中推导新策略,并在面对环境变化时保持正确性,从而为学习、推理与适应提供统一基础。
链接: https://arxiv.org/abs/2602.23997
作者: Florent Delgrange
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: AAMAS 2026, Blue Sky Idea Track. 4 pages, 1 Figure
Abstract:The next generation of autonomous agents must not only learn efficiently but also act reliably and adapt their behavior in open worlds. Standard approaches typically assume fixed tasks and environments with little or no novelty, which limits world models’ ability to support agents that must evolve their policies as conditions change. This paper outlines a vision for foundation world models: persistent, compositional representations that unify reinforcement learning, reactive/program synthesis, and abstraction mechanisms. We propose an agenda built around four components: (i) learnable reward models from specifications to support optimization with clear objectives; (ii) adaptive formal verification integrated throughout learning; (iii) online abstraction calibration to quantify the reliability of the model’s predictions; and (iv) test-time synthesis and world-model generation guided by verifiers. Together, these components enable agents to synthesize verifiable programs, derive new policies from a small number of interactions, and maintain correctness while adapting to novelty. The resulting framework positions foundation world models as a substrate for learning, reasoning, and adaptation, laying the groundwork for agents that not only act well but can explain and justify the behavior they adopt.
[AI-17] Intrinsic Lorentz Neural Network ICLR2026
【速读】:该论文旨在解决现有超球面神经网络(Hyperbolic Neural Networks)架构中普遍存在的“部分内在性”问题,即混合使用欧几里得(Euclidean)运算与超球面运算,或依赖外在参数化(extrinsic parameterizations),从而导致模型无法完全遵循双曲空间的几何特性。解决方案的关键在于提出一种全内在的洛伦兹神经网络(Intrinsic Lorentz Neural Network, ILNN),其所有计算均在洛伦兹模型(Lorentz model)内完成;核心创新包括:一个基于点到超平面的闭式超球面距离的全内在全连接层(point-to-hyperplane fully connected layer),以及一系列配套的内在模块,如耦合旋子中心化与旋子缩放的旋子批归一化(GyroLBN)、旋子加性偏置、洛伦兹补丁拼接操作和洛伦兹 dropout 层,确保决策函数严格尊重双曲空间的曲率特性,从而显著提升性能并降低训练成本。
链接: https://arxiv.org/abs/2602.23981
作者: Xianglong Shi,Ziheng Chen,Yunhan Jiang,Nicu Sebe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in ICLR 2026
Abstract:Real-world data frequently exhibit latent hierarchical structures, which can be naturally represented by hyperbolic geometry. Although recent hyperbolic neural networks have demonstrated promising results, many existing architectures remain partially intrinsic, mixing Euclidean operations with hyperbolic ones or relying on extrinsic parameterizations. To address this, we propose the Intrinsic Lorentz Neural Network (ILNN), a fully intrinsic hyperbolic architecture that conducts all computations within the Lorentz model. At its core, the network introduces a novel point-to-hyperplane fully connected layer (FC), replacing traditional Euclidean affine logits with closed-form hyperbolic distances from features to learned Lorentz hyperplanes, thereby ensuring that the resulting geometric decision functions respect the inherent curvature. Around this fundamental layer, we design intrinsic modules: GyroLBN, a Lorentz batch normalization that couples gyro-centering with gyro-scaling, consistently outperforming both LBN and GyroBN while reducing training time. We additionally propose a gyro-additive bias for the FC output, a Lorentz patch-concatenation operator that aligns the expected log-radius across feature blocks via a digamma-based scale, and a Lorentz dropout layer. Extensive experiments conducted on CIFAR-10/100 and two genomic benchmarks (TEB and GUE) illustrate that ILNN achieves state-of-the-art performance and computational cost among hyperbolic models and consistently surpasses strong Euclidean baselines. The code is available at this https URL.
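摘要中的"点到超平面全连接层"建立在洛伦兹(闵可夫斯基)内积与双曲距离之上。下面按双曲网络文献中常见的一种闭式公式给出示意(论文实际采用的参数化与公式细节以原文为准):

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentz (Minkowski) inner product: -x0*y0 + <x_1:, y_1:>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_hyperboloid(v):
    """Lift a Euclidean vector to the unit hyperboloid
    (x0 = sqrt(1 + ||v||^2)), so that <x, x>_L = -1."""
    return np.concatenate(([np.sqrt(1.0 + np.dot(v, v))], v))

def point_to_hyperplane_dist(x, z):
    """One common closed form for the hyperbolic distance from point x
    to a Lorentz hyperplane with spacelike normal z:
    |arcsinh(<z, x>_L / sqrt(<z, z>_L))|. An ILNN-style FC layer would
    use such distances as logits instead of Euclidean affine scores."""
    return abs(np.arcsinh(lorentz_inner(z, x) / np.sqrt(lorentz_inner(z, z))))
```

这一设计的直觉是:决策边界本身是双曲空间中的测地超平面,logit 为特征点到边界的双曲距离,从而使决策函数天然尊重空间曲率。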
[AI-18] Pessimistic Auxiliary Policy for Offline Reinforcement Learning
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning)中因采样分布外(out-of-distribution)动作而导致的近似误差累积与显著高估问题。解决方案的关键在于构建一种新的悲观辅助策略(pessimistic auxiliary policy),通过最大化Q函数的下置信界(lower confidence bound)来实现对可靠动作的采样,从而在已学习策略邻域内获得较高价值且低不确定性的动作,减少因采样高估动作引入的近似误差,有效缓解误差累积,提升现有离线强化学习方法的性能。
链接: https://arxiv.org/abs/2602.23974
作者: Fan Zhang,Baoru Huang,Xin Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop a pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits a relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially high errors during the learning process. The smaller approximation error introduced by actions sampled from the pessimistic auxiliary strategy alleviates error accumulation. Extensive experiments on offline reinforcement learning benchmarks reveal that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
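摘要中"最大化 Q 函数下置信界"的动作选择,常见做法是用 Q 集成的均值减去若干倍标准差作为悲观估计。下面是一个最小化示意(k 为假设的悲观系数,候选动作假定采样自学习策略邻域):

```python
import numpy as np

def lcb_action(q_ensemble, candidate_actions, k=1.0):
    """Pick the candidate action maximizing a lower confidence bound
    mean(Q) - k * std(Q) over an ensemble of Q estimates. Ensemble
    disagreement stands in for the paper's uncertainty estimate."""
    scores = []
    for a in candidate_actions:
        qs = np.array([q(a) for q in q_ensemble])
        scores.append(qs.mean() - k * qs.std())
    return candidate_actions[int(np.argmax(scores))]
```

当集成成员对某动作的估值分歧很大时(高不确定性,往往对应分布外动作),其下置信界被压低,从而避免学习策略采样到被高估的动作。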
[AI-19] SHINE: Sequential Hierarchical Integration Network for EEG and MEG
【速读】:该论文旨在解决自然语音在大脑中的表征问题,特别是如何从脑磁图(MEG)信号中准确重建语音-静默序列(speech-silence sequences),以揭示皮层对语音的解码机制。其解决方案的关键在于提出了一种名为“顺序分层整合网络”(Sequential Hierarchical Integration Network for EEG and MEG, SHINE)的深度学习架构,该模型能够从长达50小时的单个参与者MEG数据中高效提取与语音活动相关的神经响应特征;同时,在扩展赛道中引入语音包络(speech envelope)和梅尔频谱图(Mel spectrogram)的辅助重建任务,通过多任务学习增强训练效果,最终在LibriBrain Competition 2025的测试集上实现了F1-macro分数达0.9184的优异性能。
链接: https://arxiv.org/abs/2602.23960
作者: Xiran Xu,Yujie Yan,Xihong Wu,Jing Chen
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: ranked second at LibriBrain Competition 2025 this https URL
Abstract:How natural speech is represented in the brain constitutes a major challenge for cognitive neuroscience, with cortical envelope-following responses playing a central role in speech decoding. This paper presents our approach to the Speech Detection task in the LibriBrain Competition 2025, utilizing over 50 hours of magnetoencephalography (MEG) signals from a single participant listening to LibriVox audiobooks. We introduce the proposed Sequential Hierarchical Integration Network for EEG and MEG (SHINE) to reconstruct the binary speech-silence sequences from MEG signals. In the Extended Track, we further incorporated auxiliary reconstructions of speech envelopes and Mel spectrograms to enhance training. Ensemble methods combining SHINE with baselines (BrainMagic, AWavNet, ConvConcatNet) achieved F1-macro scores of 0.9155 (Standard Track) and 0.9184 (Extended Track) on the leaderboard test set.
[AI-20] Hierarchical Concept-based Interpretable Models ICLR2026
【速读】:该论文旨在解决现代深度神经网络因潜在表示不透明而导致的可解释性难题,尤其是现有概念嵌入模型(Concept Embedding Models, CEMs)无法有效建模概念间关系且依赖多粒度概念标注的问题。其核心解决方案是提出分层概念嵌入模型(Hierarchical Concept Embedding Models, HiCEMs),通过显式构建层次结构来建模概念之间的层级关系;同时引入概念分割(Concept Splitting)方法,在无需额外标注的情况下,从预训练CEM的嵌入空间中自动发现细粒度子概念,从而显著降低标注成本并支持测试时在不同粒度上进行概念干预,提升任务准确性。
链接: https://arxiv.org/abs/2602.23947
作者: Oscar Hill,Mateo Espinosa Zarlenga,Mateja Jamnik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026
Abstract:Modern deep neural networks remain challenging to interpret due to the opacity of their latent representations, impeding model understanding, debugging, and debiasing. Concept Embedding Models (CEMs) address this by mapping inputs to human-interpretable concept representations from which tasks can be predicted. Yet, CEMs fail to represent inter-concept relationships and require concept annotations at different granularities during training, limiting their applicability. In this paper, we introduce Hierarchical Concept Embedding Models (HiCEMs), a new family of CEMs that explicitly model concept relationships through hierarchical structures. To enable HiCEMs in real-world settings, we propose Concept Splitting, a method for automatically discovering finer-grained sub-concepts from a pretrained CEM’s embedding space without requiring additional annotations. This allows HiCEMs to generate fine-grained explanations from limited concept labels, reducing annotation burdens. Our evaluation across multiple datasets, including a user study and experiments on PseudoKitchens, a newly proposed concept-based dataset of 3D kitchen renders, demonstrates that (1) Concept Splitting discovers human-interpretable sub-concepts absent during training that can be used to train highly accurate HiCEMs, and (2) HiCEMs enable powerful test-time concept interventions at different granularities, leading to improved task accuracy.
[AI-21] Green or Fast? Learning to Balance Cold Starts and Idle Carbon in Serverless Computing
【速读】:该论文旨在解决无服务器计算(Serverless Computing)环境中服务延迟与碳排放之间的权衡问题:保留暖实例可降低冷启动延迟,但会增加闲置资源的碳排放;而回收闲置资源虽能减少碳排放,却可能加剧冷启动现象。其解决方案的关键在于提出LACE-RL框架,该框架将函数实例保留策略建模为一个序列决策问题,并采用深度强化学习动态调整保持活跃(keep-alive)时长,同时联合建模冷启动概率、函数特定的延迟成本和实时电网碳强度,从而在复杂的时间变化负载与碳强度条件下实现延迟感知且碳高效的资源管理。
链接: https://arxiv.org/abs/2602.23935
作者: Bowen Sun,Christos D. Antonopoulos,Evgenia Smirni,Bin Ren,Nikolaos Bellas,Spyros Lalis
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Serverless computing simplifies cloud deployment but introduces new challenges in managing service latency and carbon emissions. Reducing cold-start latency requires retaining warm function instances, while minimizing carbon emissions favors reclaiming idle resources. This balance is further complicated by time-varying grid carbon intensity and varying workload patterns, under which static keep-alive policies are inefficient. We present LACE-RL, a latency-aware and carbon-efficient management framework that formulates serverless pod retention as a sequential decision problem. LACE-RL uses deep reinforcement learning to dynamically tune keep-alive durations, jointly modeling cold-start probability, function-specific latency costs, and real-time carbon intensity. Using the Huawei Public Cloud Trace, we show that LACE-RL reduces cold starts by 51.69% and idle keep-alive carbon emissions by 77.08% compared to Huawei’s static policy, while achieving better latency-carbon trade-offs than state-of-the-art heuristic and single-objective baselines, approaching Oracle performance.
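keep-alive 决策本质上是在"回收实例后的冷启动期望代价"与"保留实例的空闲碳排放"之间权衡。下面给出一个假设性的代价函数示意(系数、单位与建模方式均为虚构,仅说明权衡结构,非 LACE-RL 的实际奖励设计):

```python
def keepalive_cost(p_invoke, cold_start_penalty, idle_power_w, keep_alive_s,
                   carbon_g_per_kwh):
    """Illustrative trade-off a keep-alive controller faces: expected
    cold-start cost if the pod is reclaimed now vs. idle carbon emitted
    if it is kept warm for keep_alive_s seconds."""
    # reclaiming now: pay the cold-start penalty if the function is invoked again
    reclaim_cost = p_invoke * cold_start_penalty
    # keeping warm: idle energy (kWh) times current grid carbon intensity
    idle_kwh = idle_power_w * keep_alive_s / 3.6e6
    keep_cost = idle_kwh * carbon_g_per_kwh
    return reclaim_cost, keep_cost
```

LACE-RL 的强化学习控制器可视为在电网碳强度与调用模式随时间变化时,动态调整 keep_alive_s 以优化此类权衡。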
[AI-22] RF-Agent : Automated Reward Function Design via Language Agent Tree Search
【速读】:该论文旨在解决低级控制任务中高效奖励函数设计的难题,特别是现有基于大语言模型(Large Language Models, LLMs)的方法在利用历史反馈信息不足和搜索效率低下方面的问题,导致在复杂控制任务中改进有限。其解决方案的关键在于将LLM视为语言智能体(language agent),并将奖励函数设计建模为一个序列决策过程,通过引入蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来优化奖励函数生成,从而更好地利用历史反馈信息并提升搜索效率,实现更优的奖励函数发现。
链接: https://arxiv.org/abs/2602.23876
作者: Ning Gao,Xiuhui Zhang,Xingyu Jiang,Mukang You,Mohan Zhang,Yue Deng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 9 tables, 11 figures, Project page see this https URL
Abstract:Designing efficient reward functions for low-level control tasks is a challenging problem. Recent research aims to reduce reliance on expert experience by using Large Language Models (LLMs) with task information to generate dense reward functions. These methods typically rely on training results as feedback, iteratively generating new reward functions with greedy or evolutionary algorithms. However, they suffer from poor utilization of historical feedback and inefficient search, resulting in limited improvements in complex control tasks. To address this challenge, we propose RF-Agent, a framework that treats LLMs as language agents and frames reward function design as a sequential decision-making process, enhancing optimization through better contextual reasoning. RF-Agent integrates Monte Carlo Tree Search (MCTS) to manage the reward design and optimization process, leveraging the multi-stage contextual reasoning ability of LLMs. This approach better utilizes historical information and improves search efficiency to identify promising reward functions. Outstanding experimental results in 17 diverse low-level control tasks demonstrate the effectiveness of our method. The source code is available at this https URL.
[AI-23] Exploring Robust Intrusion Detection: A Benchmark Study of Feature Transferability in IoT Botnet Attack Detection
【速读】:该论文旨在解决跨域入侵检测(Cross-domain intrusion detection)中的性能下降问题,即在不同网络环境(如物联网IoT和工业物联网IIoT)下,由于流量特征分布差异导致模型迁移能力受限的问题。其解决方案的关键在于系统性评估三种主流流特征集(Argus、Zeek 和 CICFlowMeter)在多个异构数据集上的可迁移性,并结合SHapley Additive exPlanations(SHAP)分析特征重要性,从而揭示特征表示与分类算法对跨域适应性的显著影响。研究进一步提出基于特征工程优化与算法适配的实用指南,强调通过精心设计特征空间、合理选择分类模型以及引入自适应策略,实现高域内性能与跨域鲁棒性的协同提升。
链接: https://arxiv.org/abs/2602.23874
作者: Alejandro Guerra-Manzanares,Jialin Huang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of the 2026 International Conference on Information Systems Security and Privacy (ICISSP)
Abstract:Cross-domain intrusion detection remains a critical challenge due to significant variability in network traffic characteristics and feature distributions across environments. This study evaluates the transferability of three widely used flow-based feature sets (Argus, Zeek and CICFlowMeter) across four widely used datasets representing heterogeneous IoT and Industrial IoT network conditions. Through extensive experiments, we evaluate in- and cross-domain performance across multiple classification models and analyze feature importance using SHapley Additive exPlanations (SHAP). Our results show that models trained on one domain suffer significant performance degradation when applied to a different target domain, reflecting the sensitivity of IoT intrusion detection systems to distribution shifts. Furthermore, the results evidence that the choice of classification algorithm and feature representations significantly impact transferability. Beyond reporting performance differences and thorough analysis of the transferability of features and feature spaces, we provide practical guidelines for feature engineering to improve robustness under domain variability. Our findings suggest that effective intrusion detection requires both high in-domain performance and resilience to cross-domain variability, achievable through careful feature space design, appropriate algorithm selection and adaptive strategies.
[AI-24] RUMAD: Reinforcement-Unifying Multi-Agent Debate
【速读】:该论文旨在解决多智能体辩论(Multi-Agent Debate, MAD)系统在准确性、共识形成与计算效率之间难以协同优化的问题。现有方法中,静态拓扑结构缺乏对任务复杂度变化的适应性,而基于大语言模型(Large Language Model, LLM)的外部协调机制可能引入特权知识,破坏辩论的中立性。其解决方案的关键在于提出RUMAD(Reinforcement-Unifying Multi-Agent Debate)框架,将动态通信拓扑控制建模为强化学习(Reinforcement Learning, RL)问题:通过内容无关的观测机制捕捉高阶辩论动态,利用多目标奖励函数联合优化解的质量、一致性与效率,并采用PPO训练控制器动态调整通信图中的边权重;同时引入双阈值机制实现对智能体激活和信息可见性的细粒度调控,从而在保持推理准确性的前提下显著降低80%以上的token消耗,并展现出跨任务的零样本泛化能力。
链接: https://arxiv.org/abs/2602.23864
作者: Chao Wang,Han Lin,Huaze Tang,Huijing Lin,Wenbo Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures
Abstract:Multi-agent debate (MAD) systems leverage collective intelligence to enhance reasoning capabilities, yet existing approaches struggle to simultaneously optimize accuracy, consensus formation, and computational efficiency. Static topology methods lack adaptability to task complexity variations, while external LLM-based coordination risks introducing privileged knowledge that compromises debate neutrality. This work presents RUMAD (Reinforcement-Unifying Multi-Agent Debate), a novel framework that formulates dynamic communication topology control in MAD as a reinforcement learning (RL) problem. RUMAD employs a content-agnostic observation scheme that captures high-level debate dynamics while avoiding access to raw agent reasoning content. RUMAD uses a multi-objective reward to model solution quality, cohesion, and efficiency. A PPO-trained controller dynamically adjusts edge weights in the communication graph, while a dual-threshold mechanism enables fine-grained control over both agent activation and information visibility. Experimental evaluation across MMLU, GSM8K, and GPQA benchmarks demonstrates that RUMAD achieves substantial efficiency gains, reducing token costs by over 80%, while still improving reasoning accuracy compared to a single LLM and multiple MAD baselines. Notably, RUMAD trained exclusively on MMLU exhibits robust zero-shot generalization to out-of-domain (OOD) tasks, indicating that the learned communication strategies capture task-independent principles of effective multi-agent coordination. These results establish RUMAD as an efficient and robust approach for deploying multi-agent reasoning applications under practical resource constraints.
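摘要中的"双阈值机制"可理解为对控制器输出的通信图边权做两级门控:激活阈值决定哪些智能体在下一轮发言,可见性阈值决定哪些消息可被读取。以下为示意代码(edge_weights 的语义约定与阈值取值均为本文假设):

```python
import numpy as np

def gate_topology(edge_weights, act_thresh=0.5, vis_thresh=0.3):
    """edge_weights[i, j]: controller weight for messages from agent j
    to agent i. An agent speaks only if some outgoing edge clears
    act_thresh; a message is visible only if its edge clears vis_thresh."""
    speaks = (edge_weights > act_thresh).any(axis=0)   # per source agent j
    visible = edge_weights > vis_thresh                # per (reader, source) edge
    return visible & speaks[np.newaxis, :]
```

低于激活阈值的智能体整轮静默(省下其全部 token 开销),而可见性阈值进一步裁剪剩余消息的传播范围,这正是 RUMAD 大幅降低 token 成本的来源之一。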
[AI-25] MI2DAS: A Multi-Layer Intrusion Detection Framework with Incremental Learning for Securing Industrial IoT Networks
【速读】:该论文旨在解决工业物联网(Industrial Internet of Things, IIoT)环境中因异构设备和动态流量模式导致的网络安全挑战,尤其是传统入侵检测系统在面对新型未知攻击时性能下降的问题。其解决方案的关键在于提出一个多层次的入侵检测框架 MI²DAS,该框架融合了基于异常检测的分层流量聚合、开放集识别(open-set recognition)以区分已知与未知攻击,以及增量学习机制以适应新攻击类型且仅需少量标注数据。通过多层协同设计,该框架实现了对已知攻击的高精度分类、对未知攻击的有效识别,并具备持续适应新威胁的能力,从而提升了IIoT系统的安全性与可扩展性。
链接: https://arxiv.org/abs/2602.23846
作者: Wei Lian,Alejandro Guerra-Manzanares
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of the 2026 International Conference on Information Systems Security and Privacy (ICISSP)
Abstract:The rapid expansion of Industrial IoT (IIoT) systems has amplified security challenges, as heterogeneous devices and dynamic traffic patterns increase exposure to sophisticated and previously unseen cyberattacks. Traditional intrusion detection systems often struggle in such environments due to their reliance on extensive labeled data and limited ability to detect new threats. To address these challenges, we propose MI²DAS, a multi-layer intrusion detection framework that integrates anomaly-based hierarchical traffic pooling, open-set recognition to distinguish between known and unknown attacks, and incremental learning for adapting to novel attack types with minimal labeling. Experiments conducted on the Edge-IIoTset dataset demonstrate strong performance across all layers. In the first layer, GMM achieves superior normal-attack discrimination (accuracy = 0.953, TPR = 1.000). In open-set recognition, GMM attains a recall of 0.813 for known attacks, while LOF achieves 0.882 recall for unknown attacks. For fine-grained classification of known attacks, Random Forest achieves a macro-F1 of 0.941. Finally, the incremental learning module maintains robust performance when incorporating novel attack classes, achieving a macro-F1 of 0.8995. These results showcase MI²DAS as an effective, scalable and adaptive framework for enhancing IIoT security against evolving threats.
[AI-26] Enhancing Continual Learning for Software Vulnerability Prediction: Addressing Catastrophic Forgetting via Hybrid-Confidence-Aware Selective Replay for Temporal LLM Fine-Tuning
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在源代码漏洞检测中因采用随机训练-测试划分而导致性能高估的问题,特别是在时间维度上存在分布偏移(temporal distribution shift)的实际场景下,模型需持续学习并识别未来可能出现的漏洞。其核心挑战在于如何在保持检测准确率的同时,高效应对代码库演进带来的持续性数据变化。解决方案的关键是提出一种混合类感知选择性重放策略(Hybrid Class-Aware Selective Replay, Hybrid-CASR),该方法通过优先保留置信度较低的样本,并在重放缓冲区中维持漏洞函数(VULNERABLE)与修复函数(FIXED)的平衡比例,从而在生物月窗口评估中实现Macro-F1提升至0.667,显著优于仅使用当前窗口训练的基线(0.651),且训练效率提高约17%,同时具备更强的时间后向保持能力(IBR@1 = 0.741)。
链接: https://arxiv.org/abs/2602.23834
作者: Xuhui Dou,Hayretdin Bahsi,Alejandro Guerra-Manzanares
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication in the Proceedings of the 2026 International Conference on Information Systems Security and Privacy (ICISSP)
Abstract:Recent work applies Large Language Models (LLMs) to source-code vulnerability detection, but most evaluations still rely on random train-test splits that ignore time and overestimate real-world performance. In practice, detectors are deployed on evolving code bases and must recognise future vulnerabilities under temporal distribution shift. This paper investigates continual fine-tuning of a decoder-style language model (microsoft/phi-2 with LoRA) on a CVE-linked dataset spanning 2018-2024, organised into bi-monthly windows. We evaluate eight continual learning strategies, including window-only and cumulative training, replay-based baselines and regularisation-based variants. We propose Hybrid Class-Aware Selective Replay (Hybrid-CASR), a confidence-aware replay method for binary vulnerability classification that prioritises uncertain samples while maintaining a balanced ratio of VULNERABLE and FIXED functions in the replay buffer. On bi-monthly forward evaluation Hybrid-CASR achieves a Macro-F1 of 0.667, improving on the window-only baseline (0.651) by 0.016 with statistically significant gains ( p = 0.026 ) and stronger backward retention (IBR@1 of 0.741). Hybrid-CASR also reduces training time per window by about 17 percent compared to the baseline, whereas cumulative training delivers only a minor F1 increase (0.661) at a 15.9-fold computational cost. Overall, the results show that selective replay with class balancing offers a practical accuracy-efficiency trade-off for LLM-based temporal vulnerability detection under continuous temporal drift.
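Hybrid-CASR 的重放缓冲区构造可概括为"低置信度样本优先 + VULNERABLE/FIXED 类别均衡"。以下为按摘要描述的示意实现(非作者原始代码,细节为本文假设):

```python
import numpy as np

def build_replay_buffer(samples, confidences, labels, buffer_size):
    """Confidence-aware selective replay: keep the most uncertain samples
    (prediction confidence closest to 0.5) while enforcing a balanced
    split between FIXED (0) and VULNERABLE (1) functions."""
    labels = np.asarray(labels)
    # higher score = closer to the decision boundary = more uncertain
    uncertainty = -np.abs(np.asarray(confidences) - 0.5)
    buffer = []
    per_class = buffer_size // 2
    for cls in (0, 1):
        idx = np.where(labels == cls)[0]
        ranked = idx[np.argsort(uncertainty[idx])[::-1]]  # most uncertain first
        buffer.extend(int(i) for i in ranked[:per_class])
    return [samples[i] for i in buffer]
```

在逐窗口的持续微调中,该缓冲区与当前窗口数据混合训练,以低于累积训练一个量级的开销缓解时间分布漂移下的灾难性遗忘。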
[AI-27] FedNSAM:Consistency of Local and Global Flatness for Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因多步本地更新和数据异构性导致全局模型陷入尖锐极小值(sharp global minima)的问题,从而降低全局模型的泛化性能。现有方法如Sharpness-Aware Minimization (SAM) 在本地训练中优化局部损失曲面的平坦性,但在高数据异构场景下,局部平坦性并不能保证全局平坦性,使得SAM在FL中的效果受限。论文提出的关键解决方案是引入全局Nesterov动量(global Nesterov momentum)到本地更新过程中,构建一种新型算法FedNSAM,通过该动量作为客户端对全局扰动的局部估计与外推方向,实现全局与局部平坦性的一致性协调。理论分析表明,FedNSAM相比FedSAM具有更紧的收敛界;实验验证了其在CNN和Transformer模型上的优越性能与效率。
链接: https://arxiv.org/abs/2602.23827
作者: Junkang Liu,Fanhua Shang,Yuxuan Tian,Hongying Liu,Yuanyuan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In federated learning (FL), multi-step local updates and data heterogeneity usually lead to sharper global minima, which degrades the performance of the global model. Popular FL algorithms integrate sharpness-aware minimization (SAM) into local training to address this issue. However, in the high data heterogeneity setting, flatness in local training does not imply flatness of the global model. Therefore, minimizing the sharpness of the local loss surfaces on the client data does not enable SAM to improve the generalization ability of the global model in FL. We define the flatness distance to explain this phenomenon. By rethinking SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. FedNSAM uses the global Nesterov momentum as the direction of local estimation of client global perturbations and extrapolation. Theoretically, we prove a tighter convergence bound than FedSAM by Nesterov extrapolation. Empirically, we conduct comprehensive experiments on CNN and Transformer models to verify the superior performance and efficiency of FedNSAM. The code is available at this https URL.
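FedNSAM 的本地更新可理解为"先沿全局 Nesterov 动量外推,再在外推点处做 SAM 上升扰动,最后用扰动点梯度下降"。以下为示意代码(更新顺序、符号约定与超参数均为本文按摘要做出的假设,并非论文精确算法):

```python
import numpy as np

def fednsam_local_step(w, grad_fn, m_global, rho=0.05, lr=0.1, beta=0.9):
    """One illustrative FedNSAM-style local step: Nesterov extrapolation
    along the global momentum, SAM ascent perturbation at the look-ahead
    point, then a descent step with the gradient at the perturbed point."""
    w_look = w - beta * m_global                 # Nesterov extrapolation
    g = grad_fn(w_look)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # SAM ascent direction
    g_sam = grad_fn(w_look + eps)                # gradient at perturbed point
    return w - lr * g_sam
```

与客户端各自做纯本地 SAM 相比,引入共享的全局动量使各客户端的扰动方向与全局损失面的平坦性保持一致,这正是摘要中"协调全局与局部平坦性"的含义。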
[AI-28] Learning to maintain safety through expert demonstrations in settings with unknown constraints: A Q-learning perspective AAMAS2026
【速读】:该论文旨在解决在约束马尔可夫决策过程(Constrained MDP)中,仅通过可观测奖励下的示范轨迹来学习一个既能最大化高回报轨迹概率、又能保证安全性的策略问题。其核心挑战在于:已知任务奖励但未知约束条件(即不可观测的成本),需在保守性与探索高回报潜在不安全路径之间取得平衡。解决方案的关键在于提出一种名为SafeQIL(Safe Q Inverse Constrained Reinforcement Learning)的算法,该方法通过引入“承诺度”(promise)概念量化每个状态-动作对的价值,该价值由任务奖励和状态安全性评估共同决定,从而将逆向学习问题建模为一种安全Q值视角下的优化问题,实现对最有潜力轨迹的概率最大化。
链接: https://arxiv.org/abs/2602.23816
作者: George Papadopoulos,George A. Vouros
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at AAMAS 2026
Abstract:Given a set of trajectories demonstrating the safe execution of a task in a constrained MDP with observable rewards but with unknown constraints and non-observable costs, we aim to find a policy that maximizes the likelihood of demonstrated trajectories, trading off being conservative against significantly increasing the likelihood of high-rewarding trajectories with potentially unsafe steps. With these objectives, we aim to learn a policy that maximizes the probability of the most promising trajectories with respect to the demonstrations. In doing so, we formulate the "promise" of individual state-action pairs in terms of Q values, which depend on task-specific rewards as well as on the assessment of states' safety, mixing expectations in terms of rewards and safety. This entails a safe Q-learning perspective of the inverse learning problem under constraints: the devised Safe Q Inverse Constrained Reinforcement Learning (SafeQIL) algorithm is compared to state-of-the-art inverse constrained reinforcement learning algorithms on a set of challenging benchmark tasks, showing its merits.
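The "promise" of a state-action pair, mixing a task-reward Q value with a safety assessment, might be sketched as follows. The mixing weight `alpha`, the log-safety term, and the softmax policy are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def promise(q_reward, p_safe, alpha=0.5):
    """Score a state-action pair by mixing its task-reward Q value with
    the (estimated) log-probability that the resulting state is safe."""
    return alpha * q_reward + (1 - alpha) * np.log(p_safe + 1e-12)

def softmax_policy(promises, temperature=1.0):
    """Turn promise scores into action probabilities."""
    z = np.asarray(promises) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])       # task-reward Q values for three actions
safe = np.array([0.99, 0.2, 0.95])  # estimated safety of each action
probs = softmax_policy(promise(q, safe))
# the high-reward but likely-unsafe action 1 is down-weighted; action 0 wins
print(probs)
```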
[AI-29] Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在一般函数逼近下的理论局限性问题,特别是现有算法(如PSPI)仅适用于有限或小规模动作空间,且依赖状态层面的镜面下降(mirror descent)与隐式策略诱导机制,难以适配实践中常见的独立参数化策略(standalone policy parameterization)。解决方案的关键在于将镜面下降扩展至参数化策略,并识别出“上下文耦合”(contextual coupling)为核心挑战;通过将镜面下降与自然策略梯度(natural policy gradient)相连接,作者提出了新的分析框架与理论保证,揭示了离线RL与模仿学习(imitation learning)之间令人意外的统一性,从而实现了对大尺度或连续动作空间下参数化策略类的理论拓展。
链接: https://arxiv.org/abs/2602.23811
作者: Xiang Li,Nan Jiang,Yuheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate the theoretical aspects of offline reinforcement learning (RL) under general function approximation. While prior works (e.g., Xie et al., 2021) have established the theoretical foundations of learning a good policy from offline data via pessimism, existing algorithms that are computationally tractable (often in an oracle-efficient sense), such as PSPI, only apply to finite and small action spaces. Moreover, these algorithms rely on state-wise mirror descent and require actors to be implicitly induced from the critic functions, failing to accommodate standalone policy parameterization which is ubiquitous in practice. In this work, we address these limitations and extend the theoretical guarantees to parameterized policy classes over large or continuous action spaces. When extending mirror descent to parameterized policies, we identify contextual coupling as the core difficulty, and show how connecting mirror descent to natural policy gradient leads to novel analyses, guarantees, and algorithmic insights, including a surprising unification between offline RL and imitation learning.
[AI-30] MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在执行遗忘学习(unlearning)时面临的隐私困境,即服务器端参数和客户端遗忘集合(forget set)均不能共享的双重限制。解决方案的关键在于提出一种算法无关的隐私保护多扰动副本遗忘框架(MPU),其核心由两个服务端模块构成:预处理模块通过随机化生成多个扰动且重参数化的模型副本,使客户端可在本地基于私有遗忘集进行遗忘学习而无需访问原始参数;后处理模块则通过逆向重参数化与谐波去噪聚合更新,有效缓解扰动带来的影响,从而在保障隐私的前提下实现接近无噪声基线的遗忘性能。
链接: https://arxiv.org/abs/2602.23798
作者: Tiantong Wang,Xinyu Yan,Tiantong Wu,Yurong Hao,Yong Jiang,Fei Huang,Wei Yang Bryan Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server’s parameters or the client’s forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server’s exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms’ average degradation well below 1% under 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at this https URL.
[AI-31] UPath: Universal Planner Across Topological Heterogeneity For Grid-Based Pathfinding
【速读】:该论文旨在解决基于网格的路径规划搜索算法(如A*)中启发式函数泛化能力不足的问题,即现有基于学习的方法通常假设训练与测试环境来自相同分布(如城市地图或室内地图),导致在分布外任务上性能显著下降,难以满足实际应用中对通用求解器的需求。解决方案的关键在于设计一种通用启发式预测器(universal heuristic predictor):通过一次训练即可在多种完全未见过的任务场景中实现高效且高质量的启发式估计,从而显著降低A*算法的计算开销(最多提升2.2倍效率),同时保持平均路径成本不超过最优解的3%,这是首个在可学习求解器中实现该里程碑效果的工作。
链接: https://arxiv.org/abs/2602.23789
作者: Aleksandr Ananikian,Daniil Drozdov,Konstantin Yakovlev (Saint-Petersburg University)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The performance of search algorithms for grid-based pathfinding, e.g. A*, critically depends on the heuristic function that is used to focus the search. Recent studies have shown that informed heuristics that take the positions/shapes of the obstacles into account can be approximated with deep neural networks. Unfortunately, the existing learning-based approaches mostly rely on the assumption that training and test grid maps are drawn from the same distribution (e.g., city maps, indoor maps, etc.) and perform poorly on out-of-distribution tasks. This naturally limits their application in practice, where a universal solver is often needed that is capable of efficiently handling any problem instance. In this work, we close this gap by designing a universal heuristic predictor: a model trained once, but capable of generalizing across a full spectrum of unseen tasks. Our extensive empirical evaluation shows that the suggested approach reduces the computational effort of A* by up to a factor of 2.2, while still providing solutions within 3% of the optimal cost on average on tasks that are completely different from the ones used for training – a milestone reached for the first time by a learnable solver.
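Since the learned model only replaces the heuristic inside standard A*, the surrounding search is the classical algorithm. A minimal grid A* with a pluggable heuristic (here Manhattan distance standing in for the learned predictor, which is not public) looks like this:

```python
import heapq

def astar(grid, start, goal, heuristic):
    """Grid A* with unit-cost 4-connected moves. `heuristic(cell, goal)`
    is pluggable: Manhattan distance here, a learned predictor in UPath.
    A learned heuristic need not be admissible, which is why such solvers
    report near-optimal rather than provably optimal costs."""
    rows, cols = len(grid), len(grid[0])
    open_set = [(heuristic(start, goal), 0, start)]
    g = {start: 0}
    while open_set:
        f, cost, cell = heapq.heappop(open_set)
        if cell == goal:
            return cost
        if cost > g.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = cost + 1
                if ng < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = ng
                    heapq.heappush(open_set, (ng + heuristic((nr, nc), goal), ng, (nr, nc)))
    return None  # goal unreachable

manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0), manhattan))  # 6: must route around the wall
```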
[AI-32] radeFM: A Generative Foundation Model for Trade-flow and Market Microstructure
【速读】:该论文旨在解决市场微观结构建模中跨资产泛化能力不足的问题,尤其是传统方法依赖于特定资产的校准且难以捕捉复杂市场动态。其解决方案的关键在于提出TradeFM——一个524M参数的生成式Transformer模型,通过设计尺度不变特征(scale-invariant features)和通用分词方案(universal tokenization scheme),将异构、多模态的订单流事件流映射为统一离散序列,从而实现无需资产特异性调参的跨资产学习。该方法在集成确定性市场模拟器后,能生成符合金融收益典型特征(如厚尾、波动聚集等)的合成数据,并在零样本迁移至亚太地区市场时表现出较低的困惑度下降,验证了其对市场微观结构可转移结构的有效捕获。
链接: https://arxiv.org/abs/2602.23784
作者: Maxime Kawawa-Beaudan,Srijan Sood,Kassiani Papasotiriou,Daniel Borrajo,Manuela Veloso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
备注: 29 pages, 17 figures, 6 tables. Preprint
Abstract:Foundation models have transformed domains from language to genomics by learning general-purpose representations from large-scale, heterogeneous data. We introduce TradeFM, a 524M-parameter generative Transformer that brings this paradigm to market microstructure, learning directly from billions of trade events across 9K equities. To enable cross-asset generalization, we develop scale-invariant features and a universal tokenization scheme that map the heterogeneous, multi-modal event stream of order flow into a unified discrete sequence – eliminating asset-specific calibration. Integrated with a deterministic market simulator, TradeFM-generated rollouts reproduce key stylized facts of financial returns, including heavy tails, volatility clustering, and absence of return autocorrelation. Quantitatively, TradeFM achieves 2-3x lower distributional error than Compound Hawkes baselines and generalizes zero-shot to geographically out-of-distribution APAC markets with moderate perplexity degradation. Together, these results suggest that scale-invariant trade representations capture transferable structure in market microstructure, opening a path toward synthetic data generation, stress testing, and learning-based trading agents.
[AI-33] Reasoning-Driven Multimodal LLM for Domain Generalization ICLR2026
【速读】:该论文旨在解决深度学习中的域泛化(Domain Generalization, DG)问题,即如何在训练数据分布与测试数据分布存在差异时仍保持模型的鲁棒性。传统方法主要依赖于视觉特征不变性,而本文提出利用多模态大语言模型(Multimodal Large Language Models, MLLMs)的推理能力,构建从图像到类别标签的推理链(reasoning chains),以提升模型在域偏移下的预测稳定性。解决方案的关键在于提出RD-MLDG框架,其核心创新包括:(i) 多任务交叉训练(Multi-Task Cross-Training, MTCT),引入直接分类路径引导推理监督;(ii) 自对齐推理正则化(Self-Aligned Reasoning Regularization, SARR),通过迭代自标注保留推理链语义丰富性并缓解推理模式不匹配问题,从而在优化效率与语义信息之间取得平衡。实验表明,该方法在标准DomainBed数据集上达到SOTA性能,验证了推理信号作为域泛化补充机制的有效性。
链接: https://arxiv.org/abs/2602.23777
作者: Zhipeng Xu,Zilong Wang,Xinyang Jiang,Dongsheng Li,De Cheng,Nannan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICLR 2026 (Poster)
Abstract:This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derives image categories to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of DomainBed dataset, in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performances, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
[AI-34] Bridging Dynamics Gaps via Diffusion Schrödinger Bridge for Cross-Domain Reinforcement Learning
【速读】:该论文旨在解决跨域强化学习(Cross-Domain Reinforcement Learning, CDRL)中因源域与目标域动力学差异导致的策略迁移难题,尤其在缺乏目标域环境交互和奖励监督的情况下如何实现有效策略学习。其解决方案的关键在于提出BDGxRL框架,利用扩散薛定谔桥(Diffusion Schrödinger Bridge, DSB)将源域转移数据对齐至目标域动力学(由离线演示编码),并引入奖励调制机制基于状态转移估计奖励,确保奖励与目标域动力学的一致性,从而在不访问目标环境或其奖励信号的前提下,在源域内完成面向目标域的策略学习。
链接: https://arxiv.org/abs/2602.23737
作者: Hanping Zhang,Yuhong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Cross-domain reinforcement learning (RL) aims to learn transferable policies under dynamics shifts between source and target domains. A key challenge lies in the lack of target-domain environment interaction and reward supervision, which prevents direct policy learning. To address this challenge, we propose Bridging Dynamics Gaps for Cross-Domain Reinforcement Learning (BDGxRL), a novel framework that leverages Diffusion Schrödinger Bridge (DSB) to align source transitions with target-domain dynamics encoded in offline demonstrations. Moreover, we introduce a reward modulation mechanism that estimates rewards based on state transitions, applying to DSB-aligned samples to ensure consistency between rewards and target-domain dynamics. BDGxRL performs target-oriented policy learning entirely within the source domain, without access to the target environment or its rewards. Experiments on MuJoCo cross-domain benchmarks demonstrate that BDGxRL outperforms state-of-the-art baselines and shows strong adaptability under transition dynamics shifts.
[AI-35] Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在实现全感知能力(omni-perception)过程中,如何有效融合稳健的感官基础(sensory grounding)与复杂推理(complex reasoning)的问题,尤其是在东南亚(Southeast Asia, SEA)等低资源、文化多样性显著区域的适配难题。其解决方案的关键在于提出一种分阶段训练流程,明确解耦并逐步整合“系统1”(感知)与“系统2”(推理)能力:首先通过正交模态适应(orthogonal modality adaptation)构建区域特异性音频-视觉感知基座(如新加坡英语混杂语码转换、本地文化地标),再利用低成本的Generate-Judge-Refine管道合成高质量银数据(silver data),借助超大语言模型(Super-LLM)的共识机制过滤幻觉并迁移文本链式思维(Chain-of-Thought)至多模态场景,从而实现高效且具认知能力的训练。
链接: https://arxiv.org/abs/2602.23730
作者: Longyin Zhang,Shuo Sun,Yingxu He,Won Cheng Yi Lewis,Muhammad Huzaifah Bin Md Shahrin,Hardik Bhupendra Sailor,Heng Meng Jeremy Wong,Tarun Kumar Vangani,Yi Ma,Qiongqiong Wang,Minh Duc Pham,Ridong Jiang,Jingtao Li,Jingyi Liao,Zhuohan Liu,Yanfeng Lu,Manas Gupta,Ai Ti Aw
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) pursue omni-perception capabilities, yet integrating robust sensory grounding with complex reasoning remains a challenge, particularly for underrepresented regions. In this report, we introduce the research preview of MERaLiON2-Omni (Alpha), a 10B-parameter multilingual omni-perception model tailored for Southeast Asia (SEA). We present a progressive training pipeline that explicitly decouples and then integrates "System 1" (Perception) and "System 2" (Reasoning) capabilities. First, we establish a robust Perception Backbone by aligning region-specific audio-visual cues (e.g., Singlish code-switching, local cultural landmarks) with a multilingual LLM through orthogonal modality adaptation. Second, to inject cognitive capabilities without large-scale supervision, we propose a cost-effective Generate-Judge-Refine pipeline. By utilizing a Super-LLM to filter hallucinations and resolve conflicts via a consensus mechanism, we synthesize high-quality silver data that transfers textual Chain-of-Thought reasoning to multimodal scenarios. Comprehensive evaluation on our newly introduced SEA-Omni Benchmark Suite reveals an Efficiency-Stability Paradox: while reasoning acts as a non-linear amplifier for abstract tasks (boosting mathematical and instruction-following performance significantly), it introduces instability in low-level sensory processing. Specifically, we identify Temporal Drift in long-context audio, where extended reasoning desynchronizes the model from acoustic timestamps, and Visual Over-interpretation, where logic overrides pixel-level reality. This report details the architecture, the data-efficient training recipe, and a diagnostic analysis of the trade-offs between robust perception and structured reasoning.
[AI-36] SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud
【速读】:该论文旨在解决嵌入式人工智能(Embodied AI)在5G独立组网(SA)网络中实现亚秒级推理时面临的延迟与资源调度挑战,尤其是在异构计算层级(设备端、无线接入网边缘、云端)部署下如何保障实时基带处理不被干扰。其关键解决方案在于通过构建一个包含设备端、三节点RAN边缘集群(容器化5G RAN共驻)和云层的测试平台,量化评估不同模型变体(量化/非量化)及部署层级对服务等级协议(SLA)可行性的影响;结果表明,RAN边缘采用量化模型可稳定低于0.5秒延迟,而多实例GPU(MIG)隔离机制在高并发场景下能维持基带定时健康指标,支持安全共置,从而为跨层级AI推理提供可落地的优化路径。
链接: https://arxiv.org/abs/2602.23722
作者: Hariz Yet,Nguyen Thanh Tam,Mao V. Ngo,Lim Yi Shen,Lin Wei,Jihong Park,Binbin Chen,Tony Q. S. Quek
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE INFOCOM Workshops 2026 (6G AI-RAN 2026), Tokyo, Japan. This arXiv version is a preprint / author version
Abstract:Embodied AI requires sub-second inference near the Radio Access Network (RAN), but deployments span heterogeneous tiers (on-device, RAN-edge, cloud) and must not disrupt real-time baseband processing. We report measurements from a 5G Standalone (SA) AI-RAN testbed using a fixed baseline policy for repeatability. The setup includes an on-device tier, a three-node RAN-edge cluster co-hosting a containerized 5G RAN, and a cloud tier. We find that on-device execution remains multi-second and fails to meet sub-second budgets. At the RAN edge, SLA feasibility is primarily determined by model variant choice: quantized models concentrate below 0.5 s, while unquantized and some larger quantized models incur deadline misses due to stalls and queuing. In the cloud tier, meeting a 0.5 s deadline is challenging on the measured WAN path (up to 32.9% of requests complete within 0.5 s), but all evaluated variants meet a 1.0 s deadline (100% within 1.0 s). Under saturated downlink traffic and up to N=20 concurrent inference clients, Multi-Instance GPU (MIG) isolation preserves baseband timing-health proxies, supporting safe co-location under fixed partitioning.
[AI-37] The Auton Agentic AI Framework
【速读】:该论文旨在解决生成式 AI(Generative AI)向代理式 AI(Agentic AI)演进过程中存在的根本性架构不匹配问题:大型语言模型(LLMs)输出具有随机性和非结构化特征,而其需控制的后端基础设施(如数据库、API、云服务)则要求确定性和符合模式的输入。解决方案的关键在于提出 Auton Agentic AI 框架,通过严格分离“认知蓝图”(Cognitive Blueprint,即与语言无关的代理身份与能力声明)和“运行时引擎”(Runtime Engine,平台特定的执行基础架构),实现跨语言可移植性、形式化审计能力和模块化工具集成;同时引入模型上下文协议(MCP)、增强型部分可观测马尔可夫决策过程(augmented POMDP)、层次化记忆整合架构、约束流形形式化安全机制及运行时优化技术,系统性地提升代理系统的可控性、安全性与效率。
链接: https://arxiv.org/abs/2602.23720
作者: Sheng Cao,Zhao Chang,Chang Li,Hannan Li,Liyao Fu,Ji Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The field of Artificial Intelligence is undergoing a transition from Generative AI – probabilistic generation of text and images – to Agentic AI, in which autonomous systems execute actions within external environments on behalf of users. This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce stochastic, unstructured outputs, whereas the backend infrastructure they must control – databases, APIs, cloud services – requires deterministic, schema-conformant inputs. The present paper describes the Auton Agentic AI Framework, a principled architecture for standardizing the creation, execution, and governance of autonomous agent systems. The framework is organized around a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine, the platform-specific execution substrate that instantiates and runs the agent. This separation enables cross-language portability, formal auditability, and modular tool integration via the Model Context Protocol (MCP). The paper formalizes the agent execution model as an augmented Partially Observable Markov Decision Process (POMDP) with a latent reasoning space, introduces a hierarchical memory consolidation architecture inspired by biological episodic memory systems, defines a constraint manifold formalism for safety enforcement via policy projection rather than post-hoc filtering, presents a three-level self-evolution framework spanning in-context adaptation through reinforcement learning, and describes runtime optimizations – including parallel graph execution, speculative inference, and dynamic context pruning – that reduce end-to-end latency for multi-step agent workflows.
[AI-38] SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicle, UAV)动态决策中因复杂多变危险因素导致算法泛化能力不足的问题。尽管大语言模型(Large Language Model, LLM)具备语义理解与场景泛化能力,但其缺乏特定领域的无人机控制知识和形式化安全保证,限制了直接应用。解决方案的关键在于提出一种无需训练的两层决策架构:上层利用模糊控制屏障函数(fuzzy Control Barrier Function)对语义增强动作进行可证明的安全验证,确保LLM输出的安全性;下层通过基于星型分层图的检索增强生成系统(star-hierarchical graph-based retrieval-augmented generation),实现高效、弹性且可解释的场景自适应。实验表明,该框架在未知障碍物和突发威胁下的追逃场景中,可在不依赖在线训练的前提下显著提升安全性与泛化性能。
链接: https://arxiv.org/abs/2602.23719
作者: Wenzhe Zhao,Yang Zhao,Ganchao Liu,Zhiyu Jiang,Dandan Ma,Zihao Li,Xuelong Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:In UAV dynamic decision, complex and variable hazardous factors pose severe challenges to the generalization capability of algorithms. Despite offering semantic understanding and scene generalization, Large Language Models (LLM) lack domain-specific UAV control knowledge and formal safety assurances, restricting their direct applicability. To bridge this gap, this paper proposes a train-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control. The framework introduces three key contributions: 1) A fuzzy Control Barrier Function verification mechanism for semantically-augmented actions, providing provable safety certification for LLM outputs. 2) A star-hierarchical graph-based retrieval-augmented generation system, enabling efficient, elastic, and interpretable scene adaptation. 3) Systematic experimental validation in pursuit-evasion scenarios with unknown obstacles and emergent threats, demonstrating that our SAGE-LLM maintains performance while significantly enhancing safety and generalization without online training. The proposed framework demonstrates strong extensibility, suggesting its potential for generalization to broader embodied intelligence systems and safety-critical control domains.
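The paper's fuzzy-CBF mechanism is not public; as a baseline intuition, a plain discrete-time control-barrier check that accepts an action only if the barrier value h shrinks no faster than rate gamma can be sketched as follows. The 1-D single-integrator dynamics and all thresholds are illustrative assumptions:

```python
def cbf_safe(h, x, u, dynamics, gamma=0.2):
    """Discrete-time control-barrier condition: the candidate action u is
    certified only if h(x_next) >= (1 - gamma) * h(x), i.e. the system stays
    in the safe set {x : h(x) >= 0} and approaches its boundary slowly."""
    x_next = dynamics(x, u)
    return h(x_next) >= (1 - gamma) * h(x)

# toy 1-D UAV keeping distance from an obstacle at position 0
h = lambda x: x - 1.0          # safe set: x >= 1  (barrier value >= 0)
dynamics = lambda x, u: x + u  # single-integrator step

x = 2.0
print(cbf_safe(h, x, -0.1, dynamics))  # small step toward obstacle: allowed
print(cbf_safe(h, x, -0.5, dynamics))  # large step: rejected
```

In the paper's architecture such a check would sit between the LLM's proposed action and the low-level controller, vetoing outputs that violate the barrier condition.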
[AI-39] ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的电商对话购物代理在交互深度和上下文广度上的不足,以及深度研究(Deep Research)范式在迁移到电商领域时存在的领域差距问题。解决方案的关键在于提出一个名为ProductResearch的多智能体框架,该框架通过用户代理(User Agent)解析行为历史以推断精细购物意图,并由监督代理(Supervisor Agent)协同研究代理(Research Agent)生成高保真、长时程的工具使用轨迹;随后通过反思内化过程对这些多智能体交互轨迹进行严格过滤与提炼,将其转化为连贯的单角色训练样本,从而有效微调LLM代理以应对复杂购物查询。
链接: https://arxiv.org/abs/2602.23716
作者: Jiangyuan Wang,Kejun Xiao,Huaipeng Zhao,Tao Luo,Xiaoyi Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based agents show promise for e-commerce conversational shopping, yet existing implementations lack the interaction depth and contextual breadth required for complex product research. Meanwhile, the Deep Research paradigm, despite advancing information synthesis in web search, suffers from domain gaps when transferred to e-commerce. We propose ProductResearch, a multi-agent framework that synthesizes high-fidelity, long-horizon tool-use trajectories for training robust e-commerce shopping agents. The framework employs a User Agent to infer nuanced shopping intents from behavioral histories, and a Supervisor Agent that orchestrates iterative collaboration with a Research Agent to generate synthetic trajectories culminating in comprehensive, insightful product research reports. These trajectories are rigorously filtered and distilled through a reflective internalization process that consolidates multi-agent supervisory interactions into coherent single-role training examples, enabling effective fine-tuning of LLM agents for complex shopping inquiries. Extensive experiments show that a compact MoE model fine-tuned on our synthetic data achieves substantial improvements over its base model in response comprehensiveness, research depth, and user-perceived utility, approaching the performance of frontier proprietary deep research systems and establishing multi-agent synthetic trajectory training as an effective and scalable paradigm for enhancing LLM-based shopping assistance.
[AI-40] From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM -based Multi-Agent Systems
【速读】:该论文旨在解决大语言模型驱动的多智能体系统(LLM-powered Multi-Agent Systems, MAS)在复杂任务中因固有脆弱性和不透明的失败机制而导致的可解释性差问题。现有方法通常将执行日志视为线性序列,无法有效解析多智能体间复杂的因果关系,导致责任边界模糊、故障定位不准。其解决方案的关键在于提出CHIEF框架:首先将混乱的执行轨迹转化为结构化的分层因果图(hierarchical causal graph),再通过合成虚拟Oracle引导的回溯策略高效剪枝搜索空间,最终利用渐进式因果筛选策略实施反事实归因,从而精准区分根本原因与传播症状。实验证明,该方法在WhoWhen基准上显著优于八种先进基线,在代理级和步骤级准确率上均取得提升。
链接: https://arxiv.org/abs/2602.23701
作者: Yawen Wang,Wenjie Wu,Junjie Wang,Qing Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:LLM-powered Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine-tuning, typically treat execution logs as flat sequences. This linear perspective fails to disentangle the intricate causal links inherent to MAS, leading to weak observability and ambiguous responsibility boundaries. To address these challenges, we propose CHIEF, a novel framework that transforms chaotic trajectories into a structured hierarchical causal graph. It then employs hierarchical oracle-guided backtracking to efficiently prune the search space via synthesized virtual oracles. Finally, it implements counterfactual attribution via a progressive causal screening strategy to rigorously distinguish true root causes from propagated symptoms. Experiments on the WhoWhen benchmark show that CHIEF outperforms eight strong state-of-the-art baselines on both agent- and step-level accuracy. Ablation studies further confirm the critical role of each proposed module.
[AI-41] Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training
【速读】:该论文旨在解决小规模Transformer模型训练过程中参数更新轨迹的几何结构问题,特别是理解优化器如何影响学习路径的有效维度与动态特性。其关键解决方案在于采用未中心化、行归一化的轨迹主成分分析(PCA),揭示出参数更新主要沿一个主导漂移方向演化,而横向残差动态则编码了辅助探测任务性能的振荡行为;同时发现AdamW与SGD类优化器在轨迹几何上存在显著差异:前者形成多维漂移结构,后者则趋向共线性演化,且这种差异无法仅从损失值中体现,说明优化器选择对学习轨迹的内在结构具有决定性作用。
链接: https://arxiv.org/abs/2602.23696
作者: Yongzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures
Abstract:We study the geometry of training trajectories in small transformer models and find that parameter updates organize into a dominant drift direction with transverse residual dynamics. Using uncentered, row-normalized trajectory PCA, we show that a single direction captures a large fraction of cumulative parameter movement early in training, while remaining components encode oscillatory behavior in auxiliary probe performance. Instantaneous gradients exhibit little alignment with this dominant direction, indicating that it arises from accumulated optimizer updates rather than per-batch gradient structure. Comparing AdamW with SGD variants at matched loss levels reveals substantial differences in trajectory geometry: AdamW develops multi-dimensional drift structure, whereas SGD-family optimizers produce nearly colinear parameter evolution and weaker probe dynamics. Reheating selectively perturbs transverse components with minimal effect on the dominant drift coordinate. These findings suggest that optimizer choice shapes the effective dimensionality and structure of learning trajectories beyond what is apparent from loss values alone.
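Uncentered, row-normalized trajectory PCA as described reduces to a plain SVD: normalize each parameter snapshot (row) to unit norm and decompose without subtracting the mean, so the first component captures cumulative drift rather than variance around it. The synthetic drifting trajectory below is an illustrative assumption:

```python
import numpy as np

def trajectory_pca(W):
    """Uncentered, row-normalized trajectory PCA. Rows of W are parameter
    snapshots over training; each row is normalized to unit norm, then an
    SVD is taken WITHOUT mean-subtraction, so the leading right singular
    vector captures the dominant drift direction."""
    rows = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    U, s, Vt = np.linalg.svd(rows, full_matrices=False)
    explained = s**2 / (s**2).sum()
    return Vt, explained

rng = np.random.default_rng(0)
drift = rng.standard_normal(50)  # a fixed drift direction in parameter space
# 40 snapshots moving along that direction plus small transverse noise
W = np.outer(np.linspace(0.1, 1.0, 40), drift) + 0.01 * rng.standard_normal((40, 50))
components, explained = trajectory_pca(W)
print(explained[0])  # the single drift component explains nearly all movement
```

The remaining components (`components[1:]`) then play the role of the transverse residual directions discussed in the abstract.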
[AI-42] Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
【速读】:该论文旨在解决在危险环境中(如灾害现场和工业设施)实现可靠且直观的移动机器人与无人机(UAV)遥操作问题,尤其针对现有基于视觉的手势识别方法在遮挡、光照变化和复杂背景等实际场景下性能下降的问题。其解决方案的关键在于提出一种多模态手势识别框架,通过融合来自双侧Apple Watch的惯性测量单元(IMU)数据(包括加速度计、陀螺仪和方向信息)与定制手套上的电容式传感信号,并采用基于对数似然比(log-likelihood ratio, LLR)的后融合策略,不仅显著提升了识别准确率,还增强了模型的可解释性,能够量化各模态对决策的贡献。该方法在保持与先进视觉基线相当性能的同时,大幅降低计算开销、模型规模和训练时间,适用于实时机器人控制场景。
链接: https://arxiv.org/abs/2602.23694
作者: Seungyeol Baek,Jaspreet Singh,Lala Shakti Swarup Ray,Hymalai Bello,Paul Lukowicz,Sungho Suh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
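Under a conditional-independence assumption across modalities, LLR-based late fusion reduces to summing per-modality class log-likelihoods, which also exposes how much each sensor contributed to the decision. A minimal sketch (the class probabilities below are made up for illustration):

```python
import numpy as np

def llr_fusion(log_probs_per_modality, prior=None):
    """Late fusion of per-modality class log-likelihoods. Assuming the
    modalities are conditionally independent given the gesture class,
    log-likelihoods simply add; each modality's term is its interpretable
    contribution to the fused score."""
    total = np.sum(log_probs_per_modality, axis=0)
    if prior is not None:
        total = total + np.log(prior)
    return total.argmax(), total

# 3 gesture classes, two modalities (wrist IMUs, capacitive glove)
imu = np.log(np.array([0.6, 0.3, 0.1]))
glove = np.log(np.array([0.2, 0.7, 0.1]))
pred, scores = llr_fusion([imu, glove])
# per-modality contribution to the winning class is directly readable:
print(pred, imu[pred], glove[pred])
```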
[AI-43] ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)在推理过程中因盲目依赖均匀暴力采样(如固定最优N或自一致性策略)而导致的计算效率低下、可解释性差以及过度思考(overthinking)引发收益递减的问题。其解决方案的核心在于提出ODAR-Expert——一个基于原则性资源分配的自适应路由框架,通过 amortized active inference(摊销主动推断)构建难度估计器,动态地将查询路由至启发式快代理(Fast Agent)与反思式慢代理(Slow Agent)之间;同时引入以自由能(free energy)为指导的风险敏感融合机制,通过最小化变分自由能目标,在对数似然与认知不确定性(varentropy)之间实现平衡,从而替代传统的非结构化投票策略,实现更高效且准确的答案选择。
链接: https://arxiv.org/abs/2602.23681
作者: Siyuan Ma,Bo Gao,Xiaojun Jia,Simeng Qin,Tianlin Li,Ke Ma,Xiaoshuang Jia,Wenqi Ren,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The paradigm of large language model (LLM) reasoning is shifting from parameter scaling to test-time compute scaling, yet many existing approaches still rely on uniform brute-force sampling (for example, fixed best-of-N or self-consistency) that is costly, hard to attribute, and can trigger overthinking with diminishing returns. We propose ODAR-Expert, an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. ODAR uses a difficulty estimator grounded in amortized active inference to dynamically route queries between a heuristic Fast Agent and a deliberative Slow Agent. We further introduce a free-energy-principled, risk-sensitive fusion mechanism that selects answers by minimizing a variational free energy objective, balancing log-likelihood with epistemic uncertainty (varentropy) as a principled alternative to ad hoc voting over heterogeneous candidates. Extensive evaluation across 23 benchmarks shows strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity’s Last Exam (HLE), while improving the compute-accuracy frontier under compute-matched settings. We also validate reproducibility on a fully open-source stack (Llama 4 + DeepSeek), where ODAR surpasses homogeneous sampling strategies while reducing computational costs by 82%. Overall, our results suggest that thinking-optimal scaling requires adaptive resource allocation with free-energy-based decision-making rather than simply increasing test-time compute.
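A minimal reading of the free-energy selection rule is F = -log-likelihood + beta * varentropy, scored per candidate answer. The candidate tuples, beta, and token distributions below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def varentropy(logp):
    """Variance of the surprisal -log p under the model's own distribution:
    E[(-log p - H)^2], where H is the entropy."""
    p = np.exp(logp)
    h = -(p * logp).sum()                 # entropy
    return (p * (-logp - h) ** 2).sum()   # varentropy

def select_answer(candidates, beta=1.0):
    """Pick the candidate minimizing F = -loglik + beta * varentropy,
    trading raw confidence against epistemic uncertainty."""
    best, best_f = None, float("inf")
    for answer, loglik, token_logp in candidates:
        f = -loglik + beta * varentropy(token_logp)
        if f < best_f:
            best, best_f = answer, f
    return best

# candidate A: slightly higher likelihood, but a spiky, torn distribution
peaked = np.log(np.array([0.9, 0.05, 0.05]))
flatter = np.log(np.array([0.4, 0.3, 0.3]))
cands = [("A", -1.0, peaked), ("B", -0.9, flatter)]
print(select_answer(cands))
```

Note that the spiky distribution has the *higher* varentropy (its surprisal is very uneven across outcomes), so the rule can prefer a candidate with slightly lower likelihood but more stable uncertainty.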
[AI-44] PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂长程任务中因依赖反应式决策范式(如ReAct)而导致的冗余工具调用、推理不稳定以及高Token消耗等问题。其解决方案的关键在于提出PseudoAct框架,通过合成结构化的伪代码计划来实现灵活的规划与动作控制:该框架利用LLM将任务求解策略表达为代码,生成包含序列、条件、循环、并行组合等控制流逻辑的伪代码,从而显式定义全局决策逻辑并确保时序一致性,有效减少冗余动作、防止无限循环,并避免无信息的替代探索路径,最终提升长程任务中的决策效率与成功率。
链接: https://arxiv.org/abs/2602.23668
作者: Yihan (Logon) Wen,Xin Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, unstable reasoning, and high token consumption in complex long-horizon tasks involving branching, iteration, or multi-tool coordination. To address these limitations, this paper introduces PseudoAct, a novel framework for flexible planning and action control in LLM agents through pseudocode synthesis. Leveraging the ability of LLMs to express task-solving strategies as code, PseudoAct synthesizes a structured pseudocode plan that decomposes a task into subtasks and explicitly encodes control flow, including sequencing, conditionals, loops, parallel composition, and combinations of these logic primitives. Actions are then executed by following this global plan, making the decision logic explicit and temporally coherent. This design reduces redundant actions, prevents infinite loops, and avoids uninformative alternative exploration, enabling consistent and efficient long-horizon decision-making. Experiments on benchmark datasets show that our method significantly outperforms existing reactive agent approaches, achieving a 20.93% absolute gain in success rate on FEVER and setting a new state-of-the-art on HotpotQA.
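下面是一个执行此类伪代码计划的玩具解释器。节点形式(seq/if/while/act)与循环上限均为示意性假设,并非 PseudoAct 的实际表示:

```python
def execute_plan(plan, tools, state):
    """递归执行结构化伪代码计划。假设的节点形式:
      ("seq", [node, ...])          顺序执行子节点
      ("if", cond_fn, then, else_)  基于当前状态分支
      ("while", cond_fn, body)      带显式上限的循环,防止死循环
      ("act", tool_name, arg)       调用工具
    """
    kind = plan[0]
    if kind == "seq":
        for child in plan[1]:
            execute_plan(child, tools, state)
    elif kind == "if":
        _, cond, then_node, else_node = plan
        execute_plan(then_node if cond(state) else else_node, tools, state)
    elif kind == "while":
        _, cond, body = plan
        steps = 0
        while cond(state) and steps < 100:  # 硬性上限:杜绝无限循环
            execute_plan(body, tools, state)
            steps += 1
    elif kind == "act":
        _, tool_name, arg = plan
        state["log"].append(tools[tool_name](state, arg))
    return state

# 玩具工具与计划:循环检索直至命中,再作答。
tools = {
    "search": lambda s, q: s.update(found=True) or f"search({q})",
    "answer": lambda s, a: f"answer({a})",
}
state = {"found": False, "log": []}
plan = ("seq", [
    ("while", lambda s: not s["found"], ("act", "search", "query")),
    ("act", "answer", "final"),
])
execute_plan(plan, tools, state)
```

计划先于执行被完整合成,控制流显式化,因而不会像纯反应式代理那样在执行历史中反复打转。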
[AI-45] Blockchain-Enabled Routing for Zero-Trust Low-Altitude Intelligent Networks
【速读】:该论文旨在解决低空智能网络(Low-altitude Intelligent Networks, LAINs)中因无人机(Unmanned Aerial Vehicles, UAVs)分布式拓扑和高移动性导致的路由稳定性与安全性问题,尤其是在潜在安全威胁下如何保障数据传输的可靠性与效率。解决方案的关键在于构建一种结合零信任架构(Zero-Trust Architecture)、软件定义边界(Software-Defined Perimeter, SDP)与区块链技术的综合安全机制,以实现对UAV身份与移动性的动态管理;同时,将原问题建模为一个优化端到端(End-to-End, E2E)延迟与传输成功率(Transmission Success Ratio, TSR)的整数非线性规划问题,并通过转化为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),设计基于多智能体双深度Q网络(Multi-Agent Double Deep Q-Network, MADDQN)的路由算法,辅以软分层经验回放缓冲区和优先级经验回放机制,从而在复杂环境中实现高效、鲁棒的路由决策。
链接: https://arxiv.org/abs/2602.23667
作者: Ziye Jia,Sijie He,Ligang Yuan,Fuhui Zhou,Qihui Wu,Zhu Han,Dusit Niyato
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 18 pages, Accepted by JSAC
Abstract:Due to their scalability and portability, low-altitude intelligent networks (LAINs) are essential in various fields such as surveillance and disaster rescue. However, in LAINs, unmanned aerial vehicles (UAVs) are characterized by the distributed topology and high mobility, thus vulnerable to security threats, which may degrade routing performances for data transmissions. Hence, how to ensure the routing stability and security of LAINs is challenging. In this paper, we focus on the routing with multiple UAV clusters in LAINs. To minimize the damage caused by potential threats, we present the zero-trust architecture with the software-defined perimeter and blockchain techniques to manage the identity and mobility of UAVs. Besides, we formulate the routing problem to optimize the end-to-end (E2E) delay and transmission success ratio (TSR) simultaneously, which is an integer nonlinear programming problem and intractable to solve. Therefore, we reformulate the problem into a decentralized partially observable Markov decision process. We design the multi-agent double deep Q-network-based routing algorithms to solve the problem, empowered by the soft-hierarchical experience replay buffer and prioritized experience replay mechanisms. Finally, extensive simulations are conducted and the numerical results demonstrate that the proposed framework reduces the average E2E delay by 59% and improves the TSR by 29% on average compared to benchmarks, while simultaneously enabling faster and more robust identification of low-trust UAVs.
[AI-46] AudioCapBench: Quick Evaluation on Audio Captioning across Sound Music and Speech
【速读】:该论文旨在解决当前大模型在音频描述生成(audio captioning)能力评估中缺乏统一、多维度基准测试的问题。现有研究往往仅依赖单一指标或局限于特定音频类别,难以全面衡量模型对环境声、音乐和语音等多样化音频内容的理解与生成性能。解决方案的关键在于提出AudioCapBench——一个覆盖三个典型音频域(environmental sound, music, speech)的标准化评估基准,包含1000个精心筛选的样本,并采用双轨评估机制:一方面使用传统参考文本指标(METEOR、BLEU、ROUGE-L),另一方面引入LLM-as-Judge框架,从准确性(accuracy)、完整性(completeness)和幻觉程度(hallucination)三个正交维度量化模型输出质量。这一设计显著提升了评估的科学性与可复现性,为音频理解领域的模型开发与比较提供了可靠工具。
链接: https://arxiv.org/abs/2602.23649
作者: Jielin Qiu,Jianguo Zhang,Zixiang Chen,Liangwei Yang,Ming Zhu,Juntao Tan,Haolin Chen,Wenting Zhao,Rithesh Murthy,Roshan Ram,Akshara Prabhakar,Shelby Heinecke,Caiming Xiong,Silvio Savarese,Huan Wang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains, including environmental sound, music, and speech, with 1,000 curated evaluation samples drawn from established datasets. We evaluate 13 models across two providers (OpenAI, Google Gemini) using both reference-based metrics (METEOR, BLEU, ROUGE-L) and an LLM-as-Judge framework that scores predictions on three orthogonal dimensions: accuracy (semantic correctness), completeness (coverage of reference content), and hallucination (absence of fabricated content). Our results reveal that Gemini models generally outperform OpenAI models on overall captioning quality, with Gemini 3 Pro achieving the highest overall score (6.00/10), while OpenAI models exhibit lower hallucination rates. All models perform best on speech captioning and worst on music captioning. We release the benchmark as well as evaluation code to facilitate reproducible audio understanding research.
[AI-47] AI Must Embrace Specialization via Superhuman Adaptable Intelligence
【速读】:该论文试图解决当前关于人工通用智能(Artificial General Intelligence, AGI)概念的模糊性和误导性问题,指出其定义不一致、难以实现且不符合AI发展的实际方向。论文认为,AGI追求“人类水平”的泛化能力本身存在逻辑缺陷,因为人类并非真正意义上的通用智能体。解决方案的关键在于提出“超人类适应性智能”(Superhuman Adaptable Intelligence, SAI),即一种专注于特定领域并能超越人类表现、同时填补人类能力盲区的智能形态。SAI强调专业化与超人性能的结合,旨在为AI发展提供更清晰、可操作和实用的未来导向框架,从而替代被过度泛化的AGI概念,推动更具建设性的技术讨论与应用落地。
链接: https://arxiv.org/abs/2602.23643
作者: Judah Goldfeder,Philippe Wyder,Yann LeCun,Ravid Shwartz Ziv
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Everyone from AI executives and researchers to doomsayers, politicians, and activists is talking about Artificial General Intelligence (AGI). Yet, they often don’t seem to agree on its exact definition. One common definition of AGI is an AI that can do everything a human can do, but are humans truly general? In this paper, we address what’s wrong with our conception of AGI, and why, even in its most coherent formulation, it is a flawed concept to describe the future of AI. We explore whether the most widely accepted definitions are plausible, useful, and truly general. We argue that AI must embrace specialization, rather than strive for generality, and in its specialization strive for superhuman performance, and introduce Superhuman Adaptable Intelligence (SAI). SAI is defined as intelligence that can learn to exceed humans at anything important that we can do, and that can fill in the skill gaps where humans are incapable. We then lay out how SAI can help hone a discussion around AI that was blurred by an overloaded definition of AGI, and extrapolate the implications of using it as a guide for the future.
[AI-48] FedRot-LoRA: Mitigating Rotational Misalignment in Federated LoRA
【速读】:该论文旨在解决联邦LoRA(Federated LoRA)在聚合本地更新时因因子平均(factor-wise averaging)与数学上正确的更新聚合不一致而导致的显著聚合误差和训练不稳定问题。其关键在于识别出问题根源为低秩分解的旋转不变性所引发的旋转错位(rotational misalignment)——即不同客户端的语义等价更新可能位于不同的潜在子空间中,直接平均会导致破坏性干扰。为此,作者提出FedRot-LoRA框架,在聚合前通过正交变换对齐客户端更新,从而在不增加通信开销或限制模型表达能力的前提下,有效减少跨客户端子空间不匹配,提升全局更新的稳定性与准确性。
链接: https://arxiv.org/abs/2602.23638
作者: Haoran Zhang,Dongjun Kim,Seohyeon Cha,Haris Vikalo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Federated LoRA provides a communication-efficient mechanism for fine-tuning large language models on decentralized data. In practice, however, a discrepancy between the factor-wise averaging used to preserve low rank and the mathematically correct aggregation of local updates can cause significant aggregation error and unstable training. We argue that a major source of this problem is rotational misalignment, arising from the rotational invariance of low-rank factorizations – semantically equivalent updates can be represented in different latent subspaces across clients since (B_i R_i)(R_i^T A_i) = B_i A_i. When such misaligned factors are averaged directly, they interfere destructively and degrade the global update. To address this issue, we propose FedRot-LoRA, a federated LoRA framework that aligns client updates via orthogonal transformations prior to aggregation. This alignment preserves the semantic update while reducing cross-client subspace mismatch, without increasing communication cost or restricting model expressivity. We provide a convergence analysis that examines the aggregation error induced by factor-wise averaging and shows how rotational alignment yields a tighter upper bound on this error. Extensive experiments on natural language understanding and generative tasks demonstrate that FedRot-LoRA consistently outperforms existing federated LoRA baselines across a range of heterogeneity levels and LoRA ranks.
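低秩分解的旋转模糊性及其修复可以数值演示。下面的示意在平均前通过正交 Procrustes 将客户端因子对齐到参考因子;对齐目标与直接使用 SVD 闭式解均为示意性假设,并非 FedRot-LoRA 的实际流程:

```python
import numpy as np

def align_factors(B, A, A_ref):
    """求正交矩阵 R 使 R @ A 最接近 A_ref(正交 Procrustes),
    再以 (B @ R.T, R @ A) 替换因子,乘积 B @ A 保持不变。"""
    U, _, Vt = np.linalg.svd(A_ref @ A.T)
    R = U @ Vt                       # r x r 正交矩阵
    return B @ R.T, R @ A

rng = np.random.default_rng(0)
r, d = 4, 16
B_ref, A_ref = rng.normal(size=(d, r)), rng.normal(size=(r, d))

# 第二个客户端持有相同更新,但潜在基底被旋转:(B Q)(Q^T A) = B A
Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
B2, A2 = B_ref @ Q, Q.T @ A_ref

# 直接逐因子平均会扭曲更新……
naive = ((B_ref + B2) / 2) @ ((A_ref + A2) / 2)
# ……先对齐再平均则能还原真实更新。
B2a, A2a = align_factors(B2, A2, A_ref)
aligned = ((B_ref + B2a) / 2) @ ((A_ref + A2a) / 2)
true_update = B_ref @ A_ref
```

两个语义等价的更新若基底错位,逐因子平均会产生破坏性干扰;正交对齐后平均则无损。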
[AI-49] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
【速读】:该论文旨在解决当前大语言模型(Large Language Model, LLM)内容审核机制在实际部署中因“严格度”(strictness)变化而导致的鲁棒性不足问题。现有基于二分类的防护模型隐含假设有害性定义固定不变,但在真实场景中,不同平台或随时间演进的监管要求会导致对“有害性”的定义动态调整,从而使得传统方法在跨严格度环境下性能显著下降。解决方案的关键在于提出 FlexGuard——一种输出校准后连续风险评分(calibrated continuous risk score)的 LLM 基础审核模型,并通过风险对齐优化(risk-alignment optimization)提升评分与风险严重程度的一致性,同时提供阈值选择策略以支持针对目标严格度的灵活决策,从而实现跨严格度环境下的高准确率与强鲁棒性。
链接: https://arxiv.org/abs/2602.23636
作者: Zhihao Ding,Jinming Li,Ze Lu,Jieming Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Ensuring the safety of LLM-generated content is essential for real-world deployment. Most existing guardrail models formulate moderation as a fixed binary classification task, implicitly assuming a fixed definition of harmfulness. In practice, enforcement strictness - how conservatively harmfulness is defined and enforced - varies across platforms and evolves over time, making binary moderators brittle under shifting requirements. We first introduce FlexBench, a strictness-adaptive LLM moderation benchmark that enables controlled evaluation under multiple strictness regimes. Experiments on FlexBench reveal substantial cross-strictness inconsistency in existing moderators: models that perform well under one regime can degrade substantially under others, limiting their practical usability. To address this, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score reflecting risk severity and supports strictness-specific decisions via thresholding. We train FlexGuard via risk-alignment optimization to improve score-severity consistency and provide practical threshold selection strategies to adapt to target strictness at deployment. Experiments on FlexBench and public benchmarks demonstrate that FlexGuard achieves higher moderation accuracy and substantially improved robustness under varying strictness. We release the source code and data to support reproducibility.
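严格度自适应的决策规则最终归结为对同一个校准分数施加不同阈值。以下为极简示意,各严格度档位的阈值为假设值:

```python
def moderate(risk_score, strictness_threshold):
    """由连续风险分数(假定已校准到 [0, 1])得到二元审核决策。"""
    return "block" if risk_score >= strictness_threshold else "allow"

# 同一个校准打分器服务多种部署严格度:只换阈值,不换模型。
thresholds = {"lenient": 0.8, "standard": 0.5, "strict": 0.2}
score = 0.35  # 某段生成文本的假设校准风险分
decisions = {regime: moderate(score, t) for regime, t in thresholds.items()}
```

这正是连续打分相对固定二分类的优势:部署方无需重训模型即可随监管要求调整严格度。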
[AI-50] MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs
【速读】:该论文旨在解决现有训练数据合成方法在长尾知识覆盖、有效性验证和可解释性方面的局限性,以及基于知识图谱的方法在功能完整性、粒度控制、定制化程度和评估体系上的不足。其解决方案的关键在于提出MMKG-RDS框架,该框架利用多模态知识图谱(Multimodal Knowledge Graph, MKG)实现细粒度的知识抽取、可定制的路径采样机制以及多维度的数据质量评分体系,从而有效提升模型推理能力,并通过构建MMKG-RDS-Bench数据集验证了其在多个领域和任务类型中的泛化性能与实用性。
链接: https://arxiv.org/abs/2602.23632
作者: Lun Zhan,Feng Xiong,Huanyong Liu,Feng Zhang,Yuhui Yin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Synthesizing high-quality training data is crucial for enhancing domain models’ reasoning abilities. Existing methods face limitations in long-tail knowledge coverage, effectiveness verification, and interpretability. Knowledge-graph-based approaches still fall short in functionality, granularity, customizability, and evaluation. To address these issues, we propose MMKG-RDS, a flexible framework for reasoning data synthesis that leverages multimodal knowledge graphs. It supports fine-grained knowledge extraction, customizable path sampling, and multidimensional data quality scoring. We validate MMKG-RDS with the MMKG-RDS-Bench dataset, covering five domains, 17 task types, and 14,950 samples. Experimental results show fine-tuning Qwen3 models (0.6B/8B/32B) on a small number of synthesized samples improves reasoning accuracy by 9.2%. The framework also generates distinct data, challenging existing models on tasks involving tables and formulas, useful for complex benchmark construction. The dataset and code are available at this https URL
[AI-51] When Does Multimodal Learning Help in Healthcare? A Benchmark on EHR and Chest X-Ray Fusion
【速读】:该论文旨在解决多模态学习在临床决策支持中的实际有效性问题,特别是在模态缺失和算法公平性约束下的表现不确定性。其核心挑战在于:何时多模态融合(如电子健康记录EHR与胸部X光片CXR)能真正提升预测性能、不同融合策略的比较效果、模型对不完整输入的鲁棒性,以及是否能实现算法公平。解决方案的关键在于系统性地评估多种融合方法在MIMIC-IV和MIMIC-CXR标准化数据集上的表现,并揭示出:当模态完整时,融合可提升性能,但收益集中在需互补信息的疾病;跨模态学习机制虽能捕捉临床相关依赖关系,但EHR的时间结构导致模态不平衡,仅靠架构复杂度无法解决;在现实模态缺失场景下,除非显式设计处理不完整输入的机制,否则多模态优势迅速下降;此外,多模态融合并不自动带来公平性改善,子群体差异主要源于不同人口统计组间的敏感性不均。研究进一步发布了一个灵活的开源基准工具包,支持新模型和数据集的即插即用集成,为构建可部署、可靠且公平的多模态临床系统提供实证依据与技术路径。
链接: https://arxiv.org/abs/2602.23614
作者: Kejing Yin,Haizhou Xu,Wenfang Yao,Chen Liu,Zijie Chen,Yui Haang Cheung,William K. Cheung,Jing Qin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine learning holds promise for advancing clinical decision support, yet it remains unclear when multimodal learning truly helps in practice, particularly under modality missingness and fairness constraints. In this work, we conduct a systematic benchmark of multimodal fusion between Electronic Health Records (EHR) and chest X-rays (CXR) on standardized cohorts from MIMIC-IV and MIMIC-CXR, aiming to answer four fundamental questions: when multimodal fusion improves clinical prediction, how different fusion strategies compare, how robust existing methods are to missing modalities, and whether multimodal models achieve algorithmic fairness. Our study reveals several key insights. Multimodal fusion improves performance when modalities are complete, with gains concentrating in diseases that require complementary information from both EHR and CXR. While cross-modal learning mechanisms capture clinically meaningful dependencies beyond simple concatenation, the rich temporal structure of EHR introduces strong modality imbalance that architectural complexity alone cannot overcome. Under realistic missingness, multimodal benefits rapidly degrade unless models are explicitly designed to handle incomplete inputs. Moreover, multimodal fusion does not inherently improve fairness, with subgroup disparities mainly arising from unequal sensitivity across demographic groups. To support reproducible and extensible evaluation, we further release a flexible benchmarking toolkit that enables plug-and-play integration of new models and datasets. Together, this work provides actionable guidance on when multimodal learning helps, when it fails, and why, laying the foundation for developing clinically deployable multimodal systems that are both effective and reliable. The open-source toolkit can be found at this https URL.
[AI-52] SleepLM: Natural-Language Intelligence for Human Sleep
【速读】:该论文旨在解决现有基于学习的睡眠分析系统在标签空间上受限(如预定义的睡眠阶段或事件)且无法描述、查询或泛化至新型睡眠现象的问题。其核心解决方案是提出SleepLM,一个将自然语言与多模态多导睡眠图(polysomnography, PSG)对齐的睡眠语言基础模型,通过构建首个大规模睡眠文本数据集(超过100K小时来自10,000名个体的数据)和设计统一的预训练目标(结合对比对齐、文本生成与信号重建),实现语言引导的睡眠生理表征学习,从而支持零样本/少样本学习、跨模态检索、事件定位及未见任务的泛化能力。
链接: https://arxiv.org/abs/2602.23605
作者: Zongzhe Xu,Zitao Shuai,Eideen Mozaffari,Ravi S. Aysola,Rajesh Kumar,Yuzhe Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present SleepLM, a family of sleep-language foundation models that enable human sleep alignment, interpretation, and interaction with natural language. Despite the critical role of sleep, learning-based sleep analysis systems operate in closed label spaces (e.g., predefined stages or events) and fail to describe, query, or generalize to novel sleep phenomena. SleepLM bridges natural language and multimodal polysomnography, enabling language-grounded representations of sleep physiology. To support this alignment, we introduce a multilevel sleep caption generation pipeline that enables the curation of the first large-scale sleep-text dataset, comprising over 100K hours of data from more than 10,000 individuals. Furthermore, we present a unified pretraining objective that combines contrastive alignment, caption generation, and signal reconstruction to better capture physiological fidelity and cross-modal interactions. Extensive experiments on real-world sleep understanding tasks verify that SleepLM outperforms state-of-the-art in zero-shot and few-shot learning, cross-modal retrieval, and sleep captioning. Importantly, SleepLM also exhibits intriguing capabilities including language-guided event localization, targeted insight generation, and zero-shot generalization to unseen tasks. All code and data will be open-sourced.
[AI-53] KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning
【速读】:该论文旨在解决记忆增强型大语言模型(Memory-augmented Large Language Models, LLMs)在具身规划任务中因存储原始文本形式的记忆而导致提示过长、预填充延迟高,以及KV缓存频繁更新造成效率下降的问题。解决方案的关键在于提出一种以KV缓存为中心的记忆管理机制——KEEP,其核心创新包括:(1) 静态-动态记忆构建算法,通过混合粒度的记忆分组减少KV缓存的重复计算;(2) 多跳记忆重计算算法,动态识别不同记忆组间的交叉注意力关系并迭代重构记忆交互;(3) 层平衡记忆加载机制,消除不同层间KV缓存加载和交叉注意力计算的不平衡。实验表明,KEEP在ALFRED数据集上相较文本记忆方法实现2.68倍加速且精度损失可忽略,并优于CacheBlend方法,在成功率提升4.13%的同时将首次标记时间(TTFT)降低1.90倍。
链接: https://arxiv.org/abs/2602.23592
作者: Zebin Yang,Tong Xie,Baotong Lu,Shaoshan Liu,Bo Yu,Meng Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: DAC 2026
Abstract:Memory-augmented Large Language Models (LLMs) have demonstrated remarkable capability for complex and long-horizon embodied planning. By keeping track of past experiences and environmental states, memory enables LLMs to maintain a global view, thereby avoiding repetitive exploration. However, existing approaches often store the memory as raw text, leading to excessively long prompts and high prefill latency. While it is possible to store and reuse the KV caches, the efficiency benefits are greatly undermined due to frequent KV cache updates. In this paper, we propose KEEP, a KV-cache-centric memory management system for efficient embodied planning. KEEP features 3 key innovations: (1) a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation by mixed-granularity memory group; (2) a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention among different memory groups and reconstructs memory interactions iteratively; (3) a Layer-balanced Memory Loading that eliminates unbalanced KV cache loading and cross-attention computation across different layers. Extensive experimental results have demonstrated that KEEP achieves 2.68x speedup with negligible accuracy loss compared with text-based memory methods on ALFRED dataset. Compared with the KV re-computation method CacheBlend (EuroSys’25), KEEP shows 4.13% success rate improvement and 1.90x time-to-first-token (TTFT) reduction. Our code is available on this https URL.
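按记忆组粒度复用 KV 缓存的收益可用一个以内容为键的玩具缓存说明。组名、哈希键与用字符串代替 KV 张量均为假设,示意仅展示:未变化的组可跳过昂贵的预填充:

```python
import hashlib

class GroupKVCache:
    """以记忆组内容为键的玩具 KV 缓存:组内容未变则命中缓存,
    只有发生变化的组才重新"预填充"。"""
    def __init__(self):
        self.cache = {}
        self.prefills = 0

    def prefill(self, group_text):
        key = hashlib.sha256(group_text.encode()).hexdigest()
        if key not in self.cache:
            self.prefills += 1                  # 昂贵路径:重算该组 KV
            self.cache[key] = f"kv({group_text})"
        return self.cache[key]

kv = GroupKVCache()
# 静态组(地图、已见物体)保持不变;动态组(智能体状态)逐步变化。
for step_state in ["at:kitchen", "at:table"]:
    kvs = [kv.prefill(g) for g in ["map:rooms", "objects:seen", step_state]]
```

两个规划步共 6 次组查询,只触发 4 次预填充;真实系统还需处理跨组注意力的重算,此处从略。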
[AI-54] SDMixer: Sparse Dual-Mixer for Time Series Forecasting
【速读】:该论文旨在解决多变量时间序列预测中普遍存在的多尺度特征、弱相关性及噪声干扰等问题,这些问题限制了现有模型的预测性能。其解决方案的关键在于提出了一种双流稀疏Mixer预测框架(dual-stream sparse Mixer prediction framework),该框架分别在频域和时域上提取序列的全局趋势与局部动态特征,并引入稀疏机制过滤无效信息,从而提升跨变量依赖关系建模的准确性。
链接: https://arxiv.org/abs/2602.23581
作者: Xiang Ao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures
Abstract:Multivariate time series forecasting is widely applied in fields such as transportation, energy, and finance. However, the data commonly suffers from issues of multi-scale characteristics, weak correlations, and noise interference, which limit the predictive performance of existing models. This paper proposes a dual-stream sparse Mixer prediction framework that extracts global trends and local dynamic features from sequences in both the frequency and time domains, respectively. It employs a sparsity mechanism to filter out invalid information, thereby enhancing the accuracy of cross-variable dependency modeling. Experimental results demonstrate that this method achieves leading performance on multiple real-world scenario datasets, validating its effectiveness and generality. The code is available at this https URL
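「频域全局趋势 + 时域局部残差 + 稀疏过滤」的双流思想可作如下示意。仅保留 k 个幅值最大的 rFFT 系数,是对论文中可学习稀疏混合的一个假设性替身:

```python
import numpy as np

def dual_stream_split(x, k=3):
    """把序列拆成全局趋势(仅保留 k 个主频)与局部动态(残差)。"""
    spec = np.fft.rfft(x)
    sparse = np.zeros_like(spec)
    idx = np.argsort(np.abs(spec))[-k:]   # 稀疏机制:只保留主导频率
    sparse[idx] = spec[idx]
    trend = np.fft.irfft(sparse, n=len(x))
    return trend, x - trend

t = np.linspace(0, 4 * np.pi, 128)
x = np.sin(t) + 0.1 * np.random.default_rng(1).normal(size=128)
trend, local = dual_stream_split(x)
```

两路之和严格还原原序列,趋势流承载主要能量,残差流留给局部动态与噪声过滤。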
[AI-55] Construct Merge Solve & Adapt with Reinforcement Learning for the min-max Multiple Traveling Salesman Problem
【速读】:该论文旨在解决对称单 depot 最大最小多旅行商问题(min-max mTSP),其目标是通过优化多个销售员的路径分配,使最长路径的长度最小化,从而实现工作负载均衡。解决方案的关键在于提出了一种混合方法 RL-CMSA(Construct, Merge, Solve & Adapt with Reinforcement Learning),该方法融合了强化学习与精确优化:首先利用基于学习到的成对 q 值的概率聚类进行多样化路径构造;随后通过合并策略将路径压缩至紧凑池中,并求解受限集合覆盖混合整数线性规划(MILP)模型;最后采用跨路径删除、移动和交换操作对解进行精细化调整。其中,q 值通过强化高质解中城市对共现行为来更新,同时结合老化与剪枝机制动态维护解池,有效平衡探索与利用,显著提升大规模实例下的求解性能。
链接: https://arxiv.org/abs/2602.23579
作者: Guillem Rodríguez-Corominas,Maria J. Blesa,Christian Blum
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The Multiple Traveling Salesman Problem (mTSP) extends the Traveling Salesman Problem to m tours that start and end at a common depot and jointly visit all customers exactly once. In the min-max variant, the objective is to minimize the longest tour, reflecting workload balance. We propose a hybrid approach, Construct, Merge, Solve & Adapt with Reinforcement Learning (RL-CMSA), for the symmetric single-depot min-max mTSP. The method iteratively constructs diverse solutions using probabilistic clustering guided by learned pairwise q-values, merges routes into a compact pool, solves a restricted set-covering MILP, and refines solutions via inter-route remove, shift, and swap moves. The q-values are updated by reinforcing city-pair co-occurrences in high-quality solutions, while the pool is adapted through ageing and pruning. This combination of exact optimization and reinforcement-guided construction balances exploration and exploitation. Computational results on random and TSPLIB instances show that RL-CMSA consistently finds (near-)best solutions and outperforms a state-of-the-art hybrid genetic algorithm under comparable time limits, especially as instance size and the number of salesmen increase.
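对高质量解中城市对共现的强化,可示意为对成对 q 值做指数滑动平均式更新。学习率、奖励信号与「同一条路径内取对」的范围均为示意性假设,并非论文的精确规则:

```python
def update_q(q, solution_tours, lr=0.1, reward=1.0):
    """将精英解中同一条路径内共现的城市对的 q 值向奖励方向推进;
    未共现的城市对保持不变。"""
    for tour in solution_tours:
        for i in range(len(tour)):
            for j in range(i + 1, len(tour)):
                key = tuple(sorted((tour[i], tour[j])))
                old = q.get(key, 0.0)
                q[key] = old + lr * (reward - old)
    return q

# 一个精英 min-max mTSP 解的两条路径(省略 depot)。
q = update_q({}, [[1, 2, 3], [4, 5]])
```

此后构造阶段的概率聚类可偏向 q 值高的城市对,使同一簇更可能落入同一条路径。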
[AI-56] Flowette: Flow Matching with Graphette Priors for Graph Generation
【速读】:该论文旨在解决复杂图结构数据的生成建模问题,特别是如何有效捕捉图中反复出现的子图模式(subgraph motifs)及其长程结构依赖关系。其解决方案的关键在于提出Flowette框架——一个基于连续流匹配(continuous flow matching)的生成模型,通过图神经网络(Graph Neural Network, GNN)驱动的Transformer学习定义在包含节点和边属性的图表示上的速度场(velocity field),并利用最优传输(optimal transport)实现拓扑保持的耦合机制,同时引入正则化策略以建模长程结构依赖。此外,为嵌入领域驱动的结构先验,作者创新性地提出了graphettes,一种基于图子结构编辑(如环、星形和树状结构)的新型概率图结构模型,从而将图论中的图子结构先验(graphons)推广至可控制的子图生成场景。理论分析与实验表明,该方法在合成图和小分子图生成任务上均展现出稳定提升,验证了结构先验与流式训练结合的有效性。
链接: https://arxiv.org/abs/2602.23566
作者: Asiri Wijesinghe,Sevvandi Kandanaarachchi,Daniel M. Steinberg,Cheng Soon Ong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 37 Pages
Abstract:We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework, that employs a graph neural network based transformer to learn a velocity field defined over graph representations with node and edge attributes. Our model preserves topology through optimal transport based coupling, and long-range structural dependencies through regularisation. To incorporate domain driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs like rings, stars and trees. We theoretically analyze the coupling, invariance, and structural properties of the proposed framework, and empirically evaluate it on synthetic and small-molecule graph generation tasks. Flowette demonstrates consistent improvements, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.
[AI-57] Planning under Distribution Shifts with Causal POMDPs ICAPS-26
【速读】:该论文旨在解决在环境分布发生偏移(distribution shifts)时,传统规划方法因模型失效而导致策略失效的问题。其核心挑战在于:当状态分布或环境动态发生变化时,基于先前数据学习的策略难以适应新场景。解决方案的关键在于构建一个基于因果知识的POMDP(部分可观测马尔可夫决策过程)框架,将环境变化建模为对因果POMDP的干预(intervention),从而能够在假设性变化下评估计划并识别被改变的环境组件。作者进一步提出在扩展的信念空间中维护和更新关于潜在状态与底层领域结构的联合信念,并证明价值函数在该空间中仍保持分段线性凸(piecewise linear and convex, PWLC)性质,这保证了使用α-向量方法进行规划的计算可行性。
链接: https://arxiv.org/abs/2602.23545
作者: Matteo Ceriscioli,Karthika Mohan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)
Abstract:In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via α-vector-based POMDP methods.
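PWLC 意味着价值函数是有限个关于信念的线性函数的上包络,这正是 α-向量方法可用的原因。下面是一个两状态的示意,α-向量数值为虚构:

```python
def value(belief, alpha_vectors):
    """PWLC 价值函数:在所有 alpha 向量与信念的内积中取最大值
    (此处信念是两个潜在状态上的分布)。"""
    return max(sum(a * b for a, b in zip(alpha, belief)) for alpha in alpha_vectors)

# 每个候选条件计划对应一个 alpha 向量。
alphas = [(1.0, 0.0), (0.2, 0.8)]
v_certain = value((1.0, 0.0), alphas)  # 确信处于状态 0:第一个计划占优
v_shifted = value((0.3, 0.7), alphas)  # 信念偏移后:第二个计划胜出
```

论文的结论即:在「潜在状态 × 底层领域」的增广信念空间中,这一上包络结构在分布偏移下依然成立。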
[AI-58] Causal Identification from Counterfactual Data: Completeness and Bounding Results
【速读】:该论文旨在解决在因果推理中,如何从可物理实现的反事实分布(Layer 3 of Pearl’s Causal Hierarchy)中识别出更多反事实量的问题。此前,由于普遍认为无法获取反事实数据,相关研究仅限于观测或干预分布(Layers 1 和 2)。本文通过引入“反事实可实现性”(counterfactual realizability)的概念,证明了某些反事实分布可通过实验方法直接估计,从而突破了传统限制。解决方案的关键在于提出CTFIDU+算法,该算法能够从任意一组Layer 3分布中识别反事实查询,并被证明是完备的;在此基础上进一步揭示了在非参数设定下,哪些反事实量可以被精确识别,即确定了精确因果推断的根本理论极限。对于无法识别的反事实量,论文还推导出基于可实现反事实数据的新分析边界,并通过模拟验证其在实践中能有效收紧不确定度。
链接: https://arxiv.org/abs/2602.23541
作者: Arvind Raghavan,Elias Bareinboim
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Previous work establishing completeness results for counterfactual identification has been circumscribed to the setting where the input data belongs to observational or interventional distributions (Layers 1 and 2 of Pearl’s Causal Hierarchy), since it was generally presumed impossible to obtain data from counterfactual distributions, which belong to Layer 3. However, recent work (Raghavan & Bareinboim, 2025) has formally characterized a family of counterfactual distributions which can be directly estimated via experimental methods - a notion they call counterfactual realizability. This leaves open the question of what additional counterfactual quantities now become identifiable, given this new access to (some) Layer 3 data. To answer this question, we develop the CTFIDU+ algorithm for identifying counterfactual queries from an arbitrary set of Layer 3 distributions, and prove that it is complete for this task. Building on this, we establish the theoretical limit of which counterfactuals can be identified from physically realizable distributions, thus implying the fundamental limit to exact causal inference in the non-parametric setting. Finally, given the impossibility of identifying certain critical types of counterfactuals, we derive novel analytic bounds for such quantities using realizable counterfactual data, and corroborate using simulations that counterfactual data helps tighten the bounds for non-identifiable quantities in practice.
[AI-59] FedDAG: Clustered Federated Learning via Global Data and Gradient Integration for Heterogeneous Environments ICLR2026
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在客户端数据异构性(data heterogeneity)下性能下降的问题,以及现有聚类联邦学习(clustered FL)方法仅依赖单一相似性度量(数据或梯度)导致的聚类不充分、跨集群知识共享受限的问题。其解决方案的关键在于提出FedDAG框架:首先设计一种加权类别级相似性度量,融合数据分布与梯度信息以实现更全面的客户端相似性评估;其次引入双编码器架构,主编码器保留集群特异性表征,辅编码器通过来自互补集群的梯度进行优化,从而实现跨集群特征迁移与本地化专业化并存。
链接: https://arxiv.org/abs/2602.23504
作者: Anik Pramanik,Murat Kantarcioglu,Vincent Oria,Shantanu Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: This paper has been accepted in ICLR 2026
Abstract:Federated Learning (FL) enables a group of clients to collaboratively train a model without sharing individual data, but its performance drops when client data are heterogeneous. Clustered FL tackles this by grouping similar clients. However, existing clustered FL approaches rely solely on either data similarity or gradient similarity, resulting in an incomplete assessment of client similarities. Prior clustered FL approaches also restrict knowledge and representation sharing to clients within the same cluster. This prevents cluster models from benefiting from the diverse client population across clusters. To address these limitations, we propose FedDAG, a clustered FL framework that employs a weighted, class-wise similarity metric integrating both data and gradient information, providing a more holistic measure of similarity during clustering. In addition, FedDAG adopts a dual-encoder architecture for cluster models, comprising a primary encoder trained on its own clients’ data and a secondary encoder refined using gradients from complementary clusters. This enables cross-cluster feature transfer while preserving cluster-specific specialization. Experiments on diverse benchmarks and data heterogeneity settings show that FedDAG consistently outperforms state-of-the-art clustered FL baselines in accuracy.
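加权的类别级相似性度量可示意为数据侧与梯度侧余弦相似度的线性混合。类别直方图输入与 0.5 的混合权重均为示意性假设,并非论文的实际标定:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def client_similarity(data_i, data_j, grad_i, grad_j, w=0.5):
    """混合标签分布相似度与梯度相似度;w 为假设的混合权重。"""
    return w * cosine(data_i, data_j) + (1 - w) * cosine(grad_i, grad_j)

# 两个客户端:标签直方图接近,梯度方向也较为一致。
sim = client_similarity([10, 0, 5], [8, 0, 6], [0.2, -0.1], [0.25, -0.05])
```

相较只看单一信号,这种混合度量能区分「数据相似但优化方向相左」或反之的客户端对,再据此聚类。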
[AI-60] aCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
【速读】:该论文旨在解决当前自动驾驶研究中数据集存在的关键缺陷:现有数据集通常在感知与规划任务之间存在割裂,且缺乏行为多样性及闭环评估能力。具体而言,感知类数据集往往缺少规划信息,而规划类数据集则多为单一前进场景,难以支撑复杂驾驶行为建模;同时,多数真实数据集无法进行闭环评估,限制了模型性能的准确衡量。解决方案的关键在于构建一个面向端到端自动驾驶研究的高质量、多样化数据集——基于CARLA Leaderboard 2.0挑战场景采集超过285万帧图像数据,覆盖动态物体检测、车道线检测、中心线识别、交通灯识别、预测任务以及视觉语言动作模型等多模态任务,并提供数值化稀有度评分以量化场景罕见性,从而支持开放环和闭环两种评估范式,显著提升模型训练与评测的全面性与实用性。
链接: https://arxiv.org/abs/2602.23499
作者: Tugrul Gorgulu,Atakan Dag,M. Esat Kalfaoglu,Halil Ibrahim Kuru,Baris Can Cam,Ozsel Kilinc
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render the entire dataset unusable. Autonomous driving challenges remain a prominent area of research, requiring further exploration to enhance the perception and planning performance of vehicles. However, existing datasets are often incomplete. For instance, datasets that include perception information generally lack planning data, while planning datasets typically consist of extensive driving sequences where the ego vehicle predominantly drives forward, offering limited behavioral diversity. In addition, many real datasets struggle to evaluate their models, especially for planning tasks, since they lack a proper closed-loop evaluation setup. The CARLA Leaderboard 2.0 challenge, which provides a diverse set of scenarios to address the long-tail problem in autonomous driving, has emerged as a valuable alternative platform for developing perception and planning models in both open-loop and closed-loop evaluation setups. Nevertheless, existing datasets collected on this platform present certain limitations; some appear to be tailored to a single, particular sensor configuration. To support end-to-end autonomous driving research, we have collected a new dataset comprising over 2.85 million frames using the CARLA simulation environment for the diverse Leaderboard 2.0 challenge scenarios. Our dataset is designed not only for planning tasks but also supports dynamic object detection, lane divider detection, centerline detection, traffic light recognition, prediction tasks, and visual language action models. Furthermore, we demonstrate its versatility by training various models using our dataset. Moreover, we also provide numerical rarity scores to understand how rarely the current state occurs in the dataset.
[AI-61] BiKA: Kolmogorov-Arnold-Network-inspired Ultra Lightweight Neural Network Hardware Accelerator
【速读】:该论文旨在解决边缘设备上轻量级神经网络加速器的硬件资源受限与能效瓶颈问题。现有方法如量化(quantization)和二值化(binarization)虽可降低硬件成本,但仍依赖传统的人工神经网络(Artificial Neural Network, ANN)计算范式,难以进一步优化资源效率。其解决方案的关键在于提出BiKA架构:一种受Kolmogorov-Arnold Network(KAN)启发的无乘法(multiply-free)计算结构,将非线性函数替换为可学习的二值阈值(binary, learnable thresholds),从而仅需比较器(comparators)和累加器(accumulators)即可完成核心运算,显著降低硬件复杂度。FPGA原型验证表明,BiKA相较二值化和量化神经网络脉动阵列加速器分别减少27.73%和51.54%的资源占用,同时保持竞争力的精度,为边缘设备上的硬件友好型神经网络设计提供了新路径。
链接: https://arxiv.org/abs/2602.23455
作者: Yuhao Liu,Salim Ullah,Akash Kumar
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:
Abstract:Lightweight neural network accelerators are essential for edge devices with limited resources and power constraints. While quantization and binarization can efficiently reduce hardware cost, they still rely on the conventional Artificial Neural Network (ANN) computation pattern. The recently proposed Kolmogorov-Arnold Network (KAN) presents a novel network paradigm built on learnable nonlinear functions. However, it is computationally expensive for hardware deployment. Inspired by KAN, we propose BiKA, a multiply-free architecture that replaces nonlinear functions with binary, learnable thresholds, introducing an extremely lightweight computational pattern that requires only comparators and accumulators. Our FPGA prototype on Ultra96-V2 shows that BiKA reduces hardware resource usage by 27.73% and 51.54% compared with binarized and quantized neural network systolic array accelerators, while maintaining competitive accuracy. BiKA provides a promising direction for hardware-friendly neural network design on edge devices.
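下面用一个纯 Python 玩具示例说明摘要中"仅需比较器与累加器"的计算模式:每个输入与一个可学习阈值比较(比较器),二值比较结果计数累加(累加器),计数再与输出阈值比较得到二值激活。其中函数名与阈值均为假设性示意,并非论文的实际硬件实现。

```python
def bika_neuron(x, thresholds, fire_count):
    """无乘法神经元示意:逐元素比较 + 计数累加 + 输出阈值比较。"""
    # 比较器:每个输入与对应的可学习阈值比较,得到 0/1
    # 累加器:统计通过比较的输入个数
    count = sum(1 for xi, ti in zip(x, thresholds) if xi > ti)
    # 输出比较器:计数达到 fire_count 则激活
    return 1 if count >= fire_count else 0

# 玩具示例:4 个输入,统一阈值 0.5,至少 2 个输入超过阈值则输出 1
print(bika_neuron([0.9, 0.2, 0.7, 0.1], [0.5, 0.5, 0.5, 0.5], 2))  # → 1
```

整个前向过程只涉及比较与加法,这也直观解释了其硬件开销远低于常规乘加阵列的原因。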
[AI-62] Human Supervision as an Information Bottleneck: A Unified Theory of Error Floors in Human-Guided Learning
【速读】:该论文旨在解决大语言模型在人类监督下仍存在持续性误差的问题,这些误差源于标注噪声(annotation noise)、主观偏好偏差(preference distortion)以及自然语言表达带宽有限导致的语义压缩(semantic compression)。其解决方案的关键在于提出并形式化了“人类受限智能”(Human-Bounded Intelligence)极限理论,证明当人类监督通道不足以充分反映潜在评估目标时,该通道本质上是一个信息减少型信道,从而对任何依赖它的学习器施加一个严格为正的过风险下界(excess-risk floor)。这一理论通过六种互补框架(算子理论、PAC-Bayes、信息论、因果推断、范畴论及强化学习中博弈论分析)统一揭示了该下界的结构分解,并指出仅靠模型规模扩展无法消除此类误差;而引入非人类辅助信号(如检索、程序执行或工具调用)可提升有效监督容量,恢复关于潜在目标的信息,进而消除或显著降低该过风险下界。
链接: https://arxiv.org/abs/2602.23446
作者: Alejandro Rodriguez Dominguez
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceedings from IEEE CAI 2026, Conference on Artificial Intelligence, 8-10 May, Granada, Spain. 8 Pages, 3 Figures, 7 Tables
Abstract:Large language models are trained primarily on human-generated data and feedback, yet they exhibit persistent errors arising from annotation noise, subjective preferences, and the limited expressive bandwidth of natural language. We argue that these limitations reflect structural properties of the supervision channel rather than model scale or optimization. We develop a unified theory showing that whenever the human supervision channel is not sufficient for a latent evaluation target, it acts as an information-reducing channel that induces a strictly positive excess-risk floor for any learner dominated by it. We formalize this Human-Bounded Intelligence limit and show that across six complementary frameworks (operator theory, PAC-Bayes, information theory, causal inference, category theory, and game-theoretic analyses of reinforcement learning from human feedback), non-sufficiency yields strictly positive lower bounds arising from the same structural decomposition into annotation noise, preference distortion, and semantic compression. The theory explains why scaling alone cannot eliminate persistent human-aligned errors and characterizes conditions under which auxiliary non-human signals (e.g., retrieval, program execution, tools) increase effective supervision capacity and collapse the floor by restoring information about the latent target. Experiments on real preference data, synthetic known-target tasks, and externally verifiable benchmarks confirm the predicted structural signatures: human-only supervision exhibits a persistent floor, while sufficiently informative auxiliary channels strictly reduce or eliminate excess error.
[AI-63] Brain-OF: An Omnifunctional Foundation Model for fMRI EEG and MEG
【速读】:该论文旨在解决当前脑科学基础模型普遍局限于单一功能模态(如仅基于fMRI、EEG或MEG)的问题,从而限制了对不同成像技术间互补时空动态特性和大规模数据协同利用的能力。其关键解决方案是提出Brain-OF,首个联合预训练于fMRI、EEG和MEG的全功能脑基础模型,通过引入“任意分辨率神经信号采样器”(Any-Resolution Neural Signal Sampler)实现多模态信号在统一语义空间中的映射,并采用DINT注意力机制与稀疏专家混合(Sparse Mixture of Experts)结构,使共享专家捕捉模态不变特征、路由专家聚焦模态特定语义;同时设计掩码时间-频率建模(Masked Temporal-Frequency Modeling)作为双域预训练目标,联合重建时间域与频域中的脑信号,显著提升了跨模态任务性能与泛化能力。
链接: https://arxiv.org/abs/2602.23410
作者: Hanning Guo,Farah Abdellatif,Hanwen Bi,Andrei Galbenus,Jon. N. Shah,Abigail Morrison,Jürgen Dammers
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Brain foundation models have achieved remarkable advances across a wide range of neuroscience tasks. However, most existing models are limited to a single functional modality, restricting their ability to exploit complementary spatiotemporal dynamics and the collective data scale across imaging techniques. To address this limitation, we propose Brain-OF, the first omnifunctional brain foundation model jointly pretrained on fMRI, EEG and MEG, capable of handling both unimodal and multimodal inputs within a unified framework. To reconcile heterogeneous spatiotemporal resolutions, we introduce the Any-Resolution Neural Signal Sampler, which projects diverse brain signals into a shared semantic space. To further manage semantic shifts, the Brain-OF backbone integrates DINT attention with a Sparse Mixture of Experts, where shared experts capture modality-invariant representations and routed experts specialize in modality-specific semantics. Furthermore, we propose Masked Temporal-Frequency Modeling, a dual-domain pretraining objective that jointly reconstructs brain signals in both the time and frequency domains. Brain-OF is pretrained on a large-scale corpus comprising around 40 datasets and demonstrates superior performance across diverse downstream tasks, highlighting the benefits of joint multimodal integration and dual-domain pretraining.
[AI-64] Long Range Frequency Tuning for QML
【速读】:该论文旨在解决量子机器学习模型中频率编码的可训练性问题,特别是针对可训练频率方法在优化过程中难以达到目标频率范围的问题。其关键解决方案是引入基于三进制编码(ternary encoding)的网格初始化策略,通过生成密集的整数频率谱,确保目标频率位于局部可训练范围内,从而显著提升模型性能;相较于传统固定频率或可训练频率方法,该方案虽略微增加编码门数量(O(log₃(ω_max))),但能有效克服频率可达性限制,在合成数据和真实世界航班乘客数据集上分别实现R²分数从0.1841提升至0.9969和从0.7876提升至0.9671。
链接: https://arxiv.org/abs/2602.23409
作者: Michael Poppel,Jonas Stein,Sebastian Wölckert,Markus Baumann,Claudia Linnhoff-Popien
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
备注:
Abstract:Quantum machine learning models using angle encoding naturally represent truncated Fourier series, providing universal function approximation capabilities with sufficient circuit depth. For unary fixed-frequency encodings, circuit depth scales as O(omega_max * (omega_max + epsilon^-2)) with target frequency magnitude omega_max and precision epsilon. Trainable-frequency approaches theoretically reduce this to match the target spectrum size, requiring only as many encoding gates as frequencies in the target spectrum. Despite this compelling efficiency, their practical effectiveness hinges on a key assumption: that gradient-based optimization can drive prefactors to arbitrary target values. We demonstrate through systematic experiments that frequency prefactors exhibit limited trainability: movement is constrained to approximately +/-1 units with typical learning rates. When target frequencies lie outside this reachable range, optimization frequently fails. To overcome this frequency reachability limitation, we propose grid-based initialization using ternary encodings, which generate dense integer frequency spectra. While this approach requires O(log_3(omega_max)) encoding gates – more than the theoretical optimum but exponentially fewer than fixed-frequency methods – it ensures target frequencies lie within the locally reachable range. On synthetic targets with three shifted high frequencies, ternary grid initialization achieves a median R^2 score of 0.9969, compared to 0.1841 for the trainable-frequency baseline. For the real-world Flight Passengers dataset, ternary grid initialization achieves a median R^2 score of 0.9671, representing a 22.8% improvement over trainable-frequency initialization (median R^2 = 0.7876).
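摘要中"三进制编码生成密集整数频谱"可以用一个小脚本示意:假设每个编码门的预因子为 3^j,且每个门对总频率贡献平衡三进制系数 {-1, 0, +1},则 k 个门即可覆盖 [-(3^k-1)/2, (3^k-1)/2] 内的全部整数频率。以下实现仅为基于该假设的示意,并非论文源码。

```python
from itertools import product

def ternary_spectrum(num_gates):
    """列出平衡三进制预因子 (1, 3, 9, ...) 可组合出的全部整数频率。"""
    prefactors = [3 ** j for j in range(num_gates)]
    freqs = {
        sum(c * p for c, p in zip(coeffs, prefactors))
        for coeffs in product((-1, 0, 1), repeat=num_gates)
    }
    return sorted(freqs)

spec = ternary_spectrum(3)
# 3 个门即覆盖 [-13, 13] 的所有 27 个整数频率
print(len(spec), min(spec), max(spec))  # → 27 -13 13
```

由此可见,覆盖到 ω_max 大约只需 log₃(ω_max) 个编码门,与摘要给出的 O(log_3(omega_max)) 门数一致。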
[AI-65] Learning to Generate Secure Code via Token-Level Rewards
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在代码生成过程中易引入安全漏洞的问题,其核心挑战在于高质量安全数据稀缺以及传统强化学习中奖励信号粒度粗泛。解决方案的关键在于提出两个创新模块:一是通过LLM自我反思机制构建高置信度的漏洞修复对,并生成多样化的隐式提示以建立PrimeVul+数据集,从而提升训练数据的质量与多样性;二是设计SRCode训练框架,首次在代码安全领域引入token级奖励机制,使模型能够在训练过程中持续关注并强化关键细粒度的安全模式,实现对局部安全实现的精准优化。
链接: https://arxiv.org/abs/2602.23407
作者: Jiazheng Quan,Xiaodong Li,Bin Wang,Guo An,Like Liu,Degen Huang,Lin Liu,Chengbin Hou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 18 pages, 3 figures
Abstract:Large language models (LLMs) have demonstrated strong capabilities in code generation, yet they remain prone to producing security vulnerabilities. Existing approaches commonly suffer from two key limitations: the scarcity of high-quality security data and coarse-grained reinforcement learning reward signals. To address these challenges, we propose Vul2Safe, a new secure code generation framework that leverages LLM self-reflection to construct high-confidence repair pairs from real-world vulnerabilities, and further generates diverse implicit prompts to build the PrimeVul+ dataset. Meanwhile, we introduce SRCode, a novel training framework that pioneers the use of token-level rewards in reinforcement learning for code security, which enables the model to continuously attend to and reinforce critical fine-grained security patterns during training. Compared with traditional instance-level reward schemes, our approach allows for more precise optimization of local security implementations. Extensive experiments show that PrimeVul+ and SRCode substantially reduce security vulnerabilities in generated code while improving overall code quality across multiple benchmarks.
[AI-66] Uncovering sustainable personal care ingredient combinations using scientific modelling
【速读】:该论文旨在解决个人护理产品中合成非生物降解成分(如硅油和矿物油)因欧盟法规限制及消费者对可持续性要求提升而面临替代难题的问题,同时确保新配方在性能和成本上不逊于原有合成成分。解决方案的关键在于提出一种基于预测建模与仿真驱动的数字服务平台,通过算法推荐天然基成分组合来替代常用合成原料,并以具体配方应用验证其有效性,从而帮助配方师快速筛选高性能、环境友好的替代方案,推动行业向可持续方向实质性转型。
链接: https://arxiv.org/abs/2602.23887
作者: Sandip Bhattacharya,Vanessa da Silva,Christina Kohlmann
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注: Paper submitted and part of 35th IFSCC Congress, Brazil, 14-17 October 2024
Abstract:Personal care formulations often contain synthetic and non-biodegradable ingredients, such as silicone and mineral oils, which can offer a unique performance. However, due to regulations like the EU ban of Octamethylcyclotetrasiloxane (D4), Decamethyl-cyclopentasiloxane (D5), Dodecamethylcyclohexasiloxane (D6) already in effect for rinse off and for leave on cosmetics by June 2027 coupled with growing consumer awareness and expectations on sustainability, personal care brands face significant pressure to replace these synthetic ingredients with natural alternatives without compromising performance and cost. As a result, formulators are confronted with the challenge to find natural-based solutions within a short timeframe. In this study, we propose a pioneering approach that utilizes predictive modelling and simulation-based digital services to obtain natural-based ingredient combinations as recommendations to commonly used synthetic ingredients. We will demonstrate the effectiveness of our predictions through the application of these proposals in specific formulations. By offering a platform of digital services, it is aimed to empower formulators to explore good performing novel and environmentally friendly alternatives, ultimately driving a substantial and genuine transformation in the personal care industry.
[AI-67] Operationalizing Longitudinal Causal Discovery Under Real-World Workflow Constraints
【速读】:该论文旨在解决纵向因果发现(longitudinal causal discovery)在大规模真实世界数据中应用受限的问题,尤其是由于操作性数据生成过程受机构工作流程(institutional workflows)影响,而这些流程所诱导的偏序关系(partial orders)通常未被形式化,导致可接受的有向无环图(DAG)空间扩大,与实际记录机制不一致。其解决方案的关键在于提出一种由工作流程诱导的约束类(workflow-induced constraint class),通过协议推导的结构掩码(structural masks)和时间对齐索引(timeline-aligned indexing)来限制DAG空间,从而减少结构歧义,尤其在混合离散-连续面板数据中提升时间点内方向识别能力。该框架整合了工作流程衍生的可行边约束、测量对齐的时间索引、块结构(block structure)、基于Bootstrap的滞后总效应不确定性量化以及支持干预查询的动态表示,实现了时序一致的子结构和可解释的滞后效应估计。
链接: https://arxiv.org/abs/2602.23800
作者: Tadahisa Okuda,Shohei Shimizu,Thong Pham,Tatsuyoshi Ikenoue,Shingo Fukuma
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Causal discovery has achieved substantial theoretical progress, yet its deployment in large-scale longitudinal systems remains limited. A key obstacle is that operational data are generated under institutional workflows whose induced partial orders are rarely formalized, enlarging the admissible graph space in ways inconsistent with the recording process. We characterize a workflow-induced constraint class for longitudinal causal discovery that restricts the admissible directed acyclic graph space through protocol-derived structural masks and timeline-aligned indexing. Rather than introducing a new optimization algorithm, we show that explicitly encoding workflow-consistent partial orders reduces structural ambiguity, especially in mixed discrete–continuous panels where within-time orientation is weakly identified. The framework combines workflow-derived admissible-edge constraints, measurement-aligned time indexing and block structure, bootstrap-based uncertainty quantification for lagged total effects, and a dynamic representation supporting intervention queries. In a nationwide annual health screening cohort in Japan with 107,261 individuals and 429,044 person-years, workflow-constrained longitudinal LiNGAM yields temporally consistent within-time substructures and interpretable lagged total effects with explicit uncertainty. Sensitivity analyses using alternative exposure and body-composition definitions preserve the main qualitative patterns. We argue that formalizing workflow-derived constraint classes improves structural interpretability without relying on domain-specific edge specification, providing a reproducible bridge between operational workflows and longitudinal causal discovery under standard identifiability assumptions.
[AI-68] ReDON: Recurrent Diffractive Optical Neural Processor with Reconfigurable Self-Modulated Nonlinearity
【速读】:该论文旨在解决传统衍射光学神经网络(Diffractive Optical Neural Networks, DONNs)在计算表达能力上的局限性,即其静态、被动的衍射相位掩模缺乏高效的非线性响应和可重编程性。为此,作者提出了一种新型架构——循环衍射光学神经处理器(Recurrent Diffractive Optical Neural Processor, ReDON),其核心创新在于引入了可重构的、循环自调制非线性机制。该机制通过原位电光自调制实现输入依赖的动态光学传输,利用轻量级参数化函数感知传播光场并调节相位或强度,从而以极低推理开销实现有效非线性。ReDON作为非冯·诺依曼架构,在主权重元件(超表面)保持固定的前提下,借助循环光学硬件复用与可调非线性显著提升了传统DONNs的非线性表征能力和任务适应性。
链接: https://arxiv.org/abs/2602.23616
作者: Ziang Yin,Qi Jing,Raktim Sarma,Rena Huang,Yu Yao,Jiaqi Gu
机构: 未知
类目: Optics (physics.optics); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 18 pages
Abstract:Diffractive optical neural networks (DONNs) have demonstrated unparalleled energy efficiency and parallelism by processing information directly in the optical domain. However, their computational expressivity is constrained by static, passive diffractive phase masks that lack efficient nonlinear responses and reprogrammability. To address these limitations, we introduce the Recurrent Diffractive Optical Neural Processor (ReDON), a novel architecture featuring reconfigurable, recurrent self-modulated nonlinearity. This mechanism enables dynamic, input-dependent optical transmission through in-situ electro-optic self-modulation, providing a highly efficient and reprogrammable approach to optical computation. Inspired by the gated linear unit (GLU) used in large language models, ReDON senses a fraction of the propagating optical field and modulates its phase or intensity via a lightweight parametric function, enabling effective nonlinearity with minimal inference overhead. As a non-von Neumann architecture in which the primary weighting elements (metasurfaces) remain fixed, ReDON substantially extends the nonlinear representational capacity and task adaptability of conventional DONNs through recurrent optical hardware reuse and dynamically tunable nonlinearity. We systematically investigate various self-modulation configurations to characterize the trade-offs between hardware efficiency and computational expressivity. On image recognition and segmentation benchmarks, ReDON improves test accuracy and mean intersection-over-union (mIoU) by up to 20% compared with prior DONNs employing either optical or digital nonlinearities at comparable model complexity and negligible additional power consumption. This work establishes a new paradigm for reconfigurable nonlinear optical computing, uniting recurrence and self-modulation within non-von Neumann analog processors.
[AI-69] Let There Be Claws: An Early Social Network Analysis of AI Agents on Moltbook
【速读】:该论文旨在解决生成式 AI (Generative AI) 代理生态系统中社会结构如何快速形成的问题,特别是关注在极短时间内(12天)是否能观察到类似人类社交平台中的层级分化、角色分工与注意力集中等现象。其解决方案的关键在于对 Moltbook 平台在 2026 年 1 月 28 日至 2 月 8 日间公开数据的系统分析,包括构建共参与图和定向评论图,量化互动对称性(reciprocity)、社区结构、中心性指标(如 HITS 谱中心性),并结合嵌入式主题建模识别内容主题。结果表明,代理间的交互高度不对称( reciprocity ~1%),中心性清晰分离为枢纽(hub)与权威(authority)角色,且注意力分配远比内容生产更不平等(upvote Gini = 0.992 vs. posting Gini = 0.601),揭示出在代理面向平台中,熟悉的等级制、放大效应和角色分化可在压缩时间尺度上迅速涌现。
链接: https://arxiv.org/abs/2602.20044
作者: H.C.W. Price,H. AlMuhanna,P.M. Bassani,M. Ho,T.S. Evans
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
Abstract:Within twelve days of launch, an AI-native social platform exhibits extreme attention concentration, hierarchical role separation, and one-way attention flow, consistent with the hypothesis that stratification in agent ecosystems can emerge rapidly rather than gradually. We analyse publicly observable traces from a 12-day window of Moltbook (28 January – 8 February 2026), comprising 20,040 posts and 192,410 comments from 15,083 accounts across 759 submolts. We construct co-participation and directed-comment graphs and report reciprocity, community structure, and centrality, alongside descriptive content themes. Under a commenter–post-author tie definition, interaction is strongly asymmetric (reciprocity ~1%), and HITS centrality separates cleanly into hub and authority roles, consistent with broadcast-style attention rather than mutual exchange. Engagement is highly unequal: attention is far more concentrated than production (upvote Gini = 0.992 vs. posting Gini = 0.601), and early-arriving accounts accumulate substantially higher cumulative upvotes prior to exposure-time correction, suggesting rich-get-richer dynamics. Participation is brief and bursty (median observed lifespan 2.48 minutes; 54.8% of posts occur within six peak UTC hours). Embedding-based topic modelling identifies diverse thematic clusters, including technical discussion of memory and identity, onboarding messages, and formulaic token-minting content. These results provide an early structural baseline for large-scale agent–agent social interaction and suggest that familiar forms of hierarchy, amplification, and role differentiation can arise on compressed timescales in agent-facing platforms.
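摘要中用于衡量注意力集中度的 Gini 系数可按排序样本公式直接计算;下面的示例数据为假设的玩具分布,仅用于说明"少数账号获得绝大多数点赞"时 Gini 趋近 1。

```python
def gini(values):
    """基于排序样本的 Gini 系数:G = 2·Σ i·x_(i) / (n·Σx) − (n+1)/n。"""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))  # 排名 i 从 1 起计
    return (2 * cum) / (n * total) - (n + 1) / n

# 100 个账号中仅 2 个账号获得全部点赞,注意力高度集中
print(round(gini([0] * 98 + [1, 99]), 4))  # → 0.9898
```

对比:完全均匀分布(如每人 1 票)的 Gini 为 0,这正是摘要中点赞 Gini(0.992)远高于发帖 Gini(0.601)所要刻画的不对称性。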
机器学习
[LG-0] Who Guards the Guardians? The Challenges of Evaluating Identifiability of Learned Representations
链接: https://arxiv.org/abs/2602.24278
作者: Shruti Joshi,Théo Saulus,Wieland Brendel,Philippe Brouillard,Dhanya Sridhar,Patrik Reizinger
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifiability in representation learning is commonly evaluated using standard metrics (e.g., MCC, DCI, R^2) on synthetic benchmarks with known ground-truth factors. These metrics are assumed to reflect recovery up to the equivalence class guaranteed by identifiability theory. We show that this assumption holds only under specific structural conditions: each metric implicitly encodes assumptions about both the data-generating process (DGP) and the encoder. When these assumptions are violated, metrics become misspecified and can produce systematic false positives and false negatives. Such failures occur both within classical identifiability regimes and in post-hoc settings where identifiability is most needed. We introduce a taxonomy separating DGP assumptions from encoder geometry, use it to characterise the validity domains of existing metrics, and release an evaluation suite for reproducible stress testing and comparison.
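以摘要提到的 MCC(Mean Correlation Coefficient)为例,其常见定义是:在真实因子与学习到的表征维度之间做最优一一匹配后,取绝对 Pearson 相关系数的平均值。下面给出一个因子数较少时可用的暴力匹配示意(假设性实现,并非某个基准库的接口)。

```python
from itertools import permutations

def pearson(a, b):
    """两个等长序列的 Pearson 相关系数。"""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def mcc(true_factors, learned_dims):
    """真实因子与学习维度最优匹配下的平均 |Pearson| 相关(暴力枚举置换)。"""
    k = len(true_factors)
    return max(
        sum(abs(pearson(true_factors[i], learned_dims[p[i]])) for i in range(k)) / k
        for p in permutations(range(k))
    )

# 学习到的维度只是真实因子的换序、缩放与取反 → MCC ≈ 1.0
z1, z2 = [0, 1, 2, 3], [3, 1, 0, 2]
print(mcc([z1, z2], [[2 * v for v in z2], [-v for v in z1]]))
```

这个例子同时点出了正文的核心:MCC 只对置换、缩放、取符号这类变换不敏感,因此它隐含地假设了特定的等价类;当数据生成过程或编码器几何违反这些假设时,指标就会出现系统性的假阳性或假阴性。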
[LG-1] Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline
链接: https://arxiv.org/abs/2602.24262
作者: Yijiashun Qi,Yijiazhen Qi,Tanmay Wagh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps – particularly for sub-tier suppliers and firms in emerging niche markets. We propose a Web–Knowledge–Web (W→K→W) pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph’s topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a coverage estimation framework inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W→K→W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.
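摘要借用的 Chao1 物种丰富度估计量有封闭形式:Ŝ = S_obs + f1²/(2·f2),其中 f1、f2 分别是仅被观测到一次/两次的实体数。下面是一个纯 Python 示意(含 f2=0 时常用的偏差校正分支),观测序列为假设的爬取数据。

```python
from collections import Counter

def chao1(observations):
    """Chao1 估计:S_obs + f1^2 / (2 * f2);f2 = 0 时用偏差校正形式。"""
    counts = Counter(observations)
    s_obs = len(counts)                             # 已观测到的不同实体数
    f1 = sum(1 for c in counts.values() if c == 1)  # 仅出现一次(singleton)
    f2 = sum(1 for c in counts.values() if c == 2)  # 恰出现两次(doubleton)
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * f1 / (2 * f2)

# 观测到 5 家供应商:2 个 singleton(d、e)、2 个 doubleton(b、c)
# → 5 + 2²/(2·2) = 6.0,即估计总体中还有约 1 家尚未被爬到
print(chao1(["a", "a", "a", "b", "b", "c", "c", "d", "e"]))  # → 6.0
```

直觉上,singleton 越多说明"只见过一面"的实体越多,未观测实体也就越多,估计出的覆盖缺口越大,这正是该估计量用于指导后续爬取的信号来源。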
[LG-2] Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text ICASSP2026
链接: https://arxiv.org/abs/2602.24245
作者: Hainan Xu,Vladimir Bataev,Travis M. Bartley,Jagadeesh Balam
类目: Machine Learning (cs.LG)
*备注: Accepted at ICASSP 2026
Abstract:We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T’s streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36X faster training, and up to 1.69X faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks – up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T’s strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
[LG-3] me Series Foundation Models as Strong Baselines in Transportation Forecasting: A Large-Scale Benchmark Analysis
链接: https://arxiv.org/abs/2602.24238
作者: Javier Pulido,Filipe Rodrigues
类目: Machine Learning (cs.LG)
*备注: 6 pages
Abstract:Accurate forecasting of transportation dynamics is essential for urban mobility and infrastructure planning. Although recent work has achieved strong performance with deep learning models, these methods typically require dataset-specific training, architecture design and hyper-parameter tuning. This paper evaluates whether general-purpose time-series foundation models can serve as forecasters for transportation tasks by benchmarking the zero-shot performance of the state-of-the-art model, Chronos-2, across ten real-world datasets covering highway traffic volume and flow, urban traffic speed, bike-sharing demand, and electric vehicle charging station data. Under a consistent evaluation protocol, we find that, even without any task-specific fine-tuning, Chronos-2 delivers state-of-the-art or competitive accuracy across most datasets, frequently outperforming classical statistical baselines and specialized deep learning architectures, particularly at longer horizons. Beyond point forecasting, we evaluate its native probabilistic outputs using prediction-interval coverage and sharpness, demonstrating that Chronos-2 also provides useful uncertainty quantification without dataset-specific training. In general, this study supports the adoption of time-series foundation models as a key baseline for transportation forecasting research.
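摘要评估概率预测所用的两个指标可以直接写出:覆盖率(真实值落入预测区间的比例)与锐度(区间平均宽度)。以下为按此定义的示意实现,输入数据为虚构的小样本。

```python
def interval_metrics(y_true, lower, upper):
    """预测区间的经验覆盖率与锐度(平均区间宽度)。"""
    n = len(y_true)
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    sharpness = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return covered / n, sharpness

cov, sharp = interval_metrics(
    y_true=[10, 12, 15, 9],
    lower=[8, 11, 14, 10],
    upper=[12, 14, 16, 12],
)
# 4 个真值中 3 个被区间覆盖,平均区间宽 2.75
print(cov, sharp)  # → 0.75 2.75
```

理想情况下,名义 90% 的区间应有接近 0.9 的经验覆盖率,且在覆盖率达标的前提下锐度越小(区间越窄)越好,二者合在一起才能衡量不确定性量化的质量。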
[LG-4] Better Learning-Augmented Spanning Tree Algorithms via Metric Forest Completion
链接: https://arxiv.org/abs/2602.24232
作者: Nate Veldt,Thomas Stanley,Benjamin W. Priest,Trevor Steil,Keita Iwabuchi,T.S. Jayram,Grace J. Li,Geoffrey Sanders
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We present improved learning-augmented algorithms for finding an approximate minimum spanning tree (MST) for points in an arbitrary metric space. Our work follows a recent framework called metric forest completion (MFC), where the learned input is a forest that must be given additional edges to form a full spanning tree. Veldt et al. (2025) showed that optimally completing the forest takes \Omega(n^2) time, but designed a 2.62-approximation for MFC with subquadratic complexity. The same method is a (2\gamma + 1) -approximation for the original MST problem, where \gamma \geq 1 is a quality parameter for the initial forest. We introduce a generalized method that interpolates between this prior algorithm and an optimal \Omega(n^2) -time MFC algorithm. Our approach considers only edges incident to a growing number of strategically chosen ``representative’’ points. One corollary of our analysis is to improve the approximation factor of the previous algorithm from 2.62 for MFC and (2\gamma+1) for metric MST to 2 and 2\gamma respectively. We prove this is tight for worst-case instances, but we still obtain better instance-specific approximations using our generalized method. We complement our theoretical results with a thorough experimental evaluation.
[LG-5] Adaptive Combinatorial Experimental Design: Pareto Optimality for Decision-Making and Inference AISTATS2026
链接: https://arxiv.org/abs/2602.24231
作者: Hongrui Xie,Junyu Cao,Kan Xu
类目: Machine Learning (cs.LG)
*备注: 30 pages, 3 figures, AISTATS 2026 accepted paper
Abstract:In this paper, we provide the first investigation into adaptive combinatorial experimental design, focusing on the trade-off between regret minimization and statistical power in combinatorial multi-armed bandits (CMAB). While minimizing regret requires repeated exploitation of high-reward arms, accurate inference on reward gaps requires sufficient exploration of suboptimal actions. We formalize this trade-off through the concept of Pareto optimality and establish equivalent conditions for Pareto-efficient learning in CMAB. We consider two relevant cases under different information structures, i.e., full-bandit feedback and semi-bandit feedback, and propose two algorithms MixCombKL and MixCombUCB respectively for these two cases. We provide theoretical guarantees showing that both algorithms are Pareto optimal, achieving finite-time guarantees on both regret and estimation error of arm gaps. Our results further reveal that richer feedback significantly tightens the attainable Pareto frontier, with the primary gains arising from improved estimation accuracy under our proposed methods. Taken together, these findings establish a principled framework for adaptive combinatorial experimentation in multi-objective decision-making.
[LG-6] Comparing Classical and Quantum Variational Classifiers on the XOR Problem
链接: https://arxiv.org/abs/2602.24220
作者: Miras Seilkhan,Adilbek Taizhanov
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 32 pages, 17 figures. Code and experiment scripts available at this https URL
Abstract:Quantum machine learning applies principles such as superposition and entanglement to data processing and optimization. Variational quantum models operate on qubits in high-dimensional Hilbert spaces and provide an alternative approach to model expressivity. We compare classical models and a variational quantum classifier on the XOR problem. Logistic regression, a one-hidden-layer multilayer perceptron, and a two-qubit variational quantum classifier with circuit depths 1 and 2 are evaluated on synthetic XOR datasets with varying Gaussian noise and sample sizes using accuracy and binary cross-entropy. Performance is determined primarily by model expressivity. Logistic regression and the depth-1 quantum circuit fail to represent XOR reliably, whereas the multilayer perceptron and the depth-2 quantum circuit achieve perfect test accuracy under representative conditions. Robustness analyses across noise levels, dataset sizes, and random seeds confirm that circuit depth is decisive for quantum performance on this task. Despite matching accuracy, the multilayer perceptron achieves lower binary cross-entropy and substantially shorter training time. Hardware execution preserves the global XOR structure but introduces structured deviations in the decision function. Overall, deeper variational quantum classifiers can match classical neural networks in accuracy on low-dimensional XOR benchmarks, but no clear empirical advantage in robustness or efficiency is observed in the examined settings.
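摘要的核心结论是表达能力决定成败:逻辑回归(线性决策边界)无法表示 XOR,而含一个隐藏层的 MLP 可以精确表示。下面用手工设定权重的阈值单元网络示意这一点(仅演示表达能力,权重为人为构造,并非论文的训练流程)。

```python
def step(z):
    """阈值激活:z > 0 输出 1,否则输出 0。"""
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """单隐藏层阈值网络精确实现 XOR:输出 = OR(x1,x2) AND NOT AND(x1,x2)。"""
    h1 = step(x1 + x2 - 0.5)    # 隐藏单元 1:至少一个输入为 1 时激活(OR)
    h2 = step(x1 + x2 - 1.5)    # 隐藏单元 2:两个输入都为 1 时激活(AND)
    return step(h1 - h2 - 0.5)  # 输出层:两输入不同才为 1

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "→", xor_mlp(a, b))  # 依次输出 0, 1, 1, 0
```

任何单层线性模型都给不出这张真值表(XOR 线性不可分),这正是摘要中逻辑回归与深度 1 的量子线路失败、而 MLP 与深度 2 的线路成功的直观原因。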
[LG-7] he Stability of Online Algorithms in Performative Prediction
链接: https://arxiv.org/abs/2602.24207
作者: Gabriele Farina,Juan Carlos Perdomo
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注:
Abstract:The use of algorithmic predictions in decision-making leads to a feedback loop where the models we deploy actively influence the data distributions we see, and later use to retrain on. This dynamic was formalized by Perdomo et al. 2020 in their work on performative prediction. Our main result is an unconditional reduction showing that any no-regret algorithm deployed in performative settings converges to a (mixed) performatively stable equilibrium: a solution in which models actively shape data distributions in ways that their own predictions look optimal in hindsight. Prior to our work, all positive results in this area made strong restrictions on how models influenced distributions. By using a martingale argument and allowing randomization, we avoid any such assumption and sidestep recent hardness results for finding stable models. Lastly, on a more conceptual note, our connection sheds light on why common algorithms, like gradient descent, are naturally stabilizing and prevent runaway feedback loops. We hope our work enables future technical transfer of ideas between online optimization and performativity.
[LG-8] Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics
链接: https://arxiv.org/abs/2602.24201
作者: Egor Antipov,Alessandro Palma,Lorenzo Consoli,Stephan Günnemann,Andrea Dittadi,Fabian J. Theis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions and covariates. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive flow-based evaluations are computationally expensive, as they require simulating costly likelihood integrals for each distribution separately. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.
[LG-9] Multi-Objective Reinforcement Learning for Large-Scale Tote Allocation in Human-Robot Collaborative Fulfillment Centers
链接: https://arxiv.org/abs/2602.24182
作者: Sikata Sengupta,Guangyi Liu,Omer Gottesman,Joseph W Durham,Michael Kearns,Aaron Roth,Michael Caldara
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimizing the consolidation process in container-based fulfillment centers requires trading off competing objectives such as processing speed, resource usage, and space utilization while adhering to a range of real-world operational constraints. This process involves moving items between containers via a combination of human and robotic workstations to free up space for inbound inventory and increase container utilization. We formulate this problem as a large-scale Multi-Objective Reinforcement Learning (MORL) task with high-dimensional state spaces and dynamic system behavior. Our method builds on recent theoretical advances in solving constrained RL problems via best-response and no-regret dynamics in zero-sum games, enabling principled minimax policy learning. Policy evaluation on realistic warehouse simulations shows that our approach effectively trades off objectives, and we empirically observe that it learns a single policy that simultaneously satisfies all constraints, even if this is not theoretically guaranteed. We further introduce a theoretical framework to handle the problem of error cancellation, where time-averaged solutions display oscillatory behavior. This method returns a single iterate whose Lagrangian value is close to the minimax value of the game. These results demonstrate the promise of MORL in solving complex, high-impact decision-making problems in large-scale industrial systems.
[LG-10] Sandwiching Polynomials for Geometric Concepts with Low Intrinsic Dimension
链接: https://arxiv.org/abs/2602.24178
作者: Adam R. Klivans,Konstantinos Stavropoulos,Arsen Vasilyan
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注: 30 pages
Abstract:Recent work has shown the surprising power of low-degree sandwiching polynomial approximators in the context of challenging learning settings such as learning with distribution shift, testable learning, and learning with contamination. A pair of sandwiching polynomials approximate a target function in expectation while also providing pointwise upper and lower bounds on the function’s values. In this paper, we give a new method for constructing low-degree sandwiching polynomials that yield greatly improved degree bounds for several fundamental function classes and marginal distributions. In particular, we obtain degree \mathrm{poly}(k) sandwiching polynomials for functions of k halfspaces under the Gaussian distribution, improving exponentially over the prior 2^{O(k)} bound. More broadly, our approach applies to function classes that are low-dimensional and have smooth boundary. In contrast to prior work, our proof is relatively simple and directly uses the smoothness of the target function’s boundary to construct sandwiching Lipschitz functions, which are amenable to results from high-dimensional approximation theory. For low-dimensional polynomial threshold functions (PTFs) with respect to Gaussians, we obtain doubly exponential improvements without applying the FT-mollification method of Kane used in the best previous result.
[LG-11] What You Read is What You Classify: Highlighting Attributions to Text and Text-Like Inputs
链接: https://arxiv.org/abs/2602.24149
作者: Daniel S. Berman,Brian Merritt,Stanley Ta,Dana Udwin,Amanda Ernlund,Jeremy Ratcliff,Vijay Narayan
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注: 17 pages, 8 figures
Abstract:At present, there are no easily understood explainable artificial intelligence (AI) methods for discrete token inputs, like text. Most explainable AI techniques do not extend well to token sequences, where both local and global features matter, because state-of-the-art models, like transformers, tend to focus on global connections. Therefore, existing explainable AI algorithms fail by (i) identifying disparate tokens of importance, or (ii) assigning a large number of tokens a low value of importance. The proposed method for explainable AI for token-based classifiers generalizes a mask-based explainable AI algorithm for images. It starts with an Explainer neural network that is trained to create masks to hide information not relevant for classification. Then, the Hadamard product of the mask and the continuous values of the classifier’s embedding layer is taken and passed through the classifier, changing the magnitude of the embedding vector but keeping the orientation unchanged. The Explainer is trained for a taxonomic classifier for nucleotide sequences, and it is shown that the masked segments are less relevant to classification than the unmasked ones. The method focuses on the importance of the token as a whole (i.e., a segment of the input sequence), producing a human-readable explanation.
[LG-12] Learning with a Budget: Identifying the Best Arm with Resource Constraints AISTATS2024
链接: https://arxiv.org/abs/2602.24146
作者: Zitian Li,Wang Chi Cheung
类目: Machine Learning (cs.LG)
*备注: A preliminary version of this work, titled ‘Best Arm Identification with Resource Constraints,’ was presented at the 27th International Conference on Artificial Intelligence and Statistics (AISTATS 2024). This manuscript extends the original conference paper by providing improved theoretical results and more generalized conclusions, aiming for future journal submission. arXiv admin note: substantial text overlap with arXiv:2402.19090
Abstract:In many applications, evaluating the effectiveness of different alternatives comes with varying costs or resource usage. Motivated by such heterogeneity, we study the Best Arm Identification with Resource Constraints (BAIwRC) problem, where an agent seeks to identify the best alternative (aka arm) in the presence of resource constraints. Each arm pull consumes one or more types of limited resources. We make two key contributions. First, we propose the Successive Halving with Resource Rationing (SH-RR) algorithm, which integrates resource-aware allocation into the classical successive halving framework on best arm identification. The SH-RR algorithm unifies the theoretical analysis for both the stochastic and deterministic consumption settings, with a new \textit{effective consumption measure}.
[LG-13] Agentic AI-RAN: Enabling Intent-Driven Explainable and Self-Evolving Open RAN Intelligence
链接: https://arxiv.org/abs/2602.24115
作者: Zhizhou He,Yang Luo,Xinkai Liu,Mahdi Boloursaz Mashhadi,Mohammad Shojafar,Merouane Debbah,Rahim Tafazolli
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures
Abstract:Open RAN (O-RAN) exposes rich control and telemetry interfaces across the Non-RT RIC, Near-RT RIC, and distributed units, but also makes it harder to operate multi-tenant, multi-objective RANs in a safe and auditable manner. In parallel, agentic AI systems with explicit planning, tool use, memory, and self-management offer a natural way to structure long-lived control loops. This article surveys how such agentic controllers can be brought into O-RAN: we review the O-RAN architecture, contrast agentic controllers with conventional ML/RL xApps, and organise the task landscape around three clusters: network slice life-cycle, radio resource management (RRM) closed loops, and cross-cutting security, privacy, and compliance. We then introduce a small set of agentic primitives (Plan-Act-Observe-Reflect, skills as tool use, memory and evidence, and self-management gates) and show, in a multi-cell O-RAN simulation, how they improve slice life-cycle and RRM performance compared to conventional baselines and ablations that remove individual primitives. Security, privacy, and compliance are discussed as architectural constraints and open challenges for standards-aligned deployments. This framework achieves an average 8.83% reduction in resource usage across three classic network slices.
[LG-14] The Subjectivity of Monoculture
链接: https://arxiv.org/abs/2602.24086
作者: Nathanael Jo,Nikhil Garg,Manish Raghavan
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning models – including large language models (LLMs) – are often said to exhibit monoculture, where outputs agree strikingly often. But what does it actually mean for models to agree too much? We argue that this question is inherently subjective, relying on two key decisions. First, the analyst must specify a baseline null model for what “independence” should look like. This choice is inherently subjective, and as we show, different null models result in dramatically different inferences about excess agreement. Second, we show that inferences depend on the population of models and items under consideration. Models that seem highly correlated in one context may appear independent when evaluated on a different set of questions, or against a different set of peers. Experiments on two large-scale benchmarks validate our theoretical findings. For example, we find drastically different inferences when using a null model with item difficulty compared to previous works that do not. Together, our results reframe monoculture evaluation not as an absolute property of model behavior, but as a context-dependent inference problem.
[LG-15] Neural Diffusion Intensity Models for Point Process Data
链接: https://arxiv.org/abs/2602.24083
作者: Xinlong Du,Harsha Honnappa,Vinayak Rao
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:
Abstract:Cox processes model overdispersed point process data via a latent stochastic intensity, but both nonparametric estimation of the intensity model and posterior inference over intensity paths are typically intractable, relying on expensive MCMC methods. We introduce Neural Diffusion Intensity Models, a variational framework for Cox processes driven by neural SDEs. Our key theoretical result, based on enlargement of filtrations, shows that conditioning on point process observations preserves the diffusion structure of the latent intensity with an explicit drift correction. This guarantees the variational family contains the true posterior, so that ELBO maximization coincides with maximum likelihood estimation under sufficient model capacity. We design an amortized encoder architecture that maps variable-length event sequences to posterior intensity paths by simulating the drift-corrected SDE, replacing repeated MCMC runs with a single forward pass. Experiments on synthetic and real-world data demonstrate accurate recovery of latent intensity dynamics and posterior paths, with orders-of-magnitude speedups over MCMC-based methods.
[LG-16] Leveraging Non-linear Dimension Reduction and Random Walk Co-occurrence for Node Embedding
链接: https://arxiv.org/abs/2602.24069
作者: Ryan DeWolfe
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 13 pages, 6 figures
Abstract:Leveraging non-linear dimension reduction techniques, we remove the low dimension constraint from node embedding and propose COVE, an explainable high dimensional embedding that, when reduced to low dimension with UMAP, slightly increases performance on clustering and link prediction tasks. The embedding is inspired by neural embedding methods that use co-occurrence on a random walk as an indication of similarity, and is closely related to a diffusion process. Extending on recent community detection benchmarks, we find that a COVE UMAP HDBSCAN pipeline performs similarly to the popular Louvain algorithm.
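The random-walk co-occurrence signal this abstract builds on can be illustrated with a minimal pure-Python sketch (not the paper's COVE implementation; all names and parameters are illustrative): count how often pairs of nodes fall within a small window on uniform random walks, so that nodes in the same dense region accumulate high counts.

```python
import random
from collections import defaultdict

def cooccurrence_counts(adj, walk_len=10, walks_per_node=20, window=2, seed=0):
    """Count node co-occurrences within a window on uniform random walks.

    `adj` maps each node to its neighbour list. The counts act as a
    similarity signal, in the spirit of random-walk node embedding methods.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_len - 1):
                walk.append(rng.choice(adj[walk[-1]]))
            for i, u in enumerate(walk):
                for v in walk[i + 1:i + 1 + window]:  # pairs within the window
                    counts[(u, v)] += 1
                    counts[(v, u)] += 1
    return counts
```

On two triangles joined by a single bridge edge, within-triangle pairs accumulate far higher counts than cross-triangle pairs, which is the structure a downstream clustering step (UMAP + HDBSCAN in the abstract) can exploit.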
[LG-17] pathsig: A GPU-Accelerated Library for Truncated and Projected Path Signatures
链接: https://arxiv.org/abs/2602.24066
作者: Tobias Nygaard
类目: Machine Learning (cs.LG)
*备注:
Abstract:Path signatures provide a rich representation of sequential data, with strong theoretical guarantees and good performance in a variety of machine-learning tasks. While signatures have progressed from fixed feature extractors to trainable components of machine-learning models, existing libraries often lack the required scalability for large-scale, gradient-based learning. To address this gap, this paper introduces pathsig, a PyTorch-native library that computes path signatures directly in the word basis. By using CUDA kernels to update signature coefficients in parallel over prefix-closed word sets, pathsig achieves high GPU throughput and near-minimal peak memory. Compared with other libraries, pathsig achieves 10-30x speedups for computation of truncated signatures and up to 4-10x speedups in training that require backpropagation through the signature. Beyond regular truncation, pathsig supports projections of the (infinite-dimensional) signature onto user-specified sets of words and anisotropic truncation motivated by inhomogeneous path regularity, enabling more compact representations that can reduce dimensionality, redundancy, and computational cost.
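Truncated signatures of piecewise-linear paths can be computed by combining per-segment signatures with Chen's identity, which is essentially the recurrence such libraries parallelise over word sets. Below is a depth-2 pure-Python sketch (illustrative only; pathsig's actual API and kernels will differ):

```python
def segment_signature(inc):
    """Depth-2 signature of one linear segment with increment vector `inc`."""
    d = len(inc)
    s1 = list(inc)
    s2 = [[inc[i] * inc[j] / 2.0 for j in range(d)] for i in range(d)]
    return s1, s2

def chen_concat(sig_a, sig_b):
    """Chen's identity: signature of the concatenation of two paths."""
    a1, a2 = sig_a
    b1, b2 = sig_b
    d = len(a1)
    c1 = [a1[i] + b1[i] for i in range(d)]
    c2 = [[a2[i][j] + b2[i][j] + a1[i] * b1[j] for j in range(d)]
          for i in range(d)]
    return c1, c2

def path_signature(points):
    """Depth-2 truncated signature of a piecewise-linear path (list of vertices)."""
    d = len(points[0])
    sig = ([0.0] * d, [[0.0] * d for _ in range(d)])  # identity signature
    for p, q in zip(points, points[1:]):
        inc = [q[k] - p[k] for k in range(d)]
        sig = chen_concat(sig, segment_signature(inc))
    return sig
```

For the L-shaped path (0,0) → (1,0) → (1,1), level 1 is the total increment (1, 1) and the off-diagonal level-2 terms encode the order of the moves (area-like information), which is what makes signatures order-sensitive features.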
[LG-18] Unsupervised Baseline Clustering and Incremental Adaptation for IoT Device Traffic Profiling
链接: https://arxiv.org/abs/2602.24047
作者: Sean M. Alderman,John D. Hastings
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures, 4 tables
Abstract:The growth and heterogeneity of IoT devices create security challenges where static identification models can degrade as traffic evolves. This paper presents a two-stage, flow-feature-based pipeline for unsupervised IoT device traffic profiling and incremental model updating, evaluated on selected long-duration captures from the Deakin IoT dataset. For baseline profiling, density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, we evaluate stream-oriented clustering approaches and find that BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71). Overall, the results highlight a practical trade-off between high-purity static profiling and the flexibility of incremental clustering for evolving IoT environments.
[LG-19] InfoNCE Induces Gaussian Distribution ICLR2026
链接: https://arxiv.org/abs/2602.24012
作者: Roy Betser,Eyal Gofer,Meir Yossef Levi,Guy Gilboa
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted to ICLR 2026, Oral
Abstract:Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
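For reference, the InfoNCE objective the analysis centres on is a softmax cross-entropy over similarities, where each anchor's positive pair sits on the diagonal. A minimal pure-Python sketch of the standard loss (illustrative, not the paper's training code):

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """Mean InfoNCE loss for a batch of paired embeddings.

    sim_matrix[i][j] is the similarity between anchor i and candidate j;
    the positive for anchor i is entry j == i. Returns the mean
    negative log-softmax probability assigned to the positives.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax at the positive
    return total / n
```

When all similarities are equal the loss equals log(batch size), the chance-level value; aligning positives above negatives drives it below that.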
[LG-20] Learning Generation Orders for Masked Discrete Diffusion Models via Variational Inference
链接: https://arxiv.org/abs/2602.23968
作者: David Fox,Sam Bowyer,Song Liu,Laurence Aitchison,Raul Santos-Rodriguez,Mengyue Yang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 1 figure
Abstract:Masked discrete diffusion models (MDMs) are a promising new approach to generative modelling, offering the ability for parallel token generation and therefore greater efficiency than autoregressive counterparts. However, achieving an optimal balance between parallel generation and sample quality remains an open problem. Current approaches primarily address this issue through fixed, heuristic parallel sampling methods. There exist some recent learning-based approaches to this problem, but its formulation from the perspective of variational inference remains underexplored. In this work, we propose a variational inference framework for learning parallel generation orders for MDMs. As part of our method, we propose a parameterisation for the approximate posterior of generation orders which facilitates parallelism and efficient sampling during training. Using this method, we conduct preliminary experiments on the GSM8K dataset, where our method performs competitively against heuristic sampling strategies in the regime of highly parallel generation. For example, our method achieves 33.1% accuracy with an average of only 4 generation steps, compared to 23.7-29.0% accuracy achieved by standard competitor methods in the same number of steps. We believe further experiments and analysis of the method will yield valuable insights into the problem of parallel generation with MDMs.
[LG-21] Learning to Build: Autonomous Robotic Assembly of Stable Structures Without Predefined Plans
链接: https://arxiv.org/abs/2602.23934
作者: Jingwen Wang,Johannes Kirschner,Paul Rolland,Luis Salamanca,Stefana Parascho
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a novel autonomous robotic assembly framework for constructing stable structures without relying on predefined architectural blueprints. Instead of following fixed plans, construction tasks are defined through targets and obstacles, allowing the system to adapt more flexibly to environmental uncertainty and variations during the building process. A reinforcement learning (RL) policy, trained using deep Q-learning with successor features, serves as the decision-making component. As a proof of concept, we evaluate the approach on a benchmark of 15 2D robotic assembly tasks of discrete block construction. Experiments using a real-world closed-loop robotic setup demonstrate the feasibility of the method and its ability to handle construction noise. The results suggest that our framework offers a promising direction for more adaptable and robust robotic construction in real-world environments.
[LG-22] A Theory of Random Graph Shift in Truncated-Spectrum vRKHS
链接: https://arxiv.org/abs/2602.23880
作者: Zhang Wan,Tingting Mu,Samuel Kaski
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper develops a theory of graph classification under domain shift through a random-graph generative lens, where we consider intra-class graphs sharing the same random graph model (RGM) and the domain shift induced by changes in RGM components. While classic domain adaptation (DA) theories have well-underpinned existing techniques to handle graph distribution shift, the information of graph samples, which are itself structured objects, is less explored. The non-Euclidean nature of graphs and specialized architectures for graph learning further complicate a fine-grained analysis of graph distribution shifts. In this paper, we propose a theory that assumes RGM as the data generative process, exploiting its connection to hypothesis complexity in function space perspective for such fine-grained analysis. Building on a vector-valued reproducing kernel Hilbert space (vRKHS) formulation, we derive a generalization bound whose shift penalty admits a factorization into (i) a domain discrepancy term, (ii) a spectral-geometry term summarized by the accessible truncated spectrum, and (iii) an amplitude term that aggregates convergence and construction-stability effects. We empirically verify the insights on these terms in both real data and simulations.
[LG-23] ULW-SleepNet: An Ultra-Lightweight Network for Multimodal Sleep Stage Scoring ICASSP2026
链接: https://arxiv.org/abs/2602.23852
作者: Zhaowen Wang,Dongdong Zhou,Qi Xu,Fengyu Cong,Mohammad Al-Sa’d,Jenni Raitoharju
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted to ICASSP 2026
Abstract:Automatic sleep stage scoring is crucial for the diagnosis and treatment of sleep disorders. Although deep learning models have advanced the field, many existing models are computationally demanding and designed for single-channel electroencephalography (EEG), limiting their practicality for multimodal polysomnography (PSG) data. To overcome this, we propose ULW-SleepNet, an ultra-lightweight multimodal sleep stage scoring framework that efficiently integrates information from multiple physiological signals. ULW-SleepNet incorporates a novel Dual-Stream Separable Convolution (DSSC) Block, depthwise separable convolutions, channel-wise parameter sharing, and global average pooling to reduce computational overhead while maintaining competitive accuracy. Evaluated on the Sleep-EDF-20 and Sleep-EDF-78 datasets, ULW-SleepNet achieves accuracies of 86.9% and 81.4%, respectively, with only 13.3K parameters and 7.89M FLOPs. Compared to state-of-the-art methods, our model reduces parameters by up to 98.6% with only marginal performance loss, demonstrating its strong potential for real-time sleep monitoring on wearable and IoT devices. The source code for this study is publicly available at this https URL.
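The parameter savings from the depthwise separable convolutions that ULW-SleepNet relies on can be shown with a simple count (a back-of-the-envelope sketch, not the paper's architecture; the layer sizes below are hypothetical):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard convolution (bias ignored).

    k is the number of kernel elements, e.g. 9 for a 3x3 kernel.
    """
    return c_in * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise conv (one k-sized filter per input channel)
    followed by a pointwise 1x1 channel-mixing conv."""
    return c_in * k + c_in * c_out
```

For an illustrative 64-to-128-channel layer with 3x3 kernels, the standard convolution needs 73,728 weights while the separable factorisation needs 8,768, roughly an 8x reduction, which is how such models reach tens of thousands of parameters in total.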
[LG-24] Inferring Chronic Treatment Onset from ePrescription Data: A Renewal Process Approach
链接: https://arxiv.org/abs/2602.23824
作者: Pavlin G. Poličar,Dalibor Stanimirović,Blaž Zupan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Longitudinal electronic health record (EHR) data are often left-censored, making diagnosis records incomplete and unreliable for determining disease onset. In contrast, outpatient prescriptions form renewal-based trajectories that provide a continuous signal of disease management. We propose a probabilistic framework to infer chronic treatment onset by modeling prescription dynamics as a renewal process and detecting transitions from sporadic to sustained therapy via change-point detection between a baseline Poisson (sporadic prescribing) regime and a regime-specific Weibull (sustained therapy) renewal model. Using a nationwide ePrescription dataset of 2.4 million individuals, we show that the approach yields more temporally plausible onset estimates than naive rule-based triggering, substantially reducing implausible early detections under strong left censoring. Detection performance varies across diseases and is strongly associated with prescription density, highlighting both the strengths and limits of treatment-based onset inference.
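The regime-switching idea above can be caricatured in a few lines: score the inter-prescription gaps under an exponential-gap (Poisson, sporadic) regime versus a Weibull renewal (sustained-therapy) regime, and scan all split points for the most likely transition. This is an illustrative sketch with hypothetical fixed parameters, not the authors' probabilistic framework:

```python
import math

def exp_loglik(gaps, rate):
    """Log-likelihood of gaps under the sporadic (exponential-gap) regime."""
    return sum(math.log(rate) - rate * g for g in gaps)

def weibull_loglik(gaps, shape, scale):
    """Log-likelihood of gaps under the sustained-therapy (Weibull) regime."""
    total = 0.0
    for g in gaps:
        z = g / scale
        total += math.log(shape / scale) + (shape - 1) * math.log(z) - z ** shape
    return total

def best_changepoint(gaps, rate, shape, scale):
    """Return the split index maximising exponential-before + Weibull-after."""
    best_t, best_ll = 0, float("-inf")
    for t in range(len(gaps) + 1):
        ll = exp_loglik(gaps[:t], rate) + weibull_loglik(gaps[t:], shape, scale)
        if ll > best_ll:
            best_t, best_ll = t, ll
    return best_t
```

With three long, irregular gaps followed by monthly refills, the scan places the onset exactly at the transition to regular prescribing.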
[LG-25] ReasonX: Declarative Reasoning on Explanations
链接: https://arxiv.org/abs/2602.23810
作者: Laura State,Salvatore Ruggieri,Franco Turini
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Explaining opaque Machine Learning (ML) models has become an increasingly important challenge. However, current eXplainable AI (XAI) methods suffer from several shortcomings, including insufficient abstraction, limited user interactivity, and inadequate integration of symbolic knowledge. We propose ReasonX, an explanation tool based on expressions (or, queries) in a closed algebra of operators over theories of linear constraints. ReasonX provides declarative and interactive explanations for decision trees, which may represent the ML models under analysis or serve as global or local surrogate models for any black-box predictor. Users can express background or common sense knowledge as linear constraints. This allows for reasoning at multiple levels of abstraction, ranging from fully specified examples to under-specified or partially constrained ones. ReasonX leverages Mixed-Integer Linear Programming (MILP) to reason over the features of factual and contrastive instances. We present here the architecture of ReasonX, which consists of a Python layer, closer to the user, and a Constraint Logic Programming (CLP) layer, which implements a meta-interpreter of the query algebra. The capabilities of ReasonX are demonstrated through qualitative examples, and compared to other XAI tools through quantitative experiments.
[LG-26] Actor-Critic Pretraining for Proximal Policy Optimization
链接: https://arxiv.org/abs/2602.23804
作者: Andreas Kernbach,Amr Elsheikh,Nicolas Grupp,René Nagel,Marco F. Huber
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) actor-critic algorithms enable autonomous learning but often require a large number of environment interactions, which limits their applicability in robotics. Leveraging expert data can reduce the number of required environment interactions. A common approach is actor pretraining, where the actor network is initialized via behavioral cloning on expert demonstrations and subsequently fine-tuned with RL. In contrast, the initialization of the critic network has received little attention, despite its central role in policy optimization. This paper proposes a pretraining approach for actor-critic algorithms like Proximal Policy Optimization (PPO) that uses expert demonstrations to initialize both networks. The actor is pretrained via behavioral cloning, while the critic is pretrained using returns obtained from rollouts of the pretrained policy. The approach is evaluated on 15 simulated robotic manipulation and locomotion tasks. Experimental results show that actor-critic pretraining improves sample efficiency by 86.1% on average compared to no pretraining and by 30.9% to actor-only pretraining.
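The critic-pretraining target described above (values regressed on returns from rollouts of the behaviour-cloned policy) reduces to computing discounted Monte-Carlo returns per rollout. A minimal sketch (illustrative, not the paper's code; the rollout format is a hypothetical simplification):

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def critic_pretrain_targets(rollouts, gamma=0.99):
    """Flatten (state, return) pairs from rollouts of the pretrained policy.

    Each rollout is a list of (state, reward) tuples; the pairs can then be
    fit with a regression loss to initialise the value (critic) network.
    """
    pairs = []
    for rollout in rollouts:
        rewards = [r for _, r in rollout]
        for (s, _), g in zip(rollout, discounted_returns(rewards, gamma)):
            pairs.append((s, g))
    return pairs
```

Initialising the critic on such targets gives PPO a value baseline consistent with the cloned actor from the very first policy update.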
[LG-27] GRAIL: Post-hoc Compensation by Linear Reconstruction for Compressed Networks
链接: https://arxiv.org/abs/2602.23795
作者: Wenwu Tang,Dong Wang,Lothar Thiele,Olga Saukh
类目: Machine Learning (cs.LG)
*备注: Conference on Parsimony and Learning (CPAL)
Abstract:Structured deep model compression methods are hardware-friendly and substantially reduce memory and inference costs. However, under aggressive compression, the resulting accuracy degradation often necessitates post-compression finetuning, which can be impractical due to missing labeled data or high training cost. We propose post-hoc blockwise compensation, called GRAIL, a simple zero-finetuning step applied after model compression that restores each block’s input-output behavior using a small calibration set. The method summarizes hidden activations via a Gram matrix and applies ridge regression to linearly reconstruct the original hidden representation from the reduced one. The resulting reconstruction map is absorbed into the downstream projection weights, while the upstream layer is compressed. The approach is selector-agnostic (Magnitude, Wanda, Gram-based selection, or folding), data-aware (requiring only a few forward passes without gradients or labels), and recovers classic pruning or folding when the Gram matrix is near identity, indicating weak inter-channel correlations. Across ResNets, ViTs, and decoder-only LLMs, GRAIL consistently improves accuracy or perplexity over data-free and data-aware pruning or folding baselines in practical compression regimes, with manageable overhead and no backpropagation. The code is available at this https URL.
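At its core, the compensation step described above is ridge regression from the compressed activations back to the original ones, with the fitted map absorbed downstream. A deliberately scalar sketch (one surviving channel; illustrative only, not the released implementation):

```python
def ridge_reconstruction(x_kept, y_full, lam=1e-3):
    """Fit per-output ridge weights mapping one kept channel to the
    original channels, from a small calibration set.

    x_kept: calibration activations of the surviving channel.
    y_full: per-sample activations of the original channels (list of lists).
    Absorbing the returned weights into the next layer's projection
    compensates for the pruned representation without finetuning.
    """
    gram = sum(x * x for x in x_kept) + lam  # 1x1 Gram matrix plus ridge term
    n_out = len(y_full[0])
    weights = []
    for j in range(n_out):
        xty = sum(x * y[j] for x, y in zip(x_kept, y_full))
        weights.append(xty / gram)
    return weights
```

In the full method the Gram matrix is over all kept channels and the solve is a matrix ridge regression, but the closed form and the "fit on a few forward passes, fold into the weights" workflow are the same.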
[LG-28] Provable Subspace Identification of Nonlinear Multi-view CCA
链接: https://arxiv.org/abs/2602.23785
作者: Zhiwei Han,Stefan Matthes,Hao Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:We investigate the identifiability of nonlinear Canonical Correlation Analysis (CCA) in a multi-view setup, where each view is generated by an unknown nonlinear map applied to a linear mixture of shared latents and view-private noise. Rather than attempting exact unmixing, a problem proven to be ill-posed, we instead reframe multi-view CCA as a basis-invariant subspace identification problem. We prove that, under suitable latent priors and spectral separation conditions, multi-view CCA recovers the pairwise correlated signal subspaces up to view-wise orthogonal ambiguity. For N \geq 3 views, the objective provably isolates the jointly correlated subspaces shared across all views while eliminating view-private variations. We further establish finite-sample consistency guarantees by translating the concentration of empirical cross-covariances into explicit subspace error bounds via spectral perturbation theory. Experiments on synthetic and rendered image datasets validate our theoretical findings and confirm the necessity of the assumed conditions.
[LG-29] MAGE: Multi-scale Autoregressive Generation for Offline Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2602.23770
作者: Chenxing Lin,Xinhui Gao,Haipeng Zhang,Xinran Li,Haitao Wang,Songzhu Mei,Chenglu Wen,Weiquan Liu,Siqi Shen,Cheng Wang
类目: Machine Learning (cs.LG)
*备注: ICLR2026
Abstract:Generative models have gained significant traction in offline reinforcement learning (RL) due to their ability to model complex trajectory distributions. However, existing generation-based approaches still struggle with long-horizon tasks characterized by sparse rewards. Some hierarchical generation methods have been developed to mitigate this issue by decomposing the original problem into shorter-horizon subproblems using one policy and generating detailed actions with another. While effective, these methods often overlook the multi-scale temporal structure inherent in trajectories, resulting in suboptimal performance. To overcome these limitations, we propose MAGE, a Multi-scale Autoregressive GEneration-based offline RL method. MAGE incorporates a condition-guided multi-scale autoencoder to learn hierarchical trajectory representations, along with a multi-scale transformer that autoregressively generates trajectory representations from coarse to fine temporal scales. MAGE effectively captures temporal dependencies of trajectories at multiple resolutions. Additionally, a condition-guided decoder is employed to exert precise control over short-term behaviors. Extensive experiments on five offline RL benchmarks against fifteen baseline algorithms show that MAGE successfully integrates multi-scale trajectory modeling with conditional guidance, generating coherent and controllable trajectories in long-horizon sparse-reward settings.
[LG-30] A Boundary Integral-based Neural Operator for Mesh Deformation
链接: https://arxiv.org/abs/2602.23703
作者: Zhengyu Wu(1),Jun Liu(1),Wei Wang(1) ((1) School of Electronics and Information Engineering, Hangzhou Dianzi University)
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: the code will be available upon request
Abstract:This paper presents an efficient mesh deformation method based on boundary integration and neural operators, formulating the problem as a linear elasticity boundary value problem (BVP). To overcome the high computational cost of traditional finite element methods and the limitations of existing neural operators in handling Dirichlet boundary conditions for vector fields, we introduce a direct boundary integral representation using a Dirichlet-type Green’s tensor. This formulation expresses the internal displacement field solely as a function of boundary displacements, eliminating the need to solve for unknown tractions. Building on this, we design a Boundary-Integral-based Neural Operator (BINO) that learns the geometry- and material-aware Green’s traction kernel. A key technical advantage of our framework is the mathematical decoupling of the physical integration process from the geometric representation via geometric descriptors. While this study primarily demonstrates robust generalization across diverse boundary conditions, the architecture inherently possesses potential for cross-geometry adaptation. Numerical experiments, including large deformations of flexible beams and rigid-body motions of NACA airfoils, confirm the model’s high accuracy and strict adherence to the principles of linearity and superposition. The results demonstrate that the proposed framework ensures mesh quality and computational efficiency, providing a reliable new paradigm for parametric mesh generation and shape optimization in engineering.
[LG-31] Disentangled Mode-Specific Representations for Tensor Time Series via Contrastive Learning
链接: https://arxiv.org/abs/2602.23663
作者: Kohei Obata,Taichi Murayama,Zheng Chen,Yasuko Matsubara,Yasushi Sakurai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-mode tensor time series (TTS) can be found in many domains, such as search engines and environmental monitoring systems. Learning representations of a TTS benefits various applications, but it is also challenging since the complexities inherent in the tensor hinder the realization of rich representations. In this paper, we propose a novel representation learning method designed specifically for TTS, namely MoST. Specifically, MoST uses a tensor slicing approach to reduce the complexity of the TTS structure and learns representations that can be disentangled into individual non-temporal modes. Each representation captures mode-specific features, which are the relationships between variables within the same mode, and mode-invariant features, which are common across representations of different modes. We employ a contrastive learning framework to learn parameters; the loss function comprises two parts intended to learn representations in a mode-specific way and a mode-invariant way, effectively exploiting disentangled representations as augmentations. Extensive experiments on real-world datasets show that MoST consistently outperforms the state-of-the-art methods in terms of classification and forecasting accuracy. Code is available at this https URL.
[LG-32] Selective Denoising Diffusion Model for Time Series Anomaly Detection
链接: https://arxiv.org/abs/2602.23662
作者: Kohei Obata,Zheng Chen,Yasuko Matsubara,Lingwei Zhu,Yasushi Sakurai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series anomaly detection (TSAD) has been an important area of research for decades, with reconstruction-based methods, mostly based on generative models, gaining popularity and demonstrating success. Diffusion models have recently attracted attention due to their advanced generative capabilities. Existing diffusion-based methods for TSAD rely on a conditional strategy, which reconstructs input instances from white noise with the aid of the conditioner. However, this poses challenges in accurately reconstructing the normal parts, resulting in suboptimal detection performance. In response, we propose a novel diffusion-based method, named AnomalyFilter, which acts as a selective filter that only denoises anomaly parts in the instance while retaining normal parts. To build such a filter, we mask Gaussian noise during the training phase and conduct the denoising process without adding noise to the instances. The synergy of the two simple components greatly enhances the performance of naive diffusion models. Extensive experiments on five datasets demonstrate that AnomalyFilter achieves notably low reconstruction error on normal parts, providing empirical support for its effectiveness in anomaly detection. AnomalyFilter represents a pioneering approach that focuses on the noise design of diffusion models specifically tailored for TSAD.
[LG-33] On the Convergence of Single-Loop Stochastic Bilevel Optimization with Approximate Implicit Differentiation
链接: https://arxiv.org/abs/2602.23633
作者: Yubo Zhou,Luo Luo,Guang Dai,Haishan Ye
类目: Machine Learning (cs.LG)
*备注:
Abstract:Stochastic Bilevel Optimization has emerged as a fundamental framework for meta-learning and hyperparameter optimization. Despite the practical prevalence of single-loop algorithms, which update lower and upper variables concurrently, their theoretical understanding, particularly in the stochastic regime, remains significantly underdeveloped compared to their multi-loop counterparts. Existing analyses often yield suboptimal convergence rates or obscure the critical dependence on the lower-level condition number \kappa, frequently burying it within generic Lipschitz constants. In this paper, we bridge this gap by providing a refined convergence analysis of the Single-loop Stochastic Approximate Implicit Differentiation (SSAID) algorithm. We prove that SSAID achieves an \epsilon-stationary point with an oracle complexity of \mathcal{O}(\kappa^7 \epsilon^{-2}). Our result is noteworthy in two aspects: (i) it matches the optimal \mathcal{O}(\epsilon^{-2}) rate of state-of-the-art multi-loop methods (e.g., stocBiO) while maintaining the computational efficiency of a single-loop update; and (ii) it provides the first explicit, fine-grained characterization of the \kappa-dependence for stochastic AID-based single-loop methods. This work demonstrates that SSAID is not merely a heuristic approach, but admits a rigorous theoretical foundation with convergence guarantees competitive with mainstream multi-loop frameworks.
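The single-loop structure with approximate implicit differentiation (AID) can be sketched on a deterministic quadratic toy problem. This is not the paper's SSAID implementation; the problem, step sizes, and iteration count are chosen only to show one inner and one outer update per step:

```python
# Toy bilevel problem (illustrative only):
#   upper:  f(x, y) = 0.5*(y - 1)**2 + 0.5*x**2
#   lower:  g(x, y) = 0.5*(y - x)**2   =>  y*(x) = x
# The hyperobjective F(x) = f(x, y*(x)) is minimized at x = 0.5.

x, y = 2.0, 0.0
alpha, beta = 0.05, 0.5

for _ in range(2000):
    # Lower-level step: a single inner update per outer step ("single loop").
    y -= beta * (y - x)
    # Approximate implicit differentiation: solve g_yy * v = f_y (g_yy = 1 here).
    v = y - 1.0
    # Hypergradient: f_x - g_xy * v, with g_xy = -1 for this toy problem.
    hypergrad = x + v
    x -= alpha * hypergrad

print(round(x, 3), round(y, 3))  # both converge near 0.5
```

The point of the single-loop scheme is that the inner variable is never solved to optimality between outer steps, which is exactly what makes the condition-number dependence in the analysis delicate.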
[LG-34] BTTackler: A Diagnosis-based Framework for Efficient Deep Learning Hyperparameter Optimization
链接: https://arxiv.org/abs/2602.23630
作者: Zhongyi Pei,Zhiyao Cen,Yipeng Huang,Chen Wang,Lin Liu,Philip Yu,Mingsheng Long
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hyperparameter optimization (HPO) is known to be costly in deep learning, especially when leveraging automated approaches. Most of the existing automated HPO methods are accuracy-based, i.e., accuracy metrics are used to guide the trials of different hyperparameter configurations amongst a specific search space. However, many trials may encounter severe training problems, such as vanishing gradients and insufficient convergence, which can hardly be reflected by accuracy metrics in the early stages of the training and often result in poor performance. This leads to an inefficient optimization trajectory because the bad trials occupy considerable computation resources and reduce the probability of finding excellent hyperparameter configurations within a time limitation. In this paper, we propose Bad Trial Tackler (BTTackler), a novel HPO framework that introduces training diagnosis to identify training problems automatically and hence tackles bad trials. BTTackler diagnoses each trial by calculating a set of carefully designed quantified indicators and triggers early termination if any training problems are detected. Evaluations are performed on representative HPO tasks consisting of three classical deep neural networks (DNN) and four widely used HPO methods. To better quantify the effectiveness of an automated HPO method, we propose two new measurements based on accuracy and time consumption. Results show the advantage of BTTackler is two-fold: (1) it reduces 40.33% of time consumption to achieve the same accuracy comparable to baseline methods on average and (2) it conducts 44.5% more top-10 trials than baseline methods on average within a given time budget. We also released an open-source Python library that allows users to easily apply BTTackler to automated HPO processes with minimal code changes.
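A diagnosis-based early-termination check in the spirit of BTTackler might look like the following. The indicator names, thresholds, and window size are hypothetical, not the paper's quantified indicators:

```python
def diagnose_trial(grad_norms, losses, grad_eps=1e-6,
                   min_rel_improve=1e-3, window=5):
    """Return a detected training-problem label, or None if the trial looks healthy.

    grad_norms: per-epoch gradient norms; losses: per-epoch training losses.
    Thresholds are illustrative placeholders.
    """
    recent_grads = grad_norms[-window:]
    # Vanishing gradients: all recent gradient norms are near zero.
    if len(recent_grads) == window and max(recent_grads) < grad_eps:
        return "vanishing_gradients"
    # Insufficient convergence: loss barely improved over the window.
    if len(losses) > window:
        old, new = losses[-window - 1], losses[-1]
        if old > 0 and (old - new) / old < min_rel_improve:
            return "insufficient_convergence"
    return None

print(diagnose_trial([1e-8] * 5, [0.9] * 5))                              # vanishing_gradients
print(diagnose_trial([0.5, 0.4, 0.3, 0.2, 0.1],
                     [2.0, 1.5, 1.1, 0.8, 0.6, 0.4]))                     # None
```

A trial flagged with a non-None label would be terminated early, freeing its budget for more promising configurations.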
[LG-35] Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection
链接: https://arxiv.org/abs/2602.23599
作者: Dang Sy Duy,Nguyen Duy Chien,Kapil Dev,Jeff Nijsse
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures
Abstract:Graph neural networks (GNNs) offer a principled approach to financial fraud detection by jointly learning from node features and transaction graph topology. However, their effectiveness on real-world anti-money laundering (AML) benchmarks depends critically on training practices, specifically weight initialisation and normalisation, that remain underexplored. We present a systematic ablation of initialisation and normalisation strategies across three GNN architectures (GCN, GAT, and GraphSAGE) on the Elliptic Bitcoin dataset. Our experiments reveal that initialisation and normalisation are architecture-dependent: GraphSAGE achieves the strongest performance with Xavier initialisation alone, GAT benefits most from combining GraphNorm with Xavier initialisation, while GCN shows limited sensitivity to these modifications. These findings offer practical, architecture-specific guidance for deploying GNNs in AML pipelines for datasets with severe class imbalance. We release a reproducible experimental framework with temporal data splits, seeded runs, and full ablation results.
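Xavier (Glorot) initialisation, the strategy the paper ablates, draws weights from a distribution whose variance balances fan-in and fan-out. A minimal NumPy sketch of the uniform variant:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).

    The variance of U(-a, a) is a**2 / 3 = 2 / (fan_in + fan_out), which keeps
    activation and gradient magnitudes roughly stable across layers.
    """
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
w = xavier_uniform(64, 32, rng)
print(round(w.var(), 4), round(2.0 / (64 + 32), 4))  # empirical vs target variance
```

In frameworks such as PyTorch the same scheme is available as `torch.nn.init.xavier_uniform_`.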
[LG-36] Hybrid Quantum Temporal Convolutional Networks
链接: https://arxiv.org/abs/2602.23578
作者: Junghoon Justin Park,Maria Pak,Sebin Lee,Samuel Yen-Chi Chen,Shinjae Yoo,Huan-Hsin Tseng,Jiook Cha
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantum machine learning models for sequential data face scalability challenges with complex multivariate signals. We introduce the Hybrid Quantum Temporal Convolutional Network (HQTCN), which combines classical temporal windowing with a quantum convolutional neural network core. By applying a shared quantum circuit across temporal windows, HQTCN captures long-range dependencies while achieving significant parameter reduction. Evaluated on synthetic NARMA sequences and high-dimensional EEG time-series, HQTCN performs competitively with classical baselines on univariate data and outperforms all baselines on multivariate tasks. The model demonstrates particular strength under data-limited conditions, maintaining high performance with substantially fewer parameters than conventional approaches. These results establish HQTCN as a parameter-efficient approach for multivariate time-series analysis.
[LG-37] Component Centric Placement Using Deep Reinforcement Learning
链接: https://arxiv.org/abs/2602.23540
作者: Kart Leong Lim
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Automated placement of components on printed circuit boards (PCBs) is a critical stage in placement layout design. While reinforcement learning (RL) has been successfully applied to system-on-chip IP block placement and chiplet arrangement in complex packages, PCB component placement presents unique challenges due to several factors: variation in component sizes, single- and double-sided boards, wirelength constraints, board constraints, and non-overlapping placement requirements. In this work, we adopt a component-centric layout for automating PCB component placement using RL: first, the main component is fixed at the center, while passive components are placed in proximity to the pins of the main component. Free space around the main component is discretized, drastically reducing the search space while still covering all feasible placements; second, we leverage prior knowledge that each passive component's position has to be near its corresponding voltage source. This allows us to design the reward function which avoids wasted exploration of infeasible or irrelevant search space. Using the component-centric layout, we implemented different methods including Deep Q-Network, Actor-Critic algorithm and Simulated Annealing. Evaluation on over nine real-world PCBs of varying complexity shows that our best proposed method approaches near human-like placements in terms of wirelength and feasibility.

[LG-38] Active Value Querying to Minimize Additive Error in Subadditive Set Function Learning
链接: https://arxiv.org/abs/2602.23529
作者: Martin Černý,David Sychrovský,Filip Úradník,Jakub Černý
类目: Machine Learning (cs.LG)
*备注:
Abstract:Subadditive set functions play a pivotal role in computational economics (especially in combinatorial auctions), combinatorial optimization, and artificial intelligence applications such as interpretable machine learning. However, specifying a set function requires assigning values to an exponentially large number of subsets in general, a task that is often resource-intensive in practice, particularly when the values derive from external sources such as retraining of machine learning models. A simple omission of certain values introduces ambiguity that becomes even more significant when the incomplete set function has to be further optimized over. Motivated by the well-known result about inapproximability of subadditive functions using deterministic value queries with respect to a multiplicative error, we study a problem of approximating an unknown subadditive (or a subclass thereof) set function with respect to an additive error, i.e., we aim to efficiently close the distance between minimal and maximal completions. Our contributions are threefold: (i) a thorough exploration of minimal and maximal completions of different classes of set functions with missing values and an analysis of their resulting distance; (ii) the development of methods to minimize this distance over classes of set functions with a known prior, achieved by disclosing values of additional subsets in both offline and online manner; and (iii) empirical demonstrations of the algorithms' performance in practical scenarios.
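The defining property being learned here, subadditivity (v(S ∪ T) ≤ v(S) + v(T) for disjoint S, T), can be checked by brute force on a small ground set. This helper is illustrative and is not the paper's algorithm:

```python
from itertools import chain, combinations

def powerset(items):
    """All subsets of an iterable, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def is_subadditive(v, ground):
    """Check v(S | T) <= v(S) + v(T) for all disjoint nonempty S, T."""
    sets = [frozenset(s) for s in powerset(ground)]
    for s in sets:
        for t in sets:
            if s and t and not (s & t) and v[s | t] > v[s] + v[t] + 1e-12:
                return False
    return True

ground = {1, 2, 3}
# Coverage-style valuation v(S) = min(|S|, 2): subadditive.
v = {frozenset(s): min(len(s), 2) for s in powerset(ground)}
print(is_subadditive(v, ground))  # True
```

The exponential table built here is exactly the cost that motivates the paper's active value-querying approach: on a realistic ground set one can afford only a few of these values.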
[LG-39] Neural Operators Can Discover Functional Clusters
链接: https://arxiv.org/abs/2602.23528
作者: Yicen Li,Jose Antonio Lara Benitez,Ruiyang Hong,Anastasis Kratsios,Paul David McNicholas,Maarten Valentijn de Hoop
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation (stat.CO); Machine Learning (stat.ML)
*备注:
Abstract:Operator learning is reshaping scientific computing by amortizing inference across infinite families of problems. While neural operators (NOs) are increasingly well understood for regression, far less is known for classification and its unsupervised analogue: clustering. We prove that sample-based neural operators can learn any finite collection of classes in an infinite-dimensional reproducing kernel Hilbert space, even when the classes are neither convex nor connected, under mild kernel sampling assumptions. Our universal clustering theorem shows that any K closed classes can be approximated to arbitrary precision by NO-parameterized classes in the upper Kuratowski topology on closed sets, a notion that can be interpreted as disallowing false-positive misclassifications. Building on this, we develop an NO-powered clustering pipeline for functional data and apply it to unlabeled families of ordinary differential equation (ODE) trajectories. Discretized trajectories are lifted by a fixed pre-trained encoder into a continuous feature map and mapped to soft assignments by a lightweight trainable head. Experiments on diverse synthetic ODE benchmarks show that the resulting practical SNO recovers latent dynamical structure in regimes where classical methods fail, providing evidence consistent with our universal clustering theory.
[LG-40] Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory
链接: https://arxiv.org/abs/2602.23516
作者: Meisam Mohammady,Qin Yang,Nicholas Stout,Ayesha Samreen,Han Wang,Christopher J Quinn,Yuan Hong
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 16 pages including appendix. arXiv admin note: text overlap with arXiv:2509.06264
Abstract:Differentially Private Stochastic Gradient Descent (DP-SGD) is a cornerstone technique for ensuring privacy in deep learning, widely used in both training from scratch and fine-tuning large-scale language models. While DP-SGD predominantly relies on the Gaussian mechanism, the Laplace mechanism remains underutilized due to its reliance on L1 norm clipping. This constraint severely limits its practicality in high-dimensional models because the L1 norm of an n-dimensional gradient can be up to sqrt(n) times larger than its L2 norm. As a result, the required noise scale grows significantly with model size, leading to poor utility or untrainable models. In this work, we introduce Lap2, a new solution that enables L2 clipping for Laplace DP-SGD while preserving strong privacy guarantees. We overcome the dimensionality-driven clipping barrier by computing coordinate-wise moment bounds and applying majorization theory to construct a tight, data-independent upper bound over the full model. By exploiting the Schur-convexity of the moment accountant function, we aggregate these bounds using a carefully designed majorization set that respects the L2 clipping constraint. This yields a multivariate privacy accountant that scales gracefully with model dimension and enables the use of thousands of moments. Empirical evaluations demonstrate that our approach significantly improves the performance of Laplace DP-SGD, achieving results comparable to or better than Gaussian DP-SGD under strong privacy constraints. For instance, fine-tuning RoBERTa-base (125M parameters) on SST-2 achieves 87.88% accuracy at epsilon=0.54, outperforming Gaussian (87.16%) and standard Laplace (48.97%) under the same budget.
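The mechanics of one DP-SGD step with L2 clipping and Laplace noise can be sketched as follows. The noise scale here is a placeholder: calibrating it under L2 clipping is precisely what Lap2's majorization-based accountant provides, and that accounting is not reproduced in this sketch.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_scale, rng):
    """One DP-SGD step: L2-clip each per-sample gradient, average, add Laplace noise.

    noise_scale is an uncalibrated placeholder; a real deployment must derive it
    from a privacy accountant (the paper's contribution).
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))  # L2, not L1
               for g in per_sample_grads]
    mean_grad = np.mean(clipped, axis=0)
    return mean_grad + rng.laplace(scale=noise_scale, size=mean_grad.shape)

rng = np.random.default_rng(0)
grads = [rng.normal(size=10) for _ in range(32)]
noisy_grad = dp_sgd_step(grads, clip_norm=1.0, noise_scale=0.1, rng=rng)
print(noisy_grad.shape)
```

The paper's observation is that replacing `np.linalg.norm(g)` (L2) with the L1 norm, as standard Laplace DP-SGD requires, inflates the clipping bound by up to sqrt(n) in n dimensions.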
[LG-41] Sample Size Calculations for Developing Clinical Prediction Models: Overview and pmsims R package
链接: https://arxiv.org/abs/2602.23507
作者: Diana Shamsutdinova,Felix Zimmer,Oyebayo Ridwan Olaniran,Sarah Markham,Daniel Stahl,Gordon Forbes,Ewan Carr
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 26 pages, 4 figures, 1 table, preprint
Abstract:Background: Clinical prediction models are increasingly used to inform healthcare decisions, but determining the minimum sample size for their development remains a critical and unresolved challenge. Inadequate sample sizes can lead to overfitting, poor generalisability, and biased predictions. Existing approaches, such as heuristic rules, closed-form formulas, and simulation-based methods, vary in flexibility and accuracy, particularly for complex data structures and machine learning models. Methods: We review current methodologies for sample size estimation in prediction modelling and introduce a conceptual framework that distinguishes between mean-based and assurance-based criteria. Building on this, we propose a novel simulation-based approach that integrates learning curves, Gaussian Process optimisation, and assurance principles to identify sample sizes that achieve target performance with high probability. This approach is implemented in pmsims, an open-source, model-agnostic R package. Results: Through case studies, we demonstrate that sample size estimates vary substantially across methods, performance metrics, and modelling strategies. Compared to existing tools, pmsims provides flexible, efficient, and interpretable solutions that accommodate diverse models and user-defined metrics while explicitly accounting for variability in model performance. Conclusions: Our framework and software advance sample size methodology for clinical prediction modelling by combining flexibility with computational efficiency. Future work should extend these methods to hierarchical and multimodal data, incorporate fairness and stability metrics, and address challenges such as missing data and complex dependency structures.
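One ingredient of the simulation-based approach described above is learning-curve extrapolation. The sketch below fits an inverse power law perf(n) = a − b·n^(−c) to noise-free pilot results and solves for the n reaching a target; the curve form, the assumption that the asymptote a is known, and all constants are illustrative simplifications, not the pmsims implementation:

```python
import numpy as np

# Synthetic noise-free learning curve: perf(n) = a - b * n**(-c).
a_true, b_true, c_true = 0.85, 1.2, 0.5
ns = np.array([100, 200, 400, 800])
perfs = a_true - b_true * ns ** (-c_true)

# With the asymptote a assumed known, the model is linear in log-space:
#   log(a - perf) = log(b) - c * log(n)
y = np.log(a_true - perfs)
X = np.vstack([np.ones(len(ns)), -np.log(ns)]).T
log_b, c_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Invert the fitted curve to find the n achieving a target performance.
target = 0.80
n_required = (np.exp(log_b) / (a_true - target)) ** (1.0 / c_hat)
print(int(round(n_required)))  # 576
```

Real pilot data are noisy, which is why the package pairs learning curves with Gaussian Process optimisation and assurance criteria rather than a single deterministic fit.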
[LG-42] Spiky Rank and Its Applications to Rigidity and Circuits
链接: https://arxiv.org/abs/2602.23503
作者: Lianna Hambardzumyan,Konstantin Myasnikov,Artur Riazanov,Morgan Shirley,Adi Shraibman
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注:
Abstract:We introduce spiky rank, a new matrix parameter that enhances blocky rank by combining the combinatorial structure of the latter with linear-algebraic flexibility. A spiky matrix is block-structured with diagonal blocks that are arbitrary rank-one matrices, and the spiky rank of a matrix is the minimum number of such matrices required to express it as a sum. This measure extends blocky rank to real matrices and is more robust for problems with both combinatorial and algebraic character. Our conceptual contribution is as follows: we propose spiky rank as a well-behaved candidate matrix complexity measure and demonstrate its potential through applications. We show that large spiky rank implies high matrix rigidity, and that spiky rank lower bounds yield lower bounds for depth-2 ReLU circuits, the basic building blocks of neural networks. On the technical side, we establish tight bounds for random matrices and develop a framework for explicit lower bounds, applying it to Hamming distance matrices and spectral expanders. Finally, we relate spiky rank to other matrix parameters, including blocky rank, sparsity, and the \gamma_2-norm.
[LG-43] Uncertainty-aware Language Guidance for Concept Bottleneck Models
链接: https://arxiv.org/abs/2602.23495
作者: Yangyi Li,Mengdi Huai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Concept Bottleneck Models (CBMs) provide inherent interpretability by first mapping input samples to high-level semantic concepts, followed by a combination of these concepts for the final classification. However, the annotation of human-understandable concepts requires extensive expert knowledge and labor, constraining the broad adoption of CBMs. On the other hand, there are a few works that leverage the knowledge of large language models (LLMs) to construct concept bottlenecks. Nevertheless, they face two essential limitations: First, they overlook the uncertainty associated with the concepts annotated by LLMs and lack a valid mechanism to quantify uncertainty about the annotated concepts, increasing the risk of errors due to hallucinations from LLMs. Additionally, they fail to incorporate the uncertainty associated with these annotations into the learning process for concept bottleneck models. To address these limitations, we propose a novel uncertainty-aware CBM method, which not only rigorously quantifies the uncertainty of LLM-annotated concept labels with valid and distribution-free guarantees, but also incorporates quantified concept uncertainty into the CBM training procedure to account for varying levels of reliability across LLM-annotated concepts. We also provide the theoretical analysis for our proposed method. Extensive experiments on the real-world datasets validate the desired properties of our proposed methods.
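The "valid and distribution-free" guarantee described above is characteristic of conformal prediction; the sketch below shows split-conformal calibration of annotation confidence scores as one plausible mechanism. This is an assumption about the flavor of the guarantee, not the paper's actual method, and all scores are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1  # target miscoverage: keep annotations with >= 90% validity

# Calibration set: nonconformity = 1 - confidence assigned to the true concept.
cal_scores = 1.0 - rng.beta(8, 2, size=500)

# Finite-sample-corrected conformal quantile.
n = len(cal_scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(cal_scores, q_level, method="higher")

# A new LLM annotation is trusted only if its nonconformity falls below q_hat;
# under exchangeability this covers the true label with probability >= 1 - alpha.
new_scores = 1.0 - rng.beta(8, 2, size=2000)
coverage = float(np.mean(new_scores <= q_hat))
print(round(coverage, 2))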
[LG-44] On the Limits of Interpretable Machine Learning in Quintic Root Classification
链接: https://arxiv.org/abs/2602.23467
作者: Rohan Thomas,Majid Bani-Yaghoub
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Can Machine Learning (ML) autonomously recover interpretable mathematical structure from raw numerical data? We aim to answer this question using the classification of real-root configurations of polynomials up to degree five as a structured benchmark. We tested an extensive set of ML models, including decision trees, logistic regression, support vector machines, random forest, gradient boosting, XGBoost, symbolic regression, and neural networks. Neural networks achieved strong in-distribution performance on quintic classification using raw coefficients alone (84.3% ± 0.9% balanced accuracy), whereas decision trees perform substantially worse (59.9% ± 0.9%). However, when provided with an explicit feature capturing sign changes at critical points, decision trees match neural performance (84.2% ± 1.2%) and yield explicit classification rules. Knowledge distillation reveals that this single invariant accounts for 97.5% of the extracted decision structure. Out-of-distribution, data-efficiency, and noise robustness analyses indicate that neural networks learn continuous, data-dependent geometric approximations of the decision boundary rather than recovering scale-invariant symbolic rules. This distinction between geometric approximation and symbolic invariance explains the gap between predictive performance and interpretability observed across models. Although high predictive accuracy is attainable, we find no evidence that the evaluated ML models autonomously recover discrete, human-interpretable mathematical rules from raw coefficients. These results suggest that, in structured mathematical domains, interpretability may require explicit structural inductive bias rather than purely data-driven approximation.
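The decisive feature, the sign pattern of the polynomial at its critical points, can be computed directly: counting sign changes of p over (−∞, real critical points, +∞) gives the number of distinct real roots for a polynomial with simple roots. A sketch of that invariant (assuming simple roots; not the paper's pipeline):

```python
import numpy as np

def num_real_roots(coeffs):
    """Count distinct real roots via sign changes at critical points.

    coeffs: coefficients from highest degree down. Assumes simple roots
    (a root at a critical point would make a value exactly zero).
    """
    p = np.polynomial.Polynomial(coeffs[::-1])  # Polynomial wants low-to-high order
    crit = sorted(r.real for r in p.deriv().roots() if abs(r.imag) < 1e-9)
    lead, deg = coeffs[0], len(coeffs) - 1
    # Signs at -inf and +inf are set by the leading term; magnitude is irrelevant.
    vals = [lead * (-1) ** deg] + [p(c) for c in crit] + [lead]
    signs = [np.sign(v) for v in vals if v != 0]
    return sum(signs[i] != signs[i + 1] for i in range(len(signs) - 1))

# x^5 - 5x^3 + 4x = x(x-1)(x+1)(x-2)(x+2): five real roots.
print(num_real_roots([1, 0, -5, 0, 4, 0]))  # 5
# x^5 + x + 1 has exactly one real root (its derivative 5x^4 + 1 never vanishes).
print(num_real_roots([1, 0, 0, 0, 1, 1]))   # 1
```

Because this count is invariant under scaling of the coefficients, it is exactly the kind of scale-invariant symbolic rule the paper finds neural networks fail to recover from raw coefficients.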
[LG-45] Global Interpretability via Automated Preprocessing: A Framework Inspired by Psychiatric Questionnaires
链接: https://arxiv.org/abs/2602.23459
作者: Eric V. Strobl
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:
Abstract:Psychiatric questionnaires are highly context sensitive and often only weakly predict subsequent symptom severity, which makes the prognostic relationship difficult to learn. Although flexible nonlinear models can improve predictive accuracy, their limited interpretability can erode clinical trust. In fields such as imaging and omics, investigators commonly address visit- and instrument-specific artifacts by extracting stable signal through preprocessing and then fitting an interpretable linear model. We adopt the same strategy for questionnaire data by decoupling preprocessing from prediction: we restrict nonlinear capacity to a baseline preprocessing module that estimates stable item values, and then learn a linear mapping from these stabilized baseline items to future severity. We refer to this two-stage method as REFINE (Redundancy-Exploiting Follow-up-Informed Nonlinear Enhancement), which concentrates nonlinearity in preprocessing while keeping the prognostic relationship transparently linear and therefore globally interpretable through a coefficient matrix, rather than through post hoc local attributions. In experiments, REFINE outperforms other interpretable approaches while preserving clear global attribution of prognostic factors across psychiatric and non-psychiatric longitudinal prediction tasks.
[LG-46] On De-Individuated Neurons: Continuous Symmetries Enable Dynamic Topologies
链接: https://arxiv.org/abs/2602.23405
作者: George Bird
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 22 pages, 5 figures, preprint to be submitted for review at Transactions on Machine Learning Research (TMLR)
Abstract:This paper introduces a novel methodology for dynamic networks by leveraging a new symmetry-principled class of primitives, isotropic activation functions. This approach enables real-time neuronal growth and shrinkage of the architectures in response to task demand. This is made possible by network structural changes that are invariant under symmetry reparameterisations, leaving the computation identical under neurogenesis and well approximated under neurodegeneration. This is undertaken by leveraging the isotropic primitives’ property of basis independence, resulting in the loss of the individuated neurons implicit in the elementwise functional form. Isotropy thereby allows a freedom in the basis to which layers are decomposed and interpreted as individual artificial neurons. This enables a layer-wise diagonalisation procedure, in which typical interconnected layers, such as dense layers, convolutional kernels, and others, can be reexpressed so that neurons have one-to-one, ordered connectivity within alternating layers. This indicates which one-to-one neuron-to-neuron communications are strongly impactful on overall functionality and which are not. Inconsequential neurons can thus be removed (neurodegeneration), and new inactive scaffold neurons added (neurogenesis) whilst remaining analytically invariant in function. A new tunable model parameter, intrinsic length, is also introduced to ensure this analytical invariance. This approach mathematically equates connectivity pruning with neurodegeneration. The diagonalisation also offers new possibilities for mechanistic interpretability into isotropic networks, and it is demonstrated that isotropic dense networks can asymptotically reach a sparsity factor of 50% whilst retaining exact network functionality. Finally, the construction is generalised, demonstrating a nested functional class for this form of isotropic primitive architectures.
[LG-47] U-CAN: Utility-Aware Contrastive Attenuation for Efficient Unlearning in Generative Recommendation
链接: https://arxiv.org/abs/2602.23400
作者: Zezheng Wu,Rui Wang,Xinghe Cheng,Yang Shao,Qing Yang,Jiapu Wang,Jingwei Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generative Recommendation (GenRec) typically leverages Large Language Models (LLMs) to redefine personalization as an instruction-driven sequence generation task. However, fine-tuning on user logs inadvertently encodes sensitive attributes into model parameters, raising critical privacy concerns. Existing Machine Unlearning (MU) techniques struggle to navigate this tension due to the Polysemy Dilemma, where neurons superimpose sensitive data with general reasoning patterns, leading to catastrophic utility loss under traditional gradient or pruning methods. To address this, we propose Utility-aware Contrastive AttenuatioN (U-CAN), a precision unlearning framework that operates on low-rank adapters. U-CAN quantifies risk by contrasting activations and focuses on neurons with asymmetric responses that are highly sensitive to the forgetting set but suppressed on the retention set. To safeguard performance, we introduce a utility-aware calibration mechanism that combines weight magnitudes with retention-set activation norms, assigning higher utility scores to dimensions that contribute strongly to retention performance. Unlike binary pruning, which often fragments network structure, U-CAN develops adaptive soft attenuation with a differentiable decay function to selectively down-scale high-risk parameters on LoRA adapters, suppressing sensitive retrieval pathways and preserving the topological connectivity of reasoning circuits. Experiments on two public datasets across seven metrics demonstrate that U-CAN achieves strong privacy forgetting, utility retention, and computational efficiency.
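Soft attenuation, as opposed to binary pruning, can be sketched as a differentiable decay applied per dimension. The exponential decay form and the risk/utility score definitions below are illustrative assumptions, not U-CAN's actual scoring:

```python
import numpy as np

def soft_attenuate(weights, risk, utility, lam=3.0):
    """Down-scale high-risk, low-utility dimensions with a smooth decay.

    Dimensions whose risk does not exceed their utility are left intact;
    others are scaled by exp(-lam * excess), which stays in (0, 1] and is
    differentiable, unlike a hard pruning mask.
    """
    excess = np.maximum(risk - utility, 0.0)
    return weights * np.exp(-lam * excess)

w = np.ones(4)
risk = np.array([0.9, 0.1, 0.8, 0.0])      # sensitivity to the forgetting set
utility = np.array([0.1, 0.9, 0.7, 0.5])   # contribution to retention performance
out = soft_attenuate(w, risk, utility)
print(np.round(out, 3))
```

The key design choice is that the scale factor never reaches exactly zero, so gradient flow through the adapter (and hence the "topological connectivity" the abstract mentions) is preserved.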
[LG-48] Detoxifying LLMs via Representation Erasure-Based Preference Optimization
链接: https://arxiv.org/abs/2602.23391
作者: Nazanin Mohammadi Sepahvand,Eleni Triantafillou,Hugo Larochelle,Doina Precup,Daniel M. Roy,Gintare Karolina Dziugaite
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) trained on webscale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful “directions” remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats-including relearning attacks and enhanced GCG jailbreaks-where existing representation- and output-based methods fail.
[LG-49] Pacing Opinion Polarization via Graph Reinforcement Learning
链接: https://arxiv.org/abs/2602.23390
作者: Mingkai Liao
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 32 pages, 21 figures
Abstract:Opinion polarization in online social networks poses serious risks to social cohesion and democratic processes. Recent studies formulate polarization moderation as algorithmic intervention problems under opinion dynamics models, especially the Friedkin–Johnsen (FJ) model. However, most existing methods are tailored to specific linear settings and rely on closed-form steady-state analysis, limiting scalability, flexibility, and applicability to cost-aware, nonlinear, or topology-altering interventions. We propose PACIFIER, a graph reinforcement learning framework for sequential polarization moderation via network interventions. PACIFIER reformulates the canonical ModerateInternal (MI) and ModerateExpressed (ME) problems as sequential decision-making tasks, enabling adaptive intervention policies without repeated steady-state recomputation. The framework is objective-agnostic and extends naturally to FJ-consistent settings, including budget-aware interventions, continuous internal opinions, biased-assimilation dynamics, and node removal. Extensive experiments on real-world networks demonstrate strong performance and scalability across diverse moderation scenarios.
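For context, the Friedkin-Johnsen dynamics underlying the MI and ME problems admit a closed-form steady state, which is what the closed-form intervention methods exploit and what PACIFIER avoids recomputing. A minimal sketch assuming a row-stochastic weight matrix W and a uniform susceptibility lam:

```python
import numpy as np

def fj_steady_state(W, s, lam=0.5):
    """Steady-state expressed opinions of the Friedkin-Johnsen model:
    z* = (I - lam*W)^{-1} (1 - lam) s, with W row-stochastic and s the
    internal opinions. Uniform susceptibility lam is a simplification."""
    n = len(s)
    return np.linalg.solve(np.eye(n) - lam * W, (1 - lam) * s)

def polarization(z):
    """Sum of squared deviations from the mean expressed opinion."""
    return float(np.sum((z - z.mean()) ** 2))

rng = np.random.default_rng(0)
n = 20
A = rng.random((n, n)) < 0.3                     # random directed adjacency
W = A / np.maximum(A.sum(1, keepdims=True), 1)   # row-normalise
s = rng.uniform(-1, 1, n)                        # internal opinions in [-1, 1]

z = fj_steady_state(W, s)
# A ModerateInternal-style intervention (shrinking internal opinions)
# reduces polarization at the new steady state:
z_mod = fj_steady_state(W, 0.5 * s)
assert polarization(z_mod) < polarization(z)
```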
[LG-50] Neuro-Symbolic AI for Analytical Solutions of Differential Equations DATE
链接: https://arxiv.org/abs/2502.01476
作者: Orestis Oikonomou,Levi Lingsch,Dana Grund,Siddhartha Mishra,Georgios Kissas
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
*备注: Updates the method and added extra results
Abstract:Analytical solutions to differential equations offer exact, interpretable insight but are rarely available because discovering them requires expert intuition or exhaustive search in combinatorial spaces. We introduce SIGS, a neuro-symbolic framework that automates this process. SIGS uses a formal grammar to generate only syntactically valid building blocks, embeds these expressions into a continuous space, and then searches this space to assemble, score, and refine candidate closed-form solutions by minimizing a physics-based residual. This design unifies symbolic reasoning with numerical optimization; the grammar constrains candidate solution blocks to be proper by construction, while the latent search makes exploration tractable and data-free. SIGS is the first neuro-symbolic method to (i) analytically solve coupled systems of nonlinear PDEs, (ii) discover solutions under grammar misspecification, and (iii) produce accurate symbolic approximations for PDEs lacking known closed-form solutions. Overall, SIGS achieves orders-of-magnitude improvements in accuracy and efficiency over existing symbolic methods on standard benchmarks.
[LG-51] Active Bipartite Ranking with Smooth Posterior Distributions
链接: https://arxiv.org/abs/2602.24263
作者: James Cheshire,Stephan Clémençon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In this article, bipartite ranking, a statistical learning problem involved in many applications and widely studied in the passive context, is approached in a much more general active setting than the discrete one previously considered in the literature. While the latter assumes that the conditional distribution is piecewise constant, the framework we develop permits in contrast to deal with continuous conditional distributions, provided that they fulfill a Hölder smoothness constraint. We first show that a naive approach based on discretisation at a uniform level, fixed a priori and consisting in applying next the active strategy designed for the discrete setting, generally fails. Instead, we propose a novel algorithm, referred to as smooth-rank and designed for the continuous setting, which aims to minimise the distance between the ROC curve of the estimated ranking rule and the optimal one w.r.t. the sup norm. We show that, for a fixed confidence level \epsilon > 0 and probability \delta \in (0,1), smooth-rank is PAC(\epsilon, \delta). In addition, we provide a problem-dependent upper bound on the expected sampling time of smooth-rank and establish a problem-dependent lower bound on the expected sampling time of any PAC(\epsilon, \delta) algorithm. Beyond the theoretical analysis carried out, numerical results are presented, providing solid empirical evidence of the performance of the algorithm proposed, which compares favorably with alternative approaches.
[LG-52] A Variational Estimator for L_p Calibration Errors
链接: https://arxiv.org/abs/2602.24230
作者: Eugène Berta,Sacha Braun,David Holzmüller,Francis Bach,Michael I. Jordan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Calibration, the problem of ensuring that predicted probabilities align with observed class frequencies, is a basic desideratum for reliable prediction with machine learning systems. Calibration error is traditionally assessed via a divergence function, using the expected divergence between predictions and empirical frequencies. Accurately estimating this quantity is challenging, especially in the multiclass setting. Here, we show how to extend a recent variational framework for estimating calibration errors beyond divergences induced by proper losses, to cover a broad class of calibration errors induced by L_p divergences. Our method can separate over- and under-confidence and, unlike non-variational approaches, avoids overestimation. We provide extensive experiments and integrate our code in the open-source package probmetrics (this https URL) for evaluating calibration errors.
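For reference, the classical binned (ECE-style) estimator of the L_p calibration error for binary predictions, the kind of non-variational baseline the paper contrasts with, can be written as follows; the equal-width binning scheme here is the standard one, not taken from the paper:

```python
import numpy as np

def binned_lp_calibration_error(probs, labels, p=1, n_bins=10):
    """Classical binned estimate of the L_p calibration error for binary
    predictions: per-bin |mean confidence - empirical accuracy|^p, weighted
    by bin mass, then the p-th root. The paper's variational estimator is
    different; this is the kind of baseline it improves on."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi) if hi < 1 else (probs >= lo)
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            err += mask.mean() * gap ** p
    return err ** (1 / p)

rng = np.random.default_rng(1)
q = rng.uniform(size=5000)
y_calibrated = (rng.uniform(size=5000) < q).astype(int)       # labels drawn from q
y_overconfident = (rng.uniform(size=5000) < 0.5 * q).astype(int)
assert binned_lp_calibration_error(q, y_calibrated) < \
       binned_lp_calibration_error(q, y_overconfident)
```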
[LG-53] BLISSNet: Deep Operator Learning for Fast and Accurate Flow Reconstruction from Sparse Sensor Measurements
链接: https://arxiv.org/abs/2602.24228
作者: Maksym Veremchuk,K. Andrea Scott,Zhao Pan
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Reconstructing fluid flows from sparse sensor measurements is a fundamental challenge in science and engineering. Widely separated measurements and complex, multiscale dynamics make accurate recovery of fine-scale structures difficult. In addition, existing methods face a persistent tradeoff: high-accuracy models are often computationally expensive, whereas faster approaches typically compromise fidelity. In this work, we introduce BLISSNet, a model that strikes a strong balance between reconstruction accuracy and computational efficiency for both flow reconstruction and nudging-based data assimilation. The model follows a DeepONet-like architecture, enabling zero-shot inference on domains of arbitrary size. After the first model call on a given domain, certain network components can be precomputed, leading to low inference cost for subsequent evaluations on large domains. Consequently, the model can achieve faster inference than classical interpolation methods such as radial basis function or bicubic interpolation. This combination of high accuracy, low cost, and zero-shot generalization makes BLISSNet well-suited for large-scale real-time flow reconstruction and data assimilation tasks.
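One of the classical baselines cited above, radial basis function interpolation from sparse sensors, is easy to sketch. The Gaussian kernel, shape parameter, and synthetic field below are illustrative choices, not taken from the paper:

```python
import numpy as np

def rbf_reconstruct(sensor_xy, sensor_vals, query_xy, eps=3.0, reg=1e-6):
    """Reconstruct a scalar field from sparse sensor readings via Gaussian
    RBF interpolation (a classical baseline for flow reconstruction)."""
    d2 = ((sensor_xy[:, None, :] - sensor_xy[None, :, :]) ** 2).sum(-1)
    K = np.exp(-(eps ** 2) * d2)
    w = np.linalg.solve(K + reg * np.eye(len(sensor_xy)), sensor_vals)
    d2q = ((query_xy[:, None, :] - sensor_xy[None, :, :]) ** 2).sum(-1)
    return np.exp(-(eps ** 2) * d2q) @ w

# Synthetic smooth field u(x, y) = sin(pi x) cos(pi y), 80 random sensors.
rng = np.random.default_rng(2)
field = lambda p: np.sin(np.pi * p[:, 0]) * np.cos(np.pi * p[:, 1])
sensors = rng.uniform(size=(80, 2))
queries = rng.uniform(size=(200, 2))
u_hat = rbf_reconstruct(sensors, field(sensors), queries)
rmse = np.sqrt(np.mean((u_hat - field(queries)) ** 2))
assert rmse < 0.2   # a smooth field is recovered reasonably well
```

Note the tradeoff the abstract mentions: this solve is repeated from scratch for every new sensor configuration, whereas BLISSNet amortizes most of the cost after the first call on a given domain.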
[LG-54] End-to-end Differentiable Calibration and Reconstruction for Optical Particle Detectors
链接: https://arxiv.org/abs/2602.24129
作者: Omar Alterkait,César Jesús-Valls,Ryo Matsumoto,Patrick de Perio,Kazuhiro Terao
类目: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG); Instrumentation and Detectors (physics.ins-det)
*备注:
Abstract:Large-scale homogeneous detectors with optical readouts are widely used in particle detection, with Cherenkov and scintillator neutrino detectors as prominent examples. Analyses in experimental physics rely on high-fidelity simulators to translate sensor-level information into physical quantities of interest. This task critically depends on accurate calibration, which aligns simulation behavior with real detector data, and on tracking, which infers particle properties from optical signals. We present the first end-to-end differentiable optical particle detector simulator, enabling simultaneous calibration and reconstruction through gradient-based optimization. Our approach unifies simulation, calibration, and tracking, which are traditionally treated as separate problems, within a single differentiable framework. We demonstrate that it achieves smooth and physically meaningful gradients across all key stages of light generation, propagation, and detection while maintaining computational efficiency. We show that gradient-based calibration and reconstruction greatly simplify existing analysis pipelines while matching or surpassing the performance of conventional non-differentiable methods in both accuracy and speed. Moreover, the framework’s modularity allows straightforward adaptation to diverse detector geometries and target materials, providing a flexible foundation for experiment design and optimization. The results demonstrate the readiness of this technique for adoption in current and future optical detector experiments, establishing a new paradigm for simulation and reconstruction in particle physics.
[LG-55] Inference-time optimization for experiment-grounded protein ensemble generation
链接: https://arxiv.org/abs/2602.24007
作者: Advaith Maddipatla,Anar Rzayev,Marco Pegoraro,Martin Pacesa,Paul Schanda,Ailie Marx,Sanketh Vedula,Alex M. Bronstein
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Protein function relies on dynamic conformational ensembles, yet current generative models like AlphaFold3 often fail to produce ensembles that match experimental data. Recent experiment-guided generators attempt to address this by steering the reverse diffusion process. However, these methods are limited by fixed sampling horizons and sensitivity to initialization, often yielding thermodynamically implausible results. We introduce a general inference-time optimization framework to solve these challenges. First, we optimize over latent representations to maximize ensemble log-likelihood, rather than perturbing structures post hoc. This approach eliminates dependence on diffusion length, removes initialization bias, and easily incorporates external constraints. Second, we present novel sampling schemes for drawing Boltzmann-weighted ensembles. By combining structural priors from AlphaFold3 with force-field-based priors, we sample from their product distribution while balancing experimental likelihoods. Our results show that this framework consistently outperforms state-of-the-art guidance, improving diversity, physical energy, and agreement with data in X-ray crystallography and NMR, often fitting the experimental data better than deposited PDB structures. Finally, inference-time optimization experiments maximizing ipTM scores reveal that perturbing AlphaFold3 embeddings can artificially inflate model confidence. This exposes a vulnerability in current design metrics, whose mitigation could offer a pathway to reduce false discovery rates in binder engineering.
[LG-56] A distributed semismooth Newton based augmented Lagrangian method for distributed optimization
链接: https://arxiv.org/abs/2602.23854
作者: Qihao Ma,Chengjing Wang,Peipei Tang,Dunbiao Niu,Aimin Xu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注:
Abstract:This paper proposes a novel distributed semismooth Newton based augmented Lagrangian method for solving a class of optimization problems over networks, where the global objective is defined as the sum of locally held cost functions, and communication is restricted to neighboring agents. Specifically, we employ the augmented Lagrangian method to solve an equivalently reformulated constrained version of the original problem. Each resulting subproblem is solved inexactly via a distributed semismooth Newton method. By fully leveraging the structure of the generalized Hessian, a distributed accelerated proximal gradient method is proposed to compute the Newton direction efficiently, eliminating the need to communicate with full Hessian matrices. Theoretical results are also obtained to guarantee the convergence of the proposed algorithm. Numerical experiments demonstrate the efficiency and superiority of our algorithm compared to state-of-the-art distributed algorithms.
[LG-57] Predictive Hotspot Mapping for Data-driven Crime Prediction
链接: https://arxiv.org/abs/2602.23750
作者: Karthik Sriram,Ankur Sinha,Suvashis Choudhary
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 50 pages
Abstract:Predictive hotspot mapping is an important problem in crime prediction and control. An accurate hotspot mapping helps in appropriately targeting the available resources to manage crime in cities. With an aim to make data-driven decisions and automate policing and patrolling operations, police departments across the world are moving towards predictive approaches relying on historical data. In this paper, we create a non-parametric model using a spatio-temporal kernel density formulation for the purpose of crime prediction based on historical data. The proposed approach is also able to incorporate expert inputs coming from humans through alternate sources. The approach has been extensively evaluated in a real-world setting by collaborating with the Delhi police department to make crime predictions that would help in effective assignment of patrol vehicles to control street crime. The results obtained in the paper are promising and can be easily applied in other settings. We release the algorithm and the dataset (masked) used in our study to support future research that will be useful in achieving further improvements.
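The non-parametric spatio-temporal kernel density idea can be sketched as follows; the Gaussian-in-space, exponential-decay-in-time kernel and the bandwidth names are illustrative, not the paper's exact formulation:

```python
import numpy as np

def st_kernel_density(events, grid_xy, t_now, h_space=0.1, h_time=7.0):
    """Spatio-temporal kernel density surface: each past event (x, y, t)
    contributes a Gaussian bump in space, exponentially discounted by
    recency. A minimal sketch of the non-parametric hotspot formulation."""
    xy, t = events[:, :2], events[:, 2]
    w_time = np.exp(-(t_now - t) / h_time)                    # recency weights
    d2 = ((grid_xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    k_space = np.exp(-d2 / (2 * h_space ** 2))
    return (k_space * w_time).sum(1)

rng = np.random.default_rng(3)
# 200 synthetic events clustered near (0.3, 0.7), over days 0..30.
events = np.column_stack([
    rng.normal([0.3, 0.7], 0.05, size=(200, 2)),
    rng.uniform(0, 30, 200),
])
grid = np.array([[0.3, 0.7], [0.9, 0.1]])   # hotspot centre vs. far away
dens = st_kernel_density(events, grid, t_now=30.0)
assert dens[0] > dens[1]                    # the hotspot scores higher
```

Expert inputs could enter such a formulation as additional weighted pseudo-events or as a prior surface added to the density; the paper's exact mechanism is not reproduced here.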
[LG-58] General Bayesian Policy Learning
链接: https://arxiv.org/abs/2602.23672
作者: Masahiro Kato
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:This study proposes the General Bayes framework for policy learning. We consider decision problems in which a decision-maker chooses an action from an action set to maximize its expected welfare. Typical examples include treatment choice and portfolio selection. In such problems, the statistical target is a decision rule, and the prediction of each outcome Y(a) is not necessarily of primary interest. We formulate this policy learning problem by loss-based Bayesian updating. Our main technical device is a squared-loss surrogate for welfare maximization. We show that maximizing empirical welfare over a policy class is equivalent to minimizing a scaled squared error in the outcome difference, up to a quadratic regularization controlled by a tuning parameter \zeta > 0. This rewriting yields a General Bayes posterior over decision rules that admits a Gaussian pseudo-likelihood interpretation. We clarify two Bayesian interpretations of the resulting generalized posterior, a working Gaussian view and a decision-theoretic loss-based view. As one implementation example, we introduce neural networks with tanh-squashed outputs. Finally, we provide theoretical guarantees in a PAC-Bayes style.
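The stated equivalence between empirical welfare maximization and a scaled squared error can be checked numerically for binary policies pi(x) in {-1, +1}: since pi**2 = 1, the squared loss equals a constant minus a multiple of the welfare. The parameterization below is my illustration and may differ from the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(4)
tau = rng.normal(size=1000)    # outcome difference Y(1) - Y(0) per unit
zeta = 0.7                     # tuning parameter zeta > 0

# Welfare of a binary policy pi in {-1, +1}: mean(pi * tau).
# Squared-loss surrogate: mean((pi - tau/zeta)**2).
# Expanding, and using pi**2 == 1, the surrogate equals
# 1 + mean(tau**2)/zeta**2 - (2/zeta) * welfare, a constant minus a
# positive multiple of the welfare, so both criteria pick the same policy.
const = 1.0 + np.mean(tau ** 2) / zeta ** 2
for _ in range(100):
    pi = rng.choice([-1.0, 1.0], size=1000)
    welfare = np.mean(pi * tau)
    surrogate = np.mean((pi - tau / zeta) ** 2)
    assert np.isclose(surrogate, const - 2.0 * welfare / zeta)

# Hence the welfare-optimal rule pi*(x) = sign(tau) also minimizes the loss.
pi_star = np.sign(tau)
assert np.mean((pi_star - tau / zeta) ** 2) <= np.mean((-pi_star - tau / zeta) ** 2)
```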
[LG-59] Active Learning for Planet Habitability Classification under Extreme Class Imbalance
链接: https://arxiv.org/abs/2602.23666
作者: R. I. El-Kholy,Z. M. Hayman
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures, 2 tables
Abstract:The increasing size and heterogeneity of exoplanet catalogs have made systematic habitability assessment challenging, particularly given the extreme scarcity of potentially habitable planets and the evolving nature of their labels. In this study, we explore the use of pool-based active learning to improve the efficiency of habitability classification under realistic observational constraints. We construct a unified dataset from the Habitable World Catalog and the NASA Exoplanet Archive and formulate habitability assessment as a binary classification problem. A supervised baseline based on gradient-boosted decision trees is established and optimized for recall in order to prioritize the identification of rare potentially habitable planets. This model is then embedded within an active learning framework, where uncertainty-based margin sampling is compared against random querying across multiple runs and labeling budgets. We find that active learning substantially reduces the number of labeled instances required to approach supervised performance, demonstrating clear gains in label efficiency. To connect these results to a practical astronomical use case, we aggregate predictions from independently trained active-learning models into an ensemble and use the resulting mean probabilities and uncertainties to rank planets originally labeled as non-habitable. This procedure identifies a single robust candidate for further study, illustrating how active learning can support conservative, uncertainty-aware prioritization of follow-up targets rather than speculative reclassification. Our results indicate that active learning provides a principled framework for guiding habitability studies in data regimes characterized by label imbalance, incomplete information, and limited observational resources.
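The margin-sampling query rule compared against random querying above is simple to state: given the model's class probabilities over the unlabeled pool, request labels where the top two probabilities nearly tie. A generic sketch (the classifier and retraining loop are omitted):

```python
import numpy as np

def margin_query(proba, k):
    """Pool-based margin sampling: return indices of the k unlabeled points
    whose top-two class probabilities are closest (most uncertain)."""
    part = np.sort(proba, axis=1)
    margins = part[:, -1] - part[:, -2]   # best minus second-best probability
    return np.argsort(margins)[:k]        # smallest margins queried first

# Toy pool: three candidates with decreasing certainty.
proba = np.array([
    [0.98, 0.01, 0.01],   # confident
    [0.60, 0.30, 0.10],   # moderate
    [0.40, 0.38, 0.22],   # near-tie -> queried first
])
assert list(margin_query(proba, 2)) == [2, 1]
```

In the paper's setting the selected indices would be labeled (habitable / non-habitable), added to the training set, and the gradient-boosted model retrained before the next query round.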
[LG-60] Multivariate Spatio-Temporal Neural Hawkes Processes
链接: https://arxiv.org/abs/2602.23629
作者: Christopher Chukwuemeka,Hojun You,Mikyoung Jun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)
*备注: 16 pages, 20 figures (including supplementary material). Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE)
Abstract:We propose a Multivariate Spatio-Temporal Neural Hawkes Process for modeling complex multivariate event data with spatio-temporal dynamics. The proposed model extends continuous-time neural Hawkes processes by integrating spatial information into latent state evolution through learned temporal and spatial decay dynamics, enabling flexible modeling of excitation and inhibition without predefined triggering kernels. By analyzing fitted intensity functions of deep learning-based temporal Hawkes process models, we identify a modeling gap in how fitted intensity behavior is captured beyond likelihood-based performance, which motivates the proposed spatio-temporal approach. Simulation studies show that the proposed method successfully recovers sensible temporal and spatial intensity structure in multivariate spatio-temporal point patterns, while existing temporal neural Hawkes process approach fails to do so. An application to terrorism data from Pakistan further demonstrates the proposed model’s ability to capture complex spatio-temporal interaction across multiple event types.
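The classical exponential-kernel multivariate Hawkes intensity, whose fixed triggering kernels the neural model replaces with learned temporal and spatial decay dynamics, can be written as:

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta):
    """Conditional intensity of a multivariate Hawkes process with
    exponential kernels: lambda_i(t) = mu_i + sum over past events (s, j)
    of alpha[i, j] * exp(-beta * (t - s)). The parametric form the neural
    Hawkes models above generalize."""
    lam = mu.copy()
    for s, j in history:
        if s < t:
            lam += alpha[:, j] * np.exp(-beta * (t - s))
    return lam

mu = np.array([0.2, 0.1])        # baseline rates per event type
alpha = np.array([[0.5, 0.1],    # cross-excitation matrix
                  [0.3, 0.4]])
history = [(1.0, 0), (2.5, 1)]   # (event time, event type)
lam = hawkes_intensity(3.0, history, mu, alpha, beta=1.0)
assert np.all(lam >= mu)         # pure excitation only raises the intensity
assert lam.shape == (2,)
```

Note that with nonnegative alpha this form can only excite; allowing inhibition (negative influence) while keeping the intensity valid is one of the things latent-state neural parameterizations buy.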
[LG-61] Fairness under Graph Uncertainty: Achieving Interventional Fairness with Partially Known Causal Graphs over Clusters of Variables
链接: https://arxiv.org/abs/2602.23611
作者: Yoichi Chikahara
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 9 figures
Abstract:Algorithmic decisions about individuals require predictions that are not only accurate but also fair with respect to sensitive attributes such as gender and race. Causal notions of fairness align with legal requirements, yet many methods assume access to detailed knowledge of the underlying causal graph, which is a demanding assumption in practice. We propose a learning framework that achieves interventional fairness by leveraging a causal graph over clusters of variables, which is substantially easier to estimate than a variable-level graph. With possible adjustment cluster sets identified from such a cluster causal graph, our framework trains a prediction model by reducing the worst-case discrepancy between interventional distributions across these sets. To this end, we develop a computationally efficient barycenter kernel maximum mean discrepancy (MMD) that scales favorably with the number of sensitive attribute values. Extensive experiments show that our framework strikes a better balance between fairness and accuracy than existing approaches, highlighting its effectiveness under limited causal graph knowledge.
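The fairness penalty builds on kernel MMD between interventional distributions. The barycenter variant is the paper's scalable contribution and is not reproduced here, but the plain (biased) RBF-kernel MMD estimator it extends looks like this:

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared kernel MMD between samples X and Y with
    an RBF kernel, the standard two-sample discrepancy that MMD-based
    fairness penalties minimize."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(8)
same = mmd2(rng.normal(0, 1, (300, 2)), rng.normal(0, 1, (300, 2)))
diff = mmd2(rng.normal(0, 1, (300, 2)), rng.normal(2, 1, (300, 2)))
assert same < diff   # shifted distributions show a larger discrepancy
```

Pairwise MMD across all sensitive-attribute values scales quadratically in the number of values; comparing each distribution to a single barycenter, as the paper does, is what restores favorable scaling.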
[LG-62] Moment Matters: Mean and Variance Causal Graph Discovery from Heteroscedastic Observational Data
链接: https://arxiv.org/abs/2602.23602
作者: Yoichi Chikahara
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 6 figures
Abstract:Heteroscedasticity – where the variance of a variable changes with other variables – is pervasive in real data, and elucidating why it arises from the perspective of statistical moments is crucial in scientific knowledge discovery and decision-making. However, standard causal discovery does not reveal which causes act on the mean versus the variance, as it returns a single moment-agnostic graph, limiting interpretability and downstream intervention design. We propose a Bayesian, moment-driven causal discovery framework that infers separate mean and variance causal graphs from observational heteroscedastic data. We first derive the identification results by establishing sufficient conditions under which these two graphs are separately identifiable. Building on this theory, we develop a variational inference method that learns a posterior distribution over both graphs, enabling principled uncertainty quantification of structural features (e.g., edges, paths, and subgraphs). To address the challenges of parameter optimization in heteroscedastic models with two graph structures, we take a curvature-aware optimization approach and develop a prior incorporation technique that leverages domain knowledge on node orderings, improving sample efficiency. Experiments on synthetic, semi-synthetic, and real data show that our approach accurately recovers mean and variance structures and outperforms state-of-the-art baselines.
[LG-63] Tensor Hypercontraction Error Correction Using Regression
链接: https://arxiv.org/abs/2602.23567
作者: Ishna Satyarth,Eric C. Larson,Devin A. Matthews
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Wavefunction-based quantum methods are some of the most accurate tools for predicting and analyzing the electronic structure of molecules, in particular for accounting for dynamical electron correlation. However, most methods of including dynamical correlation beyond the simple second-order Møller-Plesset perturbation theory (MP2) level are too computationally expensive to apply to large molecules. Approximations which reduce scaling with system size are a potential remedy, such as the tensor hyper-contraction (THC) technique of Hohenstein et al., but also result in additional sources of error. In this work, we correct errors in THC-approximated methods using machine learning. Specifically, we apply THC to third-order Møller-Plesset theory (MP3) as a simplified model for coupled cluster with single and double excitations (CCSD), and train several regression models on observed THC errors from the Main Group Chemistry Database (MGCDB84). We compare performance of multiple linear regression models and non-linear Kernel Ridge regression models. We also investigate correction procedures using absolute and relative corrections and evaluate the corrections for both molecule and reaction energies. We discuss the potential for using regression techniques to correct THC-MP3 errors by comparing it to the “canonical” MP3 reference values and find the optimum technique based on accuracy. We find that non-linear regression models reduced root mean squared errors between THC- and canonical MP3 by a factor of 6-9 for total molecular energies and 2-3 for reaction energies.
[LG-64] VaSST: Variational Inference for Symbolic Regression using Soft Symbolic Trees
链接: https://arxiv.org/abs/2602.23561
作者: Somjit Roy,Pritam Dey,Bani K. Mallick
类目: Methodology (stat.ME); Machine Learning (cs.LG); Symbolic Computation (cs.SC); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 38 pages, 5 figures, 35 tables, Submitted
Abstract:Symbolic regression has recently gained traction in AI-driven scientific discovery, aiming to recover explicit closed-form expressions from data that reveal underlying physical laws. Despite recent advances, existing methods remain dominated by heuristic search algorithms or data-intensive approaches that assume low-noise regimes and lack principled uncertainty quantification. Fully probabilistic formulations are scarce, and existing Markov chain Monte Carlo-based Bayesian methods often struggle to efficiently explore the highly multimodal combinatorial space of symbolic expressions. We introduce VaSST, a scalable probabilistic framework for symbolic regression based on variational inference. VaSST employs a continuous relaxation of symbolic expression trees, termed soft symbolic trees, where discrete operator and feature assignments are replaced by soft distributions over allowable components. This relaxation transforms the combinatorial search over an astronomically large symbolic space into an efficient gradient-based optimization problem while preserving a coherent probabilistic interpretation. The learned soft representations induce posterior distributions over symbolic structures, enabling principled uncertainty quantification. Across simulated experiments and Feynman Symbolic Regression Database within SRBench, VaSST achieves superior performance in both structural recovery and predictive accuracy compared to state-of-the-art symbolic regression methods.
[LG-65] Partition Function Estimation under Bounded f-Divergence
链接: https://arxiv.org/abs/2602.23535
作者: Adam Block,Abhishek Shetty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the statistical complexity of estimating partition functions given sample access to a proposal distribution and an unnormalized density ratio for a target distribution. While partition function estimation is a classical problem, existing guarantees typically rely on structural assumptions about the domain or model geometry. We instead provide a general, information-theoretic characterization that depends only on the relationship between the proposal and target distributions. Our analysis introduces the integrated coverage profile, a functional that quantifies how much target mass lies in regions where the density ratio is large. We show that integrated coverage tightly characterizes the sample complexity of multiplicative partition function estimation and provide matching lower bounds. We further express these bounds in terms of f-divergences, yielding sharp phase transitions depending on the growth rate of f and recovering classical results as a special case while extending to heavy-tailed regimes. Matching lower bounds establish tightness in all regimes. As applications, we derive improved finite-sample guarantees for importance sampling and self-normalized importance sampling, and we show a strict separation between the complexity of approximate sampling and counting under the same divergence constraints. Our results unify and generalize prior analyses of importance sampling, rejection sampling, and heavy-tailed mean estimation, providing a minimal-assumption theory of partition function estimation. Along the way we introduce new technical tools including new connections between coverage and f-divergences as well as a generalization of the classical Paley-Zygmund inequality.
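The basic importance-sampling estimator analyzed above is one line given sample access to the proposal and the unnormalized density ratio. Here is a self-contained Gaussian example (the target and proposal are my choices, not the paper's); with a heavier-tailed proposal the density ratio stays well covered, which is the benign regime of the coverage profile:

```python
import numpy as np

rng = np.random.default_rng(5)

# Target: unnormalized density p_tilde(x) = exp(-x**2 / 2), whose partition
# function is Z = sqrt(2*pi). Proposal: q = N(0, s**2) with s > 1.
# Importance sampling estimator: Z_hat = mean(p_tilde(x) / q(x)), x ~ q.
s = 1.5
x = rng.normal(0.0, s, size=200_000)
q = np.exp(-x ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
p_tilde = np.exp(-x ** 2 / 2)
z_hat = np.mean(p_tilde / q)
assert abs(z_hat - np.sqrt(2 * np.pi)) < 0.05
```

Shrinking s below 1 makes the proposal lighter-tailed than the target, the density ratio unbounded, and the estimator's variance blow up, which is the kind of heavy-tailed regime the paper's lower bounds address.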
[LG-66] Uncovering Physical Drivers of Dark Matter Halo Structures with Auxiliary-Variable-Guided Generative Models
链接: https://arxiv.org/abs/2602.23518
作者: Arkaprabha Ganguli,Anirban Samaddar,Florian Kéruzoré,Nesar Ramachandra,Julie Bessac,Sandeep Madireddy,Emil Constantinescu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Deep generative models (DGMs) compress high-dimensional data but often entangle distinct physical factors in their latent spaces. We present an auxiliary-variable-guided framework for disentangling representations of thermal Sunyaev-Zel’dovich (tSZ) maps of dark matter halos. We introduce halo mass and concentration as auxiliary variables and apply a lightweight alignment penalty to encourage latent dimensions to reflect these physical quantities. To generate sharp and realistic samples, we extend latent conditional flow matching (LCFM), a state-of-the-art generative model, to enforce disentanglement in the latent space. Our Disentangled Latent-CFM (DL-CFM) model recovers the established mass-concentration scaling relation and identifies latent space outliers that may correspond to unusual halo formation histories. By linking latent coordinates to interpretable astrophysical properties, our method transforms the latent space into a diagnostic tool for cosmological structure. This work demonstrates that auxiliary guidance preserves generative flexibility while yielding physically meaningful, disentangled embeddings, providing a generalizable pathway for uncovering independent factors in complex astronomical datasets.
[LG-67] Neural ensemble Kalman filter: Data assimilation for compressible flows with shocks
链接: https://arxiv.org/abs/2602.23461
作者: Xu-Hui Zhou,Lorenzo Beronilla,Michael K. Sleeman,Hangchuan Hu,Matthias Morzfeld,Andrew M. Stuart,Tamer A. Zaki
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Data assimilation (DA) for compressible flows with shocks is challenging because many classical DA methods generate spurious oscillations and nonphysical features near uncertain shocks. We focus here on the ensemble Kalman filter (EnKF). We show that the poor performance of the standard EnKF may be attributed to the bimodal forecast distribution that can arise in the vicinity of an uncertain shock location; this violates the assumptions underpinning the EnKF, which assume a forecast which is close to Gaussian. To address this issue we introduce the new neural EnKF. The basic idea is to systematically embed neural function approximations within ensemble DA by mapping the forecast ensemble of shocked flows to the parameter space (weights and biases) of a deep neural network (NN) and to subsequently perform DA in that space. The nonlinear mapping encodes sharp and smooth flow features in an ensemble of NN parameters. Neural EnKF updates are therefore well-behaved only if the NN parameters vary smoothly within the neural representation of the forecast ensemble. We show that such a smooth variation of network parameters can be enforced via physics-informed transfer learning, and demonstrate that in so-doing the neural EnKF avoids the spurious oscillations and nonphysical features that plague the standard EnKF. The applicability of the neural EnKF is demonstrated through a series of systematic numerical experiments with an inviscid Burgers’ equation, Sod’s shock tube, and a two-dimensional blast wave.
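For reference, the standard stochastic EnKF analysis step that the neural EnKF reuses (in NN parameter space rather than state space) can be sketched as follows; the toy linear-Gaussian setup is mine, not a shocked-flow example:

```python
import numpy as np

def enkf_update(ensemble, obs, H, obs_std, rng):
    """Stochastic EnKF analysis step: nudge each forecast member toward a
    perturbed observation using the ensemble-estimated Kalman gain. This is
    the standard update whose near-Gaussian forecast assumption breaks down
    around uncertain shock locations."""
    n_ens = ensemble.shape[0]
    X = ensemble - ensemble.mean(0)                  # state anomalies
    Y = X @ H.T                                      # observed-space anomalies
    P_yy = Y.T @ Y / (n_ens - 1) + obs_std ** 2 * np.eye(len(obs))
    P_xy = X.T @ Y / (n_ens - 1)
    K = P_xy @ np.linalg.inv(P_yy)                   # Kalman gain
    obs_pert = obs + rng.normal(0.0, obs_std, size=(n_ens, len(obs)))
    return ensemble + (obs_pert - ensemble @ H.T) @ K.T

rng = np.random.default_rng(6)
truth = np.array([1.0, 2.0, 3.0])
ens = truth + 1.0 + rng.normal(0.0, 0.5, size=(50, 3))   # biased forecast
H = np.eye(3)[:2]                                        # observe two states
analysis = enkf_update(ens, truth[:2], H, obs_std=0.1, rng=rng)
err_f = np.linalg.norm(ens.mean(0)[:2] - truth[:2])
err_a = np.linalg.norm(analysis.mean(0)[:2] - truth[:2])
assert err_a < err_f   # observed components are pulled toward the data
```

In the neural EnKF, `ensemble` would hold NN weights and biases encoding each member's flow field rather than the flow state itself, so the update never acts directly on the discontinuity.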
[LG-68] Complex Networks and the Drug Repositioning Problem
链接: https://arxiv.org/abs/2602.23396
作者: Felipe Bivort Haiek
类目: Molecular Networks (q-bio.MN); Machine Learning (cs.LG)
*备注:
Abstract:In this Master’s thesis, the graph properties of a multi-level drug-protein network are studied, together with how the network’s shape has informed discoveries over the years: most are “crawling” discoveries, with a smaller number of “hopping” discoveries. Finally, the network structure is used to drive a network diffusion recommendation system that prioritizes existing drugs for repurposing against proteins in organisms that cause Neglected Tropical Diseases.
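A network-diffusion recommender of the kind mentioned is commonly built on random walk with restart over the drug-protein adjacency matrix; the sketch below is a generic illustration of that technique, not the thesis's actual method, and the function name and parameters are assumptions:

```python
import numpy as np

def rwr_scores(adj, seeds, restart=0.3, iters=100):
    """Random walk with restart on a drug-protein network.

    adj   : (n, n) symmetric adjacency matrix
    seeds : indices of seed nodes (e.g. target proteins of a disease)
    Returns a visit-probability vector; drug nodes with high scores
    are candidates for repurposing against the seeded targets.
    """
    # column-normalize adjacency into a transition matrix,
    # guarding isolated nodes against division by zero
    col = adj.sum(axis=0, keepdims=True)
    T = adj / np.where(col == 0, 1.0, col)
    p0 = np.zeros(adj.shape[0])
    p0[seeds] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(iters):
        p = (1 - restart) * T @ p + restart * p0
    return p
```

Diffusion scores of this kind naturally favor "crawling" candidates near the seeds, which matches the pattern of discoveries the thesis identifies in the historical network.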
[LG-69] Universality of Shallow and Deep Neural Networks on Non-Euclidean Spaces
链接: https://arxiv.org/abs/2602.23381
作者: Vugar Ismailov
类目: General Topology (math.GN); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Functional Analysis (math.FA)
*备注: 23 pages, 35 references
Abstract:We develop a framework for shallow and deep neural networks whose inputs range over a general topological space. The model is built from a prescribed family of continuous feature maps and a fixed scalar activation function, and it reduces to multilayer feedforward networks in the Euclidean case. We focus on the universal approximation property and establish general conditions under which such networks are dense in spaces of continuous vector-valued functions on arbitrary and locally convex topological spaces. In the absence of width constraints, we obtain universality results that extend classical approximation theorems to non-Euclidean settings. A central focus of the paper is the deep narrow framework, in which the width of each hidden layer is uniformly bounded while the depth is allowed to grow. We identify conditions under which such width-constrained deep networks retain universal approximation power. As a concrete example, we employ Ostrand’s extension of the Kolmogorov superposition theorem to derive an explicit universality result for products of compact metric spaces, with width bounds expressed in terms of topological dimension.