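This post lists the latest papers retrieved from arXiv.org on 2026-02-06. It is updated automatically and organized into six broad areas: NLP, CV, ML, AI, IR, and MA.
Note: the daily paper data is pulled from arXiv.org and refreshed automatically at around 12:30 each day.
Tip: if the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day where possible.
Contents
Overview (2026-02-06)
697 papers were added today, including:
- Natural language processing: 101 papers (Computation and Language (cs.CL))
- Artificial intelligence: 220 papers (Artificial Intelligence (cs.AI))
- Computer vision: 129 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 253 papers (Machine Learning (cs.LG))
- Multiagent systems: 10 papers (Multiagent Systems (cs.MA))
Multiagent Systems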
[MA-0] CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction ICRA2026
[Quick read]: This paper addresses information gathering in multi-agent multi-task Embodied Question Answering (MM-EQA), in particular how several robots with different manipulation capabilities, cooperating on natural-language instructions, can communicate effectively to avoid redundant effort and coordinate more efficiently. The key to its solution is CommCP, a decentralized communication framework based on a large language model (LLM). The framework uses conformal prediction to calibrate the generated messages, which reduces receiver distraction, improves communication reliability, and in turn raises task success rate and exploration efficiency.
Link: https://arxiv.org/abs/2602.06038
Authors: Xiaopan Zhang,Zejin Wang,Zhixu Li,Jianpeng Yao,Jiachen Li
Affiliations: University of California, Riverside
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: IEEE International Conference on Robotics and Automation (ICRA 2026); Project Website: this https URL
Abstract:To complete assignments provided by humans in natural language, robots must interpret commands, generate and answer relevant questions for scene understanding, and manipulate target objects. Real-world deployments often require multiple heterogeneous robots with different manipulation capabilities to handle different assignments cooperatively. Beyond the need for specialized manipulation skills, effective information gathering is important in completing these assignments. To address this component of the problem, we formalize the information-gathering process in a fully cooperative setting as an underexplored multi-agent multi-task Embodied Question Answering (MM-EQA) problem, which is a novel extension of canonical Embodied Question Answering (EQA), where effective communication is crucial for coordinating efforts without redundancy. To address this problem, we propose CommCP, a novel LLM-based decentralized communication framework designed for MM-EQA. Our framework employs conformal prediction to calibrate the generated messages, thereby minimizing receiver distractions and enhancing communication reliability. To evaluate our framework, we introduce an MM-EQA benchmark featuring diverse, photo-realistic household scenarios with embodied questions. Experimental results demonstrate that CommCP significantly enhances the task success rate and exploration efficiency over baselines. The experiment videos, code, and dataset are available on our project website: this https URL.
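The abstract does not spell out how CommCP applies conformal prediction to messages, so the following is only a minimal sketch of generic split conformal prediction, the standard technique the name refers to. The function names, the nonconformity score `1 - p`, and the room labels are assumptions made for illustration, not taken from the paper.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    cal_scores: nonconformity scores (here assumed to be 1 - p(true label))
                computed on a held-out calibration set.
    alpha:      target miscoverage rate, e.g. 0.1 for roughly 90% coverage.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points: only the trivial "keep everything" set is valid.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep every candidate whose nonconformity score 1 - p is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


# Hypothetical usage: candidate answers with model confidences.
threshold = conformal_threshold([0.1, 0.4, 0.2, 0.3], alpha=0.5)
candidates = prediction_set({"kitchen": 0.8, "bedroom": 0.15, "hall": 0.05}, threshold)
```

One plausible use, again an assumption, is to share a message only when the calibrated set is small, so that receivers are not distracted by low-confidence content.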
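[MA-1] PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling
[Quick read]: This paper addresses two problems: LLM-based multi-agent systems are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) struggle to integrate rich individual-level signals and non-stationary behaviors. The key to its solution is PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then realize transitions stochastically under local constraints, which decouples population inference from individual-level variability. The paper also introduces ANCHOR, an LLM-agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, which reduces LLM calls by up to 6-8 times.
Link: https://arxiv.org/abs/2602.06030
Authors: Kavana Venkatesh,Yinhan He,Jundong Li,Jiaming Cui
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: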
Abstract:Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.
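[MA-2] Learning to Share: Selective Memory for Efficient Parallel Agentic Systems
[Quick read]: This paper addresses the computational redundancy that arises in parallel agentic systems when multiple agent teams independently work on similar sub-problems. Its core solution is Learning to Share (LTS), a learned shared-memory mechanism. The key is a memory bank that is globally accessible to all teams, plus a lightweight controller, trained with stepwise reinforcement learning and usage-aware credit assignment, that decides whether each intermediate agent step is added to memory. This enables selective cross-team information reuse while controlling context growth. Experiments show that LTS significantly reduces overall runtime while matching or improving task performance.
Link: https://arxiv.org/abs/2602.05965
Authors: Joseph Fioresi,Parth Parag Kulkarni,Ashmal Vayani,Song Wang,Mubarak Shah
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: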
Abstract:Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: this https URL
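[MA-3] Location-Aware Dispersion on Anonymous Graphs
[Quick read]: This paper addresses LOCATION-AWARE DISPERSION: in an anonymous, connected, undirected graph, k colored mobile robots must relocate so that each occupies a distinct node of its own color. The problem generalizes the classical DISPERSION problem, which places no color constraint on nodes; the new problem adds location awareness, so robots must coordinate their placement according to their own colors and the colors of the nodes. The key to the solution is a set of deterministic algorithms with guaranteed bounds on time and per-robot memory, complemented by an impossibility result and a lower bound for any deterministic algorithm. Together these establish the algorithmic feasibility of the problem and show where it is harder to solve efficiently than DISPERSION.
Link: https://arxiv.org/abs/2602.05948
Authors: Himani,Supantha Pandit,Gokarna Sharma
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 3 tables, 2 figures, 6 pseudo-codes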
Abstract:The well-studied DISPERSION problem is a fundamental coordination problem in distributed robotics, where a set of mobile robots must relocate so that each occupies a distinct node of a network. DISPERSION assumes that a robot can settle at any node as long as no other robot settles on that node. In this work, we introduce LOCATION-AWARE DISPERSION, a novel generalization of DISPERSION that incorporates location awareness: Let $G = (V, E)$ be an anonymous, connected, undirected graph with $n = |V|$ nodes, each labeled with a color $\mathsf{col}(v) \in C = \{c_1, \dots, c_t\}$, $t \leq n$. A set $R = \{r_1, \dots, r_k\}$ of $k \leq n$ mobile robots is given, where each robot $r_i$ has an associated color $\mathsf{col}(r_i) \in C$. Initially placed arbitrarily on the graph, the goal is to relocate the robots so that each occupies a distinct node of the same color. When $|C| = 1$, LOCATION-AWARE DISPERSION reduces to DISPERSION. There is a solution to DISPERSION in graphs with any $k \leq n$ without knowing $k, n$. Like DISPERSION, the goal is to solve LOCATION-AWARE DISPERSION minimizing both time and memory requirement at each agent. We develop several deterministic algorithms with guaranteed bounds on both time and memory requirement. We also give an impossibility and a lower bound for any deterministic algorithm for LOCATION-AWARE DISPERSION. To the best of our knowledge, the presented results collectively establish the algorithmic feasibility of LOCATION-AWARE DISPERSION in anonymous networks and also highlight the challenges on getting an efficient solution compared to the solutions for DISPERSION.
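To make the goal condition concrete, here is a small sketch that checks whether a final placement satisfies LOCATION-AWARE DISPERSION as defined in the abstract: every robot sits on a distinct node whose color matches its own. It only validates a configuration; it is not one of the paper's algorithms, and the dictionary-based representation and all names are assumptions made for illustration.

```python
def is_location_aware_dispersed(node_color, robot_color, position):
    """Return True if every robot occupies a distinct node of its own color.

    node_color:  maps node id -> color, i.e. col(v)
    robot_color: maps robot id -> color, i.e. col(r_i)
    position:    maps robot id -> node id the robot has settled on
    """
    occupied = set()
    for robot, node in position.items():
        if node in occupied:
            return False  # two robots share a node
        if node_color[node] != robot_color[robot]:
            return False  # robot settled on a node of another color
        occupied.add(node)
    return True


# Hypothetical three-node graph with two colors.
node_color = {0: "red", 1: "blue", 2: "red"}
robot_color = {"r1": "red", "r2": "blue"}
ok = is_location_aware_dispersed(node_color, robot_color, {"r1": 0, "r2": 1})
```

When all nodes and robots share a single color, the color test is always satisfied and the check reduces to the ordinary DISPERSION condition, matching the $|C| = 1$ case in the abstract.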
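[MA-4] Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents
[Quick read]: This paper asks whether large language models (LLMs) can predict cognitive biases at the individual level in interactive settings and emulate the dynamics of biased human decision-making when contextual factors such as cognitive load come into play. The key to its approach is to adapt three well-established decision scenarios into a conversational setting, run a human experiment (N=1100) in which participants interact with a chatbot, and then use GPT-4 and GPT-5 to simulate the same conditions from participant demographics and dialogue transcripts. The results show that the LLMs reproduce human biases with precision and that models differ notably in how well they align with human behavior, which has implications for designing and evaluating adaptive, bias-aware interactive AI systems.
Link: https://arxiv.org/abs/2602.05597
Authors: Stephen Pilli,Vivek Nallur
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: Accepted at CHI’26. arXiv admin note: substantial text overlap with arXiv:2601.11049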
Abstract:Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors, such as cognitive load, interact with these biases. We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment (N=1100). Participants engaged with a chatbot that facilitates decision-making through simple or complex dialogues. Results revealed robust biases. To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5. The LLMs reproduced human biases with precision. We found notable differences between models in how they aligned human behavior. This has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems in interactive contexts.
[MA-5] AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval AAMAS2026
[Quick read]: This paper addresses the effectiveness and adaptability of generative AI in inventory management, specifically whether multi-agent systems (MASs) driven by large language models (LLMs) can consistently derive optimal ordering policies across different supply chain scenarios. The key to its solution is a novel agent called AIM-RM, which reuses similar historical experiences through similarity matching and thereby improves adaptability and robustness across diverse supply chain scenarios.
Link: https://arxiv.org/abs/2602.05524
Authors: Konosuke Yoshizato,Kazuma Shimizu,Ryota Higa,Takanobu Otsuka
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: A full version of the extended abstract accepted by the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Abstract:This study investigates large language model (LLM)-based multi-agent systems (MASs) as a promising approach to inventory management, which is a key component of supply chain management. Although these systems have gained considerable attention for their potential to address the challenges associated with typical inventory management methods, key uncertainties regarding their effectiveness persist. Specifically, it is unclear whether LLM-based MASs can consistently derive optimal ordering policies and adapt to diverse supply chain scenarios. To address these questions, we examine an LLM-based MAS with a fixed-ordering strategy prompt that encodes the stepwise processes of the problem setting and a safe-stock strategy commonly used in inventory management. Our empirical results demonstrate that, even without detailed prompt adjustments, an LLM-based MAS can determine optimal ordering decisions in a restricted scenario. To enhance adaptability, we propose a novel agent called AIM-RM, which leverages similar historical experiences through similarity matching. Our results show that AIM-RM outperforms benchmark methods across various supply chain scenarios, highlighting its robustness and adaptability.
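[MA-6] LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation
[Quick read]: This paper addresses the data annotation bottleneck in the humanities and social sciences, particularly for complex semantic tasks such as metaphor identification. The key to its solution is LinguistAgent, a platform built on a reflective multi-model architecture with a dual-agent workflow, an Annotator and a Reviewer, that simulates a professional peer-review process to automate annotation. The system supports comparative experiments across three paradigms: prompt engineering (zero-shot and few-shot), retrieval-augmented generation, and fine-tuning. On metaphor identification it provides real-time token-level evaluation (precision, recall, and F1 score) against human gold standards, with the aim of narrowing the gap between what LLMs can do in theory and their practical utility for researchers.
Link: https://arxiv.org/abs/2602.05493
Authors: Bingru Li
Affiliations: University of Birmingham
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: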
Abstract:Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent’s efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and $F_1$ score) against human gold standards. The application and codes are released on this https URL.
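The token-level metrics named in the abstract can be illustrated with a short sketch. It assumes binary per-token labels (1 for a metaphorical token, 0 otherwise) and uses the standard definitions of precision, recall, and F1; the function name and label encoding are illustrative and not taken from LinguistAgent's code.

```python
def token_prf(gold, pred):
    """Token-level precision, recall, and F1 for binary labels (1 = metaphor)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must label the same tokens")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Guard against empty denominators when nothing is predicted or nothing is gold.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical five-token sentence: 2 true positives, 1 false positive, 1 false negative.
precision, recall, f1 = token_prf([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

In this example precision, recall, and F1 all equal 2/3, since one false positive and one false negative sit alongside two true positives.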
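[MA-7] A Data Driven Structural Decomposition of Dynamic Games via Best Response Maps
[Quick read]: This paper addresses the difficulty of computing Nash or generalized Nash equilibria (GNE) in dynamic games, which stems from tightly coupled optimality conditions, nested optimization structures, and poor numerical conditioning. Existing game-theoretic solvers solve the fully coupled joint game, which requires explicit models of every agent's objective and constraints, while learning-based approaches decouple the interaction through prediction or policy approximation and often sacrifice equilibrium consistency. The key to the solution is a conceptually novel reformulation: an offline-compiled best-response map is embedded as a feasibility constraint, giving a structural reduction of the game that removes nested optimization layers and derivative coupling. Under standard regularity conditions, if the best-response operator is exact, any converged solution of the reduced problem corresponds to a local open-loop Nash (GNE) equilibrium of the original game; with a learned surrogate, the solution is approximately equilibrium-consistent, up to the best-response approximation error.
Link: https://arxiv.org/abs/2602.05324
Authors: Mahdis Rabbani,Navid Mojahed,Shima Nazari
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments: 11 pages, 6 figures, 5 tables, Submitted to RSS 2026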
Abstract:Dynamic games are powerful tools to model multi-agent decision-making, yet computing Nash (generalized Nash) equilibria remains a central challenge in such settings. Complexity arises from tightly coupled optimality conditions, nested optimization structures, and poor numerical conditioning. Existing game-theoretic solvers address these challenges by directly solving the joint game, typically requiring explicit modeling of all agents’ objective functions and constraints, while learning-based approaches often decouple interaction through prediction or policy approximation, sacrificing equilibrium consistency. This paper introduces a conceptually novel formulation for dynamic games by restructuring the equilibrium computation. Rather than solving a fully coupled game or decoupling agents through prediction or policy approximation, a data-driven structural reduction of the game is proposed that removes nested optimization layers and derivative coupling by embedding an offline-compiled best-response map as a feasibility constraint. Under standard regularity conditions, when the best-response operator is exact, any converged solution of the reduced problem corresponds to a local open-loop Nash (GNE) equilibrium of the original game; with a learned surrogate, the solution is approximately equilibrium-consistent up to the best-response approximation error. The proposed formulation is supported by mathematical proofs, accompanying a large-scale Monte Carlo study in a two-player open-loop dynamic game motivated by the autonomous racing problem. Comparisons are made against state-of-the-art joint game solvers, and results are reported on solution quality, computational cost, and constraint satisfaction.
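[MA-8] Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science
[Quick read]: This paper addresses the lack of a unified scientific framework for LLM-based Multi-Agent Systems (MAS), focusing on two bottlenecks: without a structured taxonomy of factors, research is left to unguided trial and error; and without a unified metric, genuine collaboration gain cannot be told apart from improvements that come from simply adding resources. The key to its solution is an integrated design-science framework built around a collaboration gain metric (Γ), which isolates intrinsic gains from those due to increased budgets. On top of Γ, the paper proposes a factor attribution paradigm and constructs a systematic MAS factor library that structures the design space into control-level presets and information-level dynamics, so that collaboration-driving factors can be identified systematically.
Link: https://arxiv.org/abs/2602.05289
Authors: Jingru Fan,Dewen Liu,Yufan Dang,Huatao Li,Yuheng Wang,Wei Liu,Feiyu Duan,Xuanwen Ding,Shu Yao,Lin Wu,Ruijie Shi,Wai-Shing Leung,Yuan Cheng,Zhongyu Wei,Cheng Yang,Chen Qian,Zhiyuan Liu,Maosong Sun
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: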
Abstract:Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate to establish the collaboration gain metric ($\Gamma$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $\Gamma$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.
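[MA-9] Steering the Herd: A Framework for LLM-based Control of Social Learning
[Quick read]: This paper addresses how information-mediating algorithms such as large language models (LLMs) can influence individual decisions, and thereby social welfare, by controlling the information structure in social learning settings. The core challenge is to model a dynamic social learning process in which a planner controls the information agents receive while the agents also learn from the decisions of earlier agents. The key to the solution is a new optimization framework that combines dynamic programming, decentralized action choices, and Bayesian belief updates. The paper proves the convexity of the value function and characterizes the optimal policies of altruistic and biased planners: the former seeks to improve social welfare, the latter to induce a specific preferred action. Notably, in some regimes the biased planner intentionally obfuscates the agents' signals, and even under stringent transparency constraints (information parity with individuals, no lying or cherry-picking, and full observability) information mediation can substantially shift social welfare in either direction. Simulations in which LLMs act as both planner and agents show emergent strategic steering that broadly mirrors the predicted trends, with deviations that suggest the influence of non-Bayesian reasoning.
Link: https://arxiv.org/abs/2504.02648
Authors: Raghu Arghal,Kevin He,Shirin Saeedi Bidokhti,Saswati Sarkar
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: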
Abstract:Algorithms increasingly serve as information mediators–from social media feeds and targeted advertising to the increasing ubiquity of LLMs. This engenders a joint process where agents combine private, algorithmically-mediated signals with learning from peers to arrive at decisions. To study such settings, we introduce a model of controlled sequential social learning in which an information-mediating planner (e.g. an LLM) controls the information structure of agents while they also learn from the decisions of earlier agents. The planner may seek to improve social welfare (altruistic planner) or to induce a specific action the planner prefers (biased planner). Our framework presents a new optimization problem for social learning that combines dynamic programming with decentralized action choices and Bayesian belief updates. We prove the convexity of the value function and characterize the optimal policies of altruistic and biased planners, which attain desired tradeoffs between the costs they incur and the payoffs they earn from induced agent choices. Notably, in some regimes the biased planner intentionally obfuscates the agents’ signals. Even under stringent transparency constraints–information parity with individuals, no lying or cherry-picking, and full observability–we show that information mediation can substantially shift social welfare in either direction. We complement our theory with simulations in which LLMs act as both planner and agents. Notably, the LLM planner in our simulations exhibits emergent strategic behavior in steering public opinion that broadly mirrors the trends predicted, though key deviations suggest the influence of non-Bayesian reasoning consistent with the cognitive patterns of both humans and LLMs trained on human-like data. Together, we establish our framework as a tractable basis for studying the impact and regulation of LLM information mediators.
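Natural Language Processing
[NLP-0] DFlash: Block Diffusion for Flash Speculative Decoding
[Quick read]: This paper addresses the high inference latency and poor GPU utilization of autoregressive large language models (LLMs), which follow from their inherently sequential decoding. Existing speculative decoding methods ease the bottleneck with a fast draft model whose outputs the target model verifies in parallel, but they still draft autoregressively, which limits practical speedups. The key to the solution is DFlash, a framework that uses a lightweight block diffusion model for parallel drafting and conditions the draft model on context features extracted from the target model, which improves draft quality and acceptance rates. DFlash achieves over 6x lossless acceleration and up to 2.5x higher speedup than the state-of-the-art method EAGLE-3.
Link: https://arxiv.org/abs/2602.06036
Authors: Jian Chen,Yesheng Liang,Zhijian Liu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: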
Abstract:Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
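The draft-then-verify step that speculative decoding relies on can be sketched as follows. This is the simplest greedy variant, written under stated assumptions: `target_next` is a hypothetical stand-in for the target model's greedy next-token choice, the drafter is left abstract, and a real system would score all draft positions in one parallel forward pass rather than in a Python loop. It does not reproduce DFlash's block diffusion drafter.

```python
def verify_draft(prefix, draft, target_next):
    """Greedy verification of draft tokens against a target model.

    prefix:      tokens already committed
    draft:       tokens proposed by the (hypothetical) draft model
    target_next: function mapping a token sequence to the target's greedy next token

    Returns the tokens to commit: the longest draft prefix the target agrees with,
    followed by one token chosen by the target itself.
    """
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First disagreement: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy target model (an assumption for the demo): next token is last token + 1, modulo 10.
def toy_target(seq):
    return (seq[-1] + 1) % 10


committed = verify_draft([1], [2, 3, 9, 5], toy_target)
```

Because every committed token is one the target would have chosen greedily, the output matches plain greedy decoding of the target, which is the sense in which this kind of acceleration is lossless. Higher acceptance rates mean more tokens are committed per verification step.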
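[NLP-1] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
[Quick read]: This paper addresses the inefficiency and loss of query-critical information that occur when large language model (LLM) agents operating beyond a single context window rely on offline, query-agnostic memory construction. Using memory at runtime is a natural alternative, but prior work often incurs substantial overhead and offers little explicit control over the performance-cost trade-off. The key to the solution is BudgetMem, a runtime agent memory framework that structures memory processing as a set of modules, each offered in three budget tiers (Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, assigns a tier to each module, giving explicit, query-aware control over the balance between task performance and memory construction cost. BudgetMem surpasses strong baselines in the high-budget setting and delivers better accuracy-cost frontiers under tighter budgets.
Link: https://arxiv.org/abs/2602.06025
Authors: Haozhen Zhang,Haodong Yue,Tao Feng,Quanyu Long,Jianzhu Bao,Bowen Jin,Weizhi Zhang,Xiao Li,Jiaxuan You,Chengwei Qin,Wenya Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL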
Abstract:Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
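[NLP-2] Multi-Token Prediction via Self-Distillation
[Quick read]: This paper addresses the fact that current techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building complex inference pipelines. The key to its solution is a simple online distillation objective that converts a pretrained autoregressive language model directly into a standalone multi-token prediction model. The resulting model keeps exactly the same implementation as the initial checkpoint and needs no auxiliary verifier or specialized inference code. On GSM8K, it decodes more than 3x faster on average at a 5% drop in accuracy relative to single-token decoding.
Link: https://arxiv.org/abs/2602.06019
Authors: John Kirchenbauer,Abhimanyu Hans,Brian Bartoldson,Micah Goldblum,Ashwinee Panda,Tom Goldstein
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages and 5 figures in the main body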
Abstract:Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than 3× faster on average at 5% drop in accuracy relative to single token decoding performance.
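[NLP-3] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies
[Quick read]: This paper addresses the limited understanding of what determines the accuracy of large language models (LLMs) when they are used zero-shot to assess mental health conditions such as post-traumatic stress disorder (PTSD). Its approach is to vary two dimensions systematically: the contextual knowledge given to the model, such as detailed construct definitions and the context of the narrative, and the modeling strategy, including reasoning effort, model size, and ensembling a supervised model with zero-shot LLMs. The study finds that LLMs are most accurate when given detailed construct definitions and narrative context, that more reasoning effort improves estimation accuracy, and that the best performance comes from ensembling a supervised model with the zero-shot LLMs.
Link: https://arxiv.org/abs/2602.06015
Authors: Panagiotis Kaliosis,Adithya V Ganesan,Oscar N.E. Kjell,Whitney Ringwald,Scott Feltman,Melissa A. Carr,Dimitris Samaras,Camilo Ruggero,Benjamin J. Luft,Roman Kotov,Andrew H. Schwartz
Affiliations: Stony Brook University; Vanderbilt University
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 3 figures, 5 tables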
Abstract:Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters while closed-weight (o3-mini, gpt-5) models improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
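[NLP-4] Speech Emotion Recognition Leveraging OpenAI's Whisper Representations and Attentive Pooling Methods
[Quick read]: This paper addresses the research bottleneck in Speech Emotion Recognition (SER) caused by the lack of standard, sufficiently large datasets. The key to its solution is to use Whisper, a pretrained automatic speech recognition (ASR) model, as a feature extractor and to propose two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, that reduce the dimensionality of Whisper representations while preserving emotional features. The method achieves state-of-the-art results on the Persian ShEMO dataset (a 2.47% improvement in unweighted accuracy), and the authors find that intermediate encoder layers often perform better for SER on the Persian dataset, offering a lightweight and efficient alternative to much larger models.
Link: https://arxiv.org/abs/2602.06000
Authors: Ali Shendabadi,Parnia Izadirad,Mostafa Salehi,Mahmoud Bijankhan
Affiliations: University of Tehran
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: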
Abstract:Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
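The abstract names its pooling methods without giving formulas, so the sketch below shows only the generic idea behind attention-based pooling: a scoring vector assigns each frame a weight through a softmax, and the weighted sum collapses a variable-length sequence of frame features into one fixed-size vector. It is a single-head version with a hypothetical learned vector `w`, not the paper's Multi-head Attentive Average Pooling or QKV Pooling.

```python
import math


def attentive_pool(frames, w):
    """Single-head attention pooling over a sequence of frame feature vectors.

    frames: list of T feature vectors, each of dimension D
    w:      scoring vector of dimension D (would be learned; fixed here for illustration)

    Returns (pooled, weights): a D-dimensional vector and the T attention weights.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * frame[d] for a, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# With a zero scoring vector every frame gets equal weight, so pooling is a plain average.
pooled, weights = attentive_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

A non-zero scoring vector shifts weight toward the frames it scores highly, which is how such a layer can emphasize the parts of an utterance that matter for the task.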
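[NLP-5] DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
[Quick read]: This paper addresses the suboptimal quality and efficiency that result when diffusion large language models (dLLMs) generate text with a fixed, predefined (naive) block schedule. Such a schedule is agnostic to semantic difficulty: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. The key to the solution is Dynamic Sliding Block (DSB), a training-free scheduling method that uses a sliding block with a dynamic size to adapt to semantic difficulty and overcome the rigidity of the naive block. The paper also introduces DSB Cache, a training-free KV-cache mechanism tailored to DSB, to further improve efficiency. Experiments show that DSB together with DSB Cache consistently improves both generation quality and inference efficiency for dLLMs.
Link: https://arxiv.org/abs/2602.05992
Authors: Lizhuo Luo,Shenggui Li,Yonggang Wen,Tianwei Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: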
Abstract:Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at this https URL.
[NLP-6] SAGE: Benchmarking and Improving Retrieval for Deep Research Agents ACL
[Quick read]: This paper addresses the poor performance of current LLM-based retrievers in deep research agent workflows, especially on scientific literature retrieval that requires complex reasoning. Its core finding is that, although LLM-based retrievers can follow instructions and reason, in practice they fall well behind traditional methods such as BM25 because the agents generate keyword-oriented sub-queries. The key to the solution is a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, which makes retrieval easier for off-the-shelf retrievers and yields gains of 8% and 2% on short-form and open-ended questions, respectively.
Link: https://arxiv.org/abs/2602.05975
Authors: Tiansheng Hu,Yilun Zhao,Canyu Zhang,Arman Cohan,Chen Zhao
Affiliations: NYU Shanghai; Yale University; Center for Data Science, New York University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Submission to ACL ARR 2026 January
Abstract:Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong capability in following instructions or reasoning. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a retrieval corpus of 200,000 papers. We evaluate six deep research agents and find that all systems struggle with reasoning-intensive retrieval. Using DR Tulu as backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms LLM-based retrievers by approximately 30%, as existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields 8% and 2% gains on short-form and open-ended questions, respectively.
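[NLP-7] Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space ICLR2026
[Quick read]: This paper addresses how to quantify the dynamics of human concept production and retrieval in semantic space, that is, how to model semantic representation as a computable form of geometric navigation. Its core solution is a semantic trajectory framework based on cumulative embeddings: each participant's sequence of responses in a language task is mapped to a path in embedding space, from which geometric and dynamical metrics are extracted, including distance to next, distance to centroid, entropy, velocity, and acceleration. The method needs little manual linguistic pre-processing and applies to cross-linguistic and clinical settings. Different transformer text embedding models give similar results, which suggests that the learned semantic representations share structure, and the framework offers a computable bridge between cognitive modeling and learned representations.
Link: https://arxiv.org/abs/2602.05971
Authors: Felipe D. Toro-Hernández,Jesuino Vieira Filho,Rodrigo M. Cabral-Carvalho
Affiliations: Center of Mathematics, Computing and Cognition, Federal University of ABC; Department of Computer Science and Operations Research, Université de Montréal
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 10 pages, 6 figures (excluding refs/appendix). Accepted to ICLR 2026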
Abstract:Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representations, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
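The trajectory metrics named in the abstract can be illustrated with a minimal sketch. It assumes that a "cumulative embedding" is the running mean of the response embeddings seen so far and that distances are Euclidean; the abstract does not confirm either choice, and the function names below are hypothetical. Entropy is omitted because its definition is not given.

```python
import math

def cumulative_embeddings(vectors):
    """Running mean of the embeddings seen so far (assumed definition)."""
    points, running = [], [0.0] * len(vectors[0])
    for i, vec in enumerate(vectors, start=1):
        running = [r + v for r, v in zip(running, vec)]
        points.append([r / i for r in running])
    return points

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def trajectory_metrics(points):
    """Geometric and dynamical metrics along a trajectory of points."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Velocity: displacement between consecutive points (unit time step assumed).
    velocity = [[b[d] - a[d] for d in range(dim)]
                for a, b in zip(points, points[1:])]
    # Acceleration: change in velocity between consecutive steps.
    acceleration = [[b[d] - a[d] for d in range(dim)]
                    for a, b in zip(velocity, velocity[1:])]
    return {
        "distance_to_next": [euclidean(a, b) for a, b in zip(points, points[1:])],
        "distance_to_centroid": [euclidean(p, centroid) for p in points],
        "speed": [norm(v) for v in velocity],
        "acceleration_norm": [norm(a) for a in acceleration],
    }

# Toy 2-D "embeddings" standing in for real model outputs.
toy = [[0.0, 0.0], [2.0, 0.0], [2.0, 4.0]]
metrics = trajectory_metrics(cumulative_embeddings(toy))
```

With a running mean, each new response moves the trajectory less as the sequence grows, which is consistent with the abstract's remark that the cumulative variant suits longer trajectories.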
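[NLP-8] Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training
[Quick read]: This paper addresses the performance bottleneck of long reasoning models in multilingual settings: models tend to reason in English rather than in the language of the question, and accuracy drops substantially when they are forced to reason in the question language. The problem stems from limited ability in both multilingual question understanding and multilingual reasoning. The key to the solution is TRIT (Translation-Reasoning Integrated Training), a framework that jointly optimizes translation training and multilingual reasoning training. Without external feedback or additional multilingual data, it improves multilingual question understanding and response generation together, raising both answer correctness and language consistency.
Link: https://arxiv.org/abs/2602.05940
Authors: Junxiao Liu,Zhijun Wang,Yixiao Li,Zhejian Lai,Liqian Huang,Xin Huang,Xue Han,Junlan Feng,Shujian Huang
Affiliations: Nanjing University; University of Tübingen; China Mobile Communications Company Limited Research Institute
Subjects: Computation and Language (cs.CL)
Comments: 16 pages, 11 figures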
Abstract:Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.
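[NLP-9] Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions
[Quick read]: This paper asks whether multilingual large language models (LLMs) answer value-laden multiple-choice questions (MCQs) consistently: do they respond stably across languages, like theoretical polyglots, or do their answers depend on the language, like a multitude of monolingual models expressing different values through a single model? The key to the solution is a new corpus, the Multilingual European Value Survey (MEVS), made up of human-translated survey questions in 8 European languages, which avoids the reliance on machine translation or ad hoc prompts seen in prior work. On this basis, more than thirty multilingual LLMs of different sizes, origins, and alignment fine-tuning status are tested systematically under controlled prompt variations (answer order, symbol type, and tail character), revealing how language affects the consistency of model behavior on value-related tasks.
Link: https://arxiv.org/abs/2602.05932
Authors: Léo Labat,Etienne Ollion,François Yvon
Affiliations: Sorbonne Université; CNRS; ISIR; CREST; Institut Polytechnique de Paris
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 5 figures (8 pages of references and appendices)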
Abstract:Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
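[NLP-10] KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
[Quick read]: This paper addresses the problem that, during autoregressive decoding in large language models (LLMs), reading and writing the KV-cache quickly saturates GPU memory bandwidth as context length grows. Most existing KV-cache compression methods overlook the data-dependent nature of the cache and its variation across layers. The paper proposes KV-CoRE (KV-cache Compressibility by Rank Evaluation), a method based on singular value decomposition (SVD) for quantifying the data-dependent low-rank compressibility of KV-caches. Its key idea is to compute the optimal low-rank approximation under the Frobenius norm; because the method is gradient-free and incremental, it enables efficient dataset-level and layer-wise evaluation. The analysis reveals systematic links between compressibility and model architecture, training data, and language coverage, and establishes the first large-scale benchmark of KV-cache compressibility in LLMs, providing a basis for dynamic, data-aware compression and data-centric model development.
Link: https://arxiv.org/abs/2602.05929
Authors: Jian Chen,Zhuoran Wang,Jiayu Qin,Ming Li,Meng Wang,Changyou Chen,Yin Chen,Qizhen Weng,Yirui Liu
Affiliations: University at Buffalo; Institute of Artificial Intelligence (TeleAI), China Telecom; Dolby Laboratories; Delft University of Technology; University of Maryland; ByteDance
Subjects: Computation and Language (cs.CL)
Comments: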
Abstract:Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
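For orientation, the two standard quantities behind an SVD-based compressibility measure can be written out. The Eckart-Young result (first line) and the entropy-based effective rank (second line) are textbook definitions; dividing by min(n, d) to normalize is an assumption, since the abstract does not give the paper's exact normalization. Here K stands for a hypothetical n × d key (or value) cache matrix of one layer, with singular values σ_i.

```latex
K = U \Sigma V^\top, \qquad
K_r = \sum_{i=1}^{r} \sigma_i u_i v_i^\top, \qquad
\lVert K - K_r \rVert_F^2 = \sum_{i>r} \sigma_i^2

p_i = \frac{\sigma_i}{\sum_j \sigma_j}, \qquad
\operatorname{erank}(K) = \exp\Bigl(-\sum_i p_i \log p_i\Bigr), \qquad
\operatorname{NER}(K) = \frac{\operatorname{erank}(K)}{\min(n, d)}
\quad \text{(assumed normalization)}
```

A value near 1 would mean the singular values are spread evenly (hard to compress), while a value near 0 would mean a few directions dominate (easy to compress).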
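[NLP-11] Codified Finite-state Machines for Role-playing
[Quick read]: This paper addresses the difficulty large language models (LLMs) have in keeping a character's internal state consistent during role-playing (RP). Existing prompting-based methods mainly capture surface behavior and fail to track the latent states that drive interaction. The key to the solution is Codified Finite-State Machines (CFSMs), in which an LLM automatically codifies a textual character profile into an interpretable state-transition structure that explicitly models the character's key states and transition logic. The approach is further extended to Codified Probabilistic Finite-State Machines (CPFSMs), which describe transitions as probability distributions over states and so capture uncertainty and variability, improving the modeling of both consistency and diversity of character behavior in an open-ended semantic space.
Link: https://arxiv.org/abs/2602.05905
Authors: Letian Peng,Yupeng Hou,Kun Zhou,Jingbo Shang
Affiliations: University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments: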
Abstract:Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
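The difference between a deterministic FSM and its probabilistic extension can be sketched as follows. The states, events, and probabilities are hypothetical and written by hand; in the paper the transition logic is produced by LLM-based coding from a character profile, which this sketch does not attempt to reproduce.

```python
# Hypothetical hand-written transitions standing in for LLM-codified logic.
TRANSITIONS = {
    ("calm", "insult"): "angry",
    ("angry", "apology"): "calm",
    ("calm", "gift"): "happy",
    ("happy", "insult"): "angry",
}

def step(state, event):
    """Deterministic FSM: unknown (state, event) pairs keep the current state."""
    return TRANSITIONS.get((state, event), state)

# Probabilistic variant: each transition is a distribution over next states.
PROB_TRANSITIONS = {
    ("calm", "insult"): {"angry": 0.7, "calm": 0.3},
    ("angry", "apology"): {"calm": 0.6, "angry": 0.4},
}

def prob_step(belief, event):
    """Propagate a distribution over states through one event."""
    new_belief = {}
    for state, p in belief.items():
        # Unlisted pairs stay in place with probability 1.
        outcomes = PROB_TRANSITIONS.get((state, event), {state: 1.0})
        for nxt, q in outcomes.items():
            new_belief[nxt] = new_belief.get(nxt, 0.0) + p * q
    return new_belief

belief = prob_step({"calm": 1.0}, "insult")   # mass split between angry and calm
belief = prob_step(belief, "apology")         # part of the angry mass returns to calm
```

Tracking a distribution instead of a single state lets a role-play system condition its reply on how likely each latent state is, which is one way to read the abstract's "uncertainty and variability".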
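[NLP-12] Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
[Quick read]: This paper addresses faithfulness hallucinations in small reasoning models (SRMs) that perform chain-of-thought (CoT) reasoning in resource-constrained settings, in particular unfaithful or incorrect derivations in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning paths when the final answer is correct. The key to the solution is Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), which rests on two mechanisms: step-level supervision through explicit faithfulness rewards from a process reward model, and an implicit truncated resampling strategy that generates contrastive signals from faithful reasoning prefixes. Together they reduce hallucinations and make the reasoning more faithful and reliable.
Link: https://arxiv.org/abs/2602.05897
Authors: Shuo Nie,Hexuan Deng,Chao Wang,Ruiyu Fang,Xuebo Liu,Shuangyong Song,Yu Li,Min Zhang,Xuelong Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: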
Abstract:As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at this https URL.
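[NLP-13] DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training
[Quick read]: This paper addresses noisy supervision and poor out-of-domain (OOD) generalization when training reinforcement learning (RL) systems in real-world environments, especially in the post-training of large language models (LLMs). Existing distributional RL methods improve robustness by modeling values with multiple quantile points, but they learn each quantile independently as a scalar, so the value representation is coarse, lacks fine-grained conditioning on state information, and struggles in complex or OOD scenarios. The key to the solution is DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control). Its core innovation is to extend value modeling from isolated quantile predictions to a continuous value flow field across time steps, capturing richer state information for more accurate advantage estimation. It also adds conditional risk control and consistency constraints along value flow trajectories, which stabilize training under noisy feedback and improve robustness and generalization.
Link: https://arxiv.org/abs/2602.05890
Authors: Dingwei Zhu,Zhiheng Xi,Shihan Dou,Jiahan Li,Chenhao Huang,Junjie Ye,Sixian Li,Mingxu Chai,Yuhui Wang,Yajie Yang,Ming Zhang,Jiazheng Zhang,Shichun Liu,Caishuang Huang,Yunke Zhang,Yuran Wang,Tao Gui,Xipeng Qiu,Qi Zhang,Xuanjing Huang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: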
Abstract:Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
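[NLP-14] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
[Quick read]: This paper addresses training instability, reward hacking, and lazy optimization when large language models (LLMs) are trained to generate high-quality GPU kernel code, with the aim of advancing scalable AI systems. The solution has three key parts. First, it designs KernelGYM, a distributed GPU environment that supports data collection from multi-turn interactions, reward hacking checks, and long-term reinforcement learning (RL) training. Second, it proposes Turn-level Reinforce-Leave-One-Out (TRLOO) to remove the policy gradient bias caused by self-inclusion and obtain unbiased advantage estimates. Third, it introduces Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS), which alleviate lazy optimization and improve training stability. Experiments show that the trained model reaches performance competitive with Claude-4.5-Sonnet on KernelBench, and that sequential test-time scaling further raises the share of kernels reaching at least a 1.2x speedup.
Link: https://arxiv.org/abs/2602.05885
Authors: Wei Liu,Jiawei Xu,Yingru Li,Longtao Zheng,Tianjian Li,Qian Liu,Junxian He
Affiliations: HKUST; TikTok; CUHK(SZ); NTU
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: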
Abstract:High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward hacking check, data collection from multi-turn interactions and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr. Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr. Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in this https URL.
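The "self-inclusion" issue mentioned in the abstract can be illustrated with a small sketch that contrasts a group-mean baseline, which includes each sample's own reward, with a leave-one-out baseline. This shows only the generic leave-one-out idea for a single group of rewards; how the paper applies it per turn is not described in the abstract, and the function names are hypothetical. Standard-deviation scaling is omitted for clarity.

```python
def group_mean_advantages(rewards):
    """Baseline is the mean over the whole group, including the sample itself."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def leave_one_out_advantages(rewards):
    """Baseline for each sample is the mean of the other samples only."""
    n, total = len(rewards), sum(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    return [r - (total - r) / (n - 1) for r in rewards]

# One successful rollout out of four (hypothetical rewards for one turn).
rewards = [1.0, 0.0, 0.0, 0.0]
with_self = group_mean_advantages(rewards)       # first entry is 0.75
without_self = leave_one_out_advantages(rewards) # first entry is 1.0
```

Because a sample's own reward no longer enters its baseline, the baseline is independent of the action being scored, which is the usual argument for why leave-one-out estimators avoid the bias.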
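[NLP-15] EuroLLM-22B: Technical Report
[Quick read]: This paper addresses the underrepresentation and limited support of European languages in existing open large language models, in particular the lack of coverage of the 24 official European Union languages and 11 additional languages. The key to the solution is EuroLLM-22B, a large language model trained from scratch, whose development covers tokenizer design, architectural specifications, data filtering, and training procedures. The model performs strongly on multilingual benchmarks for reasoning, instruction following, and translation, with results competitive with models of comparable size. The base and instruction-tuned models, the multilingual web pretraining data, the updated EuroBlocks instruction datasets, and the pre-training and evaluation codebases are released to support future research.
Link: https://arxiv.org/abs/2602.05879
Authors: Miguel Moura Ramos,Duarte M. Alves,Hippolyte Gisserot-Boukhlef,João Alves,Pedro Henrique Martins,Patrick Fernandes,José Pombal,Nuno M. Guerreiro,Ricardo Rei,Nicolas Boizard,Amin Farajian,Mateusz Klimaszewski,José G. C. de Souza,Barry Haddow,François Yvon,Pierre Colombo,Alexandra Birch,André F. T. Martins
Affiliations: Instituto Superior Técnico & Universidade de Lisboa (Lisbon ELLIS Unit); Instituto de Telecomunicações; MICS, CentraleSupélec, Université Paris-Saclay; Acolad; Carnegie Mellon University; University of Edinburgh; Diabolocom; Aveni; OutSystems; Sword Health; Sorbonne Université, CNRS, ISIR; Artefact Research Center; TransPerfect
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: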
Abstract:This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
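[NLP-16] xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection
[Quick read]: This paper addresses the poor generalization of current hate speech detection models, which overfit dataset-specific definitions and are sensitive to domain shift and annotation noise. Conventional methods treat hate speech detection as a single binary classification problem and overlook that it is a composite concept shaped by multiple interacting factors, so models behave inconsistently across legal frameworks, platform policies, and annotation guidelines. The key to the solution is xList-Hate, which reframes hate speech detection as a diagnostic reasoning task. Content is first assessed through a checklist of concept-level questions grounded in widely shared normative criteria; a large language model (LLM) answers each question independently, producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. A lightweight, fully interpretable decision tree then aggregates these signals into transparent, auditable predictions. This design improves cross-dataset robustness and tolerance of annotation inconsistency, and it provides fine-grained interpretability through explicit decision paths and factor-level analysis, offering an extensible, explainable alternative for content moderation.
Link: https://arxiv.org/abs/2602.05874
Authors: Adrián Girón,Pablo Miralles,Javier Huertas-Tato,Sergio D’Antonio,David Camacho
Affiliations: Universidad Politécnica de Madrid
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: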
Abstract:Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
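[NLP-17] Constrained Group Relative Policy Optimization
[Quick read]: This paper addresses how to introduce explicit behavioral constraints into critic-free policy optimization, specifically by extending Group Relative Policy Optimization (GRPO) to constrained settings. When advantage estimation treats each objective component independently, mismatched component-wise standard deviations distort the relative importance of the objective terms, which corrupts the Lagrangian signal and prevents stable constraint enforcement. The key to the solution is a scalarized advantage construction that handles reward and constraint terms together, preserving the intended trade-off during Lagrangian optimization and allowing violation rates to be controlled stably. Experiments in a toy gridworld and on robotics tasks show that the method improves constraint satisfaction while increasing task success, giving a simple and effective recipe for constrained policy optimization in embodied AI domains that rely on large foundation models.
Link: https://arxiv.org/abs/2602.05863
Authors: Roger Girgis,Rodrigue de Schaetzen,Luke Rowe,Azalée Robitaille,Christopher Pal,Liam Paull
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 16 pages, 6 figures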
Abstract:While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
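The distortion described in the abstract can be reproduced on a toy group of rollouts. The sketch below is one plausible reading, not the paper's exact formulation: the component-wise variant standardizes reward and cost separately before combining them with a multiplier `lam`, while the scalarized variant combines them first and standardizes once. All names and numbers are hypothetical.

```python
import statistics

def standardize(xs):
    """Group-relative z-scores; returns zeros if the group has no spread."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]

def componentwise_advantages(rewards, costs, lam):
    """Standardize each component separately, then combine (the naive treatment)."""
    r, c = standardize(rewards), standardize(costs)
    return [ri - lam * ci for ri, ci in zip(r, c)]

def scalarized_advantages(rewards, costs, lam):
    """Combine reward and cost into one scalar, then standardize once."""
    return standardize([r - lam * c for r, c in zip(rewards, costs)])

# One rollout earns a large reward (10) but incurs an indicator cost of 1.
rewards = [10.0, 0.0, 0.0, 0.0]
costs = [1.0, 0.0, 0.0, 0.0]
naive = componentwise_advantages(rewards, costs, lam=1.0)   # all (near) zero
scalar = scalarized_advantages(rewards, costs, lam=1.0)     # first entry positive
```

In the naive variant the separate normalization rescales reward and cost to the same magnitude, so the signal cancels even though the reward outweighs the penalty ten to one; the scalarized variant keeps that trade-off.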
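[NLP-18] DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
[Quick read]: This paper addresses the lack of mechanistic interpretability tools tailored to diffusion language models (DLMs), filling a gap in research on analyzing and intervening on their internal representations. The key to the solution is DLM-Scope, the first interpretability framework for DLMs based on sparse autoencoders (SAEs), together with evidence that trained Top-K SAEs can faithfully extract human-interpretable features. In particular, inserting SAEs affects DLMs differently from autoregressive large language models (LLMs): applied to early DLM layers, an SAE does not degrade performance and can even reduce cross-entropy loss. SAE features also enable more effective diffusion-time interventions, often outperforming LLM steering, provide useful signals for decoding order, and remain stable during post-training, opening new directions for understanding and applying DLMs.
Link: https://arxiv.org/abs/2602.05859
Authors: Xu Wang,Bingqing Jiang,Yu Wan,Baosong Yang,Lingpeng Kong,Difan Zou
Affiliations: The University of Hong Kong; Tongyi Lab, Alibaba Group Inc
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 23 pages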
Abstract:Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to the autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer certain new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order; and the SAE features are stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows a great potential of applying SAEs to DLM-related tasks and algorithms.
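[NLP-19] RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
[Quick read]: This paper addresses the quadratic-complexity (O(L^2)) bottleneck that attention imposes on large language models processing long contexts. Existing dynamic sparse attention methods offer input-adaptive efficiency but face fundamental trade-offs: they require preprocessing, lack global evaluation, violate query independence, or incur high computational overhead. The key to the solution is RRAttention, which uses a head round-robin sampling strategy: query sampling positions are rotated across attention heads within each stride, so query independence is preserved while stride-level aggregation enables efficient global pattern discovery. The method reduces complexity to O(L^2/S^2) and uses adaptive Top-τ selection to set sparsity. Experiments show that it recovers over 99% of full attention performance while computing only half of the attention blocks and achieves a 2.4x speedup at 128K context length, outperforming existing dynamic sparse attention methods.
Link: https://arxiv.org/abs/2602.05853
Authors: Siran Liu,Guoxia Wang,Sa Wang,Jinle Zeng,HaoYang Xie,Siyu Lou,JiaBin Yang,DianHai Yu,Haifeng Wang,Chao Yang
Affiliations: Baidu Inc.; Peking University
Subjects: Computation and Language (cs.CL)
Comments: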
Abstract:The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from O(L^2) to O(L^2/S^2) and employs adaptive Top-τ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4× speedup at 128K context length and outperforming existing dynamic sparse attention methods.
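The round-robin idea can be pictured with a very small sketch. It assumes that head h samples the query at offset h mod S inside every stride of length S; the abstract only says that sampling positions rotate across heads within each stride, so the exact offset rule, like the function name, is an assumption.

```python
def sampled_query_positions(seq_len, stride, head):
    """Query positions sampled by one head: one per stride, shifted by the head index."""
    offset = head % stride  # assumed rotation rule
    return list(range(offset, seq_len, stride))

# With stride 4, four heads together touch every query position exactly once,
# while each head scores only seq_len / stride queries.
seq_len, stride = 8, 4
per_head = [sampled_query_positions(seq_len, stride, h) for h in range(stride)]
covered = sorted(p for positions in per_head for p in positions)
```

Sampling one query per stride and aggregating keys per stride gives roughly (L/S) × (L/S) score evaluations per head, which matches the O(L^2/S^2) figure quoted in the abstract.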
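[NLP-20] DARWIN: Dynamic Agentically Rewriting Self-Improving Network
[Quick read]: This paper addresses how an evolutionary optimization strategy can improve the training efficiency and performance of generative AI models. Its core solution is DARWIN, an evolutionary GPT framework built on a genetic-algorithm-like structure: several independent GPT agents are trained individually with their own training code, and in each iteration the models are prompted to modify one another's training code in a mutation-like manner. The best-performing agents on the benchmark are then selected for the next iteration. DARWIN also keeps persistent JSON-based memory files that record reasoning and code changes, and it provides a bidirectional human-in-the-loop (HITL) interface for manual intervention. Under limited resources it achieved a 1.26 percent improvement in model FLOPS utilization (MFU) and a 2.07 percent improvement in perplexity, which the author presents as promising for scaling evolutionary GPT training.
Link: https://arxiv.org/abs/2602.05848
Authors: Henry Jiang
Affiliations: Georgia Institute of Technology
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, 2 tables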
Abstract:DARWIN is an evolutionary GPT model, utilizing a genetic-algorithm-like optimization structure with several independent GPT agents being trained individually using unique training code. Each iteration, the GPT models are prompted to modify the training code of one another in an attempt to improve their performance in a mutation-like manner, and the best GPT agents are then benchmarked and selected for the next iteration by genetic algorithm. For demonstration purposes and due to budget and time constraints, OpenAI API is used to prompt training code improvements and the nanoGPT framework is used as the training code. DARWIN also utilizes persistent JSON-based memory files to track previous reasoning and changes to code to correlate with improvement to model performance, and a bidirectional interface for HITL intervention allowing the model to request upgrades such as additional datasets, training scripts, and restructuring of file hierarchies. In experiments, DARWIN achieved a 1.26 percent improvement in model FLOPS utilization (MFU) and a 2.07 percent improvement to perplexity in 5 iterations of training over baseline configurations, demonstrating promising capabilities as a foundation for scaling evolutionary GPT training.
[NLP-21] OdysseyArena: Benchmarking Large Language Models For Long-Horizon Active and Inductive Interactions
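[Quick Read]: This paper addresses the inability of current large language model (LLM) agents to autonomously discover latent state-transition laws in complex environments. Existing evaluations mostly follow a deductive paradigm, relying on explicit rules and static goals within limited planning horizons, and so overlook an agent's ability to actively induce dynamics from experience, which is the basis for forward-looking decisions and long-term strategic coherence. The key to the solution is OdysseyArena, an evaluation framework centered on long-horizon, active, and inductive interaction. It formalizes and instantiates four basic interaction primitives that turn abstract transition dynamics into concrete environments, and builds OdysseyArena-Lite (a standardized benchmark of 120 tasks) and OdysseyArena-Challenge (a stress test at extreme interaction lengths, e.g., 200 steps) to measure inductive efficiency and long-horizon discovery. Experiments show that even frontier LLMs hit a performance bottleneck in inductive scenarios, exposing a key obstacle to autonomous discovery in complex environments.
Link: https://arxiv.org/abs/2602.05843
Authors: Fangzhi Xu,Hang Yan,Qiushi Sun,Jinyang Wu,Zixian Huang,Muye Huang,Jingyang Gong,Zichen Ding,Kanzhi Cheng,Yian Wang,Xinyu Che,Zeyi Sun,Jian Zhang,Zhangyue Yin,Haoran Luo,Xuanjing Huang,Ben Kao,Jun Liu,Qika Lin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 34 pages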
Abstract:The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent’s inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at this https URL
[NLP-22] Reinforcement World Model Learning for LLM-based Agents
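[Quick Read]: This paper targets the weak world-modeling ability of large language models (LLMs) in agentic settings, which leaves them poor at anticipating the consequences of actions and at adapting to environment dynamics. The key to the solution is Reinforcement World Model Learning (RWML), a self-supervised method that uses sim-to-real gap rewards so that the LLM learns an action-conditioned world model over textual states: the next state simulated by the model is aligned with the next state actually observed in the environment, keeping internal simulation consistent with real dynamics. Compared with next-state token prediction, RWML favors semantic equivalence over exact wording, which mitigates model collapse. It yields significant gains on ALFWorld and τ² Bench without human-annotated data and, when combined with task-success rewards, matches the performance of expert-data training.
Link: https://arxiv.org/abs/2602.05842
Authors: Xiao Yu,Baolin Peng,Ruize Xu,Yelong Shen,Pengcheng He,Suman Nath,Nikhil Singh,Jiangfeng Gao,Zhou Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: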
Abstract:Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and τ² Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and τ² Bench respectively, while matching the performance of expert-data training.
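A reward that compares the simulated and the realized next state in an embedding space, rather than token by token, might look roughly like the sketch below. The abstract does not specify the distance function or the embedding model, so cosine similarity and the bag-of-words `embed` function are assumptions; the bag-of-words vector is only a toy replacement for a pre-trained text embedding.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (stand-in for a pre-trained encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a)  # Counter yields 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sim_to_real_reward(simulated_state, realized_state):
    """Higher when the predicted next state is close to the observed one."""
    return cosine(embed(simulated_state), embed(realized_state))


if __name__ == "__main__":
    print(sim_to_real_reward("you open the drawer", "the drawer you open"))
    print(sim_to_real_reward("you open the drawer", "the lamp is on"))
```

An exact-match token objective would penalize the first pair for its word order, whereas a similarity computed on embeddings does not; that contrast is the point of the sketch, not the particular embedding used.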
[NLP-23] FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem
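[Quick Read]: This paper addresses the weak financial reasoning of general-purpose language models in Indian digital payment scenarios, especially in multi-turn dialogue, tool calling, and domain-specific tasks such as transaction dispute handling and mandate lifecycle management. The key to the solution is FiMI, a language model specialized for Indian finance and built on the Mistral Small 24B architecture through a multi-stage training pipeline: continuous pre-training on 68 billion tokens of curated financial data (English, Hindi, and Hinglish), followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations. In the reported experiments FiMI Base improves on Mistral Small 24B Base by 20% on a finance reasoning benchmark, and FiMI Instruct improves on Mistral Small 24B Instruct by 87% on domain-specific tool calling, while general-task performance stays comparable to models of similar size.
Link: https://arxiv.org/abs/2602.05794
Authors: Aboli Kathar,Aman Kumar,Anusha Kamath,Araveeti Srujan,Ashish Sharma,Chandra Bhushan,Dilip Asbe,Divya Sorate,Duddu Prasanth Kumar,Evan Acharya,Harsh Sharma,Hrithik Kadam,Kanishk Singla,Keyur Doshi,Kiran Praveen,Kolisetty Krishna SK,Krishanu Adhikary,Lokesh MPT,Mayurdeep Sonowal,Nadeem Shaikh,Navya Prakash,Nimit Kothari,Nitin Kukreja,Prashant Devadiga,Rakesh Paul,Ratanjeet Pratap Chauhan,Raunak Kalani,Raviraj Joshi,Shamanth MH,Shantanu Pandey,Shubham Soni,Siddharth Dixit,Smriti Jopat,Sunil Patel,Suraj Singh,Suvradip Paul,Tulasi Pilla,Utkarsh Vaidya,Vineeth Nambiar,Vishal Kanvaty,Yatharth Dedhia
Affiliations: National Payments Corporation of India
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: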
Abstract:We present FiMI (Finance Model for India), a domain-specialized financial language model developed for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on a finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.
[NLP-24] Bagging-Based Model Merging for Robust General Text Embeddings
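[Quick Read]: This paper studies how multi-task training of text embedding models can generalize better, particularly when adapting to new domains and in incremental learning. Existing practice gets strong results from batch-level shuffling, but it has two key limitations: weak out-of-domain (OOD) generalization, and poor support for efficient incremental learning because it requires full retraining. The paper proposes Bagging-based rObust mOdel Merging, whose core idea is to sample subsets of the training data, train several embedding models on them, and fuse them into a single model with a multi-granularity merging mechanism, which improves robustness while keeping single-model inference cost. The method also supports lightweight incremental updates: a small update model is trained on the new data together with a small historical subset and merged into the existing model, which greatly reduces the compute cost of incremental learning.
Link: https://arxiv.org/abs/2602.05787
Authors: Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Wenbo Yang,Daiting Shi,Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing; Querit Private Limited
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures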
Abstract:General-purpose text embedding models underpin a wide range of NLP and information retrieval applications, and are typically trained on large-scale multi-task corpora to encourage broad generalization. However, it remains unclear how different multi-task training strategies compare in practice, and how to efficiently adapt embedding models as new domains and data types continually emerge. In this work, we present a systematic study of multi-task training for text embeddings from two perspectives: data scheduling and model merging. We compare batch-level shuffling, sequential training variants, two-stage training, and multiple merging granularities, and find that simple batch-level shuffling consistently yields the strongest overall performance, suggesting that task conflicts are limited and training datasets are largely complementary. Despite its effectiveness, batch-level shuffling exhibits two practical limitations: suboptimal out-of-domain (OOD) generalization and poor suitability for incremental learning due to expensive full retraining. To address these issues, we propose Bagging-based rObust mOdel Merging, which trains multiple embedding models on sampled subsets and merges them into a single model, improving robustness while retaining single-model inference efficiency. Moreover, the proposed method naturally supports efficient incremental updates by training lightweight update models on new data with a small historical subset and merging them into the existing model. Experiments across diverse embedding benchmarks demonstrate that the proposed method consistently improves both in-domain and OOD performance over full-corpus batch-level shuffling, while substantially reducing training cost in incremental learning settings.
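The train-on-subsets-then-merge idea can be sketched with a deliberately tiny example. Uniform parameter averaging is an assumption here (the abstract mentions several merging granularities without detailing them), and `train` is a hypothetical callable that maps a data subset to a flat list of parameters.

```python
import random


def merge(models):
    """Merge models by uniform averaging of their parameters (assumed rule)."""
    count = len(models)
    return [sum(params) / count for params in zip(*models)]


def bagged_merge(data, train, num_models, subset_size, rng):
    """Train one model per sampled subset, then merge them into one model."""
    models = [train(rng.sample(data, subset_size)) for _ in range(num_models)]
    return merge(models)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    # Toy "training": the single parameter is the mean of the subset.
    toy_train = lambda subset: [sum(subset) / len(subset)]
    print(bagged_merge(data, toy_train, num_models=4, subset_size=3, rng=rng))
```

The merged result is a single parameter list, so inference cost is that of one model regardless of how many subset models were trained.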
[NLP-25] Different Time Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors EACL2026
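[Quick Read]: This paper examines whether generative-AI text detectors used in academic settings misclassify text written by non-native speakers, revisiting the earlier claim that such text is falsely flagged as AI-generated because of its low perplexity. The key step is a re-evaluation of existing detectors in a Czech-language setting: the perplexity of non-native speakers' texts turns out to be no lower than that of native speakers, and detectors from three different families show no systematic bias. The paper further shows that contemporary detectors work effectively without relying on perplexity.
Link: https://arxiv.org/abs/2602.05769
Authors: Adnan Al Ali,Jindřich Helcl,Jindřich Libovický
Affiliations: Charles University, Faculty of Mathematics and Physics; University of Oslo, Language Technology Group
Categories: Computation and Language (cs.CL)
Comments: This paper was accepted to EACL 2026 Student Research Workshop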
Abstract:LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
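For readers unfamiliar with the feature under discussion, perplexity is conventionally the exponential of the negative mean token log-probability under a language model. The sketch below applies that standard definition to a list of per-token log-probabilities; obtaining those values from an actual LLM is outside its scope, and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    # Every token assigned probability 0.5 by the (hypothetical) model.
    print(perplexity([math.log(0.5)] * 4))
```

Lower perplexity means the model finds the text more predictable, which is why earlier work linked low-perplexity essays to false "generated" flags.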
[NLP-26] LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
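[Quick Read]: This paper addresses the limited reasoning ability of large language models (LLMs) in long-context scenarios such as long-dialogue understanding and structured data analysis, where sparse, outcome-only reward signals are too coarse to guide complex reasoning. The key to the solution is LongR, a unified framework that introduces a dynamic "Think-and-Read" mechanism interleaving reasoning with document consultation, combined with a contextual density reward based on relative information gain that quantifies how much the relevant documents contribute to the reasoning. Together these improve the efficiency and accuracy of reasoning over long contexts.
Link: https://arxiv.org/abs/2602.05758
Authors: Bowen Ping,Zijun Chen,Yiyao Yu,Tingfeng Hui,Junchi Yan,Baobao Chang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: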
Abstract:Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic “Think-and-Read” mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model’s robustness against distractors.
[NLP-27] CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering
[Quick Read]: This paper addresses the inefficiency of multi-hop Retrieval-Augmented Generation (RAG) systems in practice: alternating between retrieval and reasoning leads to repeated large language model (LLM) calls, high token consumption, and inconsistent entity grounding across hops. The key to the solution is CompactRAG, which decouples knowledge processing into offline restructuring and online reasoning. Offline, an LLM reads the corpus once and converts it into an atomic QA knowledge base of fine-grained question-answer pairs. Online, the query is decomposed into sub-questions that are rewritten to keep entities consistent, and each is resolved by dense retrieval followed by RoBERTa-based answer extraction. The whole inference needs only two LLM calls (one for sub-question decomposition, one for final answer synthesis), which substantially lowers compute cost and improves stability.
Link: https://arxiv.org/abs/2602.05728
Authors: Hao Yang,Zhiyu Yang,Xupeng Zhang,Wei Wei,Yunjie Zhang,Lin Yang
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Isoftstone Information Technology (Group) Co.,Ltd.; Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas; College of Electronic and Information Engineering, Tongji University; School of Electronic Information, Central South University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.
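The two-call structure of the online stage can be illustrated with a toy pipeline. Everything below is a stand-in rather than CompactRAG's code: `StubLLM` is a hypothetical client that only counts calls, word-overlap lookup replaces dense retrieval plus RoBERTa-based extraction, and the `#k` placeholder convention for rewriting sub-questions is an assumption.

```python
class StubLLM:
    """Hypothetical stand-in for an LLM client; counts how often it is called."""

    def __init__(self, decompose, synthesize):
        self.calls = 0
        self._decompose = decompose
        self._synthesize = synthesize

    def decompose(self, question):
        self.calls += 1
        return self._decompose(question)

    def synthesize(self, question, answers):
        self.calls += 1
        return self._synthesize(question, answers)


def lookup(sub_question, qa_base):
    """Return the stored answer whose question shares the most words."""
    words = set(sub_question.lower().split())
    best = max(qa_base, key=lambda q: len(words & set(q.lower().split())))
    return qa_base[best]


def answer(question, llm, qa_base):
    sub_questions = llm.decompose(question)            # LLM call 1
    answers = []
    for sub_question in sub_questions:
        # Assumed rewrite step: fill "#k" with the k-th earlier answer so
        # later hops refer to a concrete entity.
        for k, previous in enumerate(answers, start=1):
            sub_question = sub_question.replace(f"#{k}", previous)
        answers.append(lookup(sub_question, qa_base))  # no LLM involved
    return llm.synthesize(question, answers)           # LLM call 2
```

However many sub-questions the decomposition yields, the loop itself never touches the LLM, which is what keeps the call count at two.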
[NLP-28] OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
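[Quick Read]: This paper addresses the inherent trade-off in Mixture-of-Experts (MoE) architectures between finer expert granularity and hardware execution efficiency. As MoE moves toward finer granularity for better parameter efficiency, existing designs face sharply rising routing complexity and memory-access bottlenecks as the number of experts grows, which noticeably increases inference latency. The key to the solution is OmniMoE, a system-algorithm co-design with two parts: a Cartesian Product Router that decomposes the huge index space and reduces routing complexity from O(N) to O(√N), and Expert-Centric Scheduling that inverts the execution order so scattered memory accesses become efficient dense matrix operations, improving hardware utilization. With this design, fine-grained Atomic Experts can keep high capacity while delivering low-latency, accurate inference.
Link: https://arxiv.org/abs/2602.05711
Authors: Jingze Shi,Zhangyang Peng,Yizhang Zhu,Yifan Wu,Guang Liu,Yuyu Luo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: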
Abstract:Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at this https URL.
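One way an index space of N experts can be decomposed so that routing touches on the order of √N scores is the product-key pattern: arrange experts on a √N by √N grid, score rows and columns separately, and combine only the best of each. The sketch below follows that pattern as an assumption; the abstract does not say that OmniMoE's Cartesian Product Router works exactly this way, and additive row-plus-column scores are likewise assumed.

```python
import heapq
import itertools


def cartesian_route(row_scores, col_scores, k):
    """Pick the top-k experts on a grid with score = row score + column score.

    Only len(row_scores) + len(col_scores) sub-scores and k * k combined
    candidates are examined, instead of every expert on the grid.
    """
    n_cols = len(col_scores)
    top_rows = heapq.nlargest(k, range(len(row_scores)), key=row_scores.__getitem__)
    top_cols = heapq.nlargest(k, range(n_cols), key=col_scores.__getitem__)
    candidates = [
        (row_scores[r] + col_scores[c], r * n_cols + c)  # (score, flat expert id)
        for r, c in itertools.product(top_rows, top_cols)
    ]
    return [expert for _, expert in heapq.nlargest(k, candidates)]


if __name__ == "__main__":
    print(cartesian_route([0.1, 0.9, 0.3], [0.5, 0.2, 0.8], k=2))
```

With additive scores, any expert in the global top-k must lie in a top-k row and a top-k column, so restricting the search to the k by k candidate set does not change the result.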
[NLP-29] Ethology of Latent Spaces
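[Quick Read]: This paper asks whether the latent spaces of vision language models (VLMs) are neutral, as commonly assumed, or whether their internal structure carries algorithmic sensitivities shaped by training data and architectural choices. It finds that different VLMs diverge markedly when assigning political and cultural categories to artworks: SigLIP classifies 59.4% of the artworks as politically engaged, whereas OpenCLIP identifies only 4%, and African masks receive very different political scores across models. The key contribution is three operational concepts: computational latent politicization, where political categories emerge without being intentionally encoded; emergent bias, which cannot be explained by statistical or normative bias and is detectable only through contrastive analysis; and three algorithmic scopic regimes, entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), each defining a distinct mode of visibility. The findings show how training datasets act as quasi-archives whose discursive formations crystallize in latent space, and call for critical methodologies that account for learning architectures whenever cultural interpretation is delegated to algorithmic agents, for example in digital art history.
Link: https://arxiv.org/abs/2602.05710
Authors: Philippe Boisnard
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 14 figures, presented at Hyperheritage International Symposium 9 ( this https URL ) and accepted for publication in double-blind peer review in French in 2026-2027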
Abstract:This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices. Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th century), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while remaining apolitical in OpenAI CLIP. On an aesthetic colonial axis, inter-model discrepancies reach 72.6 percentage points. We introduce three operational concepts: computational latent politicization, describing the emergence of political categories without intentional encoding; emergent bias, irreducible to statistical or normative bias and detectable only through contrastive analysis; and three algorithmic scopic regimes: entropic (LAION), institutional (OpenAI), and semiotic (SigLIP), which structure distinct modes of visibility. Drawing on Foucault’s notion of the archive, Jameson’s ideologeme, and Simondon’s theory of individuation, we argue that training datasets function as quasi-archives whose discursive formations crystallize within latent space. This work contributes to a critical reassessment of the conditions under which VLMs are applied to digital art history and calls for methodologies that integrate learning architectures into any delegation of cultural interpretation to algorithmic agents.
[NLP-30] Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration
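[Quick Read]: This paper addresses the inefficiency of existing Retrieval-Augmented Generation (RAG) pipelines on large-scale entity matching, where retrieval and generation overhead is too high. The core solution is CE-RAG4EM, a cost-efficient RAG architecture that cuts computation through blocking-based batch retrieval and generation. The paper also introduces a unified evaluation framework for analyzing how blocking-aware optimizations and retrieval granularity affect system behavior, so that end-to-end runtime drops substantially while matching quality is preserved or improved.
Link: https://arxiv.org/abs/2602.05708
Authors: Chuangtao Ma,Zeyu Zhang,Arijit Khan,Sebastian Schelter,Paul Groth
Affiliations: Aalborg University; University of Amsterdam; Bowling Green State University; BIFOLD & TU Berlin; Stanford University
Categories: Databases (cs.DB); Computation and Language (cs.CL)
Comments: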
Abstract:Retrieval-augmented generation (RAG) enhances LLM reasoning in knowledge-intensive tasks, but existing RAG pipelines incur substantial retrieval and generation overhead when applied to large-scale entity matching. To address this limitation, we introduce CE-RAG4EM, a cost-efficient RAG architecture that reduces computation through blocking-based batch retrieval and generation. We also present a unified framework for analyzing and evaluating RAG systems for entity matching, focusing on blocking-aware optimizations and retrieval granularity. Extensive experiments suggest that CE-RAG4EM can achieve comparable or improved matching quality while substantially reducing end-to-end runtime relative to strong baselines. Our analysis further reveals that key configuration parameters introduce an inherent trade-off between performance and overhead, offering practical guidance for designing efficient and scalable RAG systems for entity matching and data integration.
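Blocking-based batching can be pictured as grouping candidate pairs by a cheap key and then paying for retrieval and generation once per group instead of once per pair. The sketch below shows only that control flow; `block_key`, `retrieve`, and `generate` are hypothetical callables, and the way CE-RAG4EM actually forms blocks or batches prompts is not described in the abstract.

```python
from collections import defaultdict


def match_with_blocking(pairs, block_key, retrieve, generate):
    """Label record pairs, issuing one retrieval and one generation per block."""
    blocks = defaultdict(list)
    for pair in pairs:
        blocks[block_key(pair)].append(pair)

    labels = {}
    for key, block in blocks.items():
        context = retrieve(key)               # one retrieval per block
        decisions = generate(block, context)  # one generation call per block
        for pair, decision in zip(block, decisions):
            labels[pair] = decision
    return labels
```

With P pairs spread over B blocks, this issues B retrieval and B generation calls instead of P of each, which is where the runtime saving in such a scheme would come from.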
[NLP-31] Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation AAAI2026
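[Quick Read]: This paper addresses the domain adaptation difficulties that large language models (LLMs) face in Multi-domain Machine Translation (MDMT), including domain shift, parameter interference, and limited generalization. The key to the solution is a neuron-efficient fine-tuning framework that identifies and updates consensus-aligned neurons, selected by maximizing the mutual information between neuron behavior and domain features, so that the LLM captures both transferable translation patterns and domain-specific nuances. Fine-tuning guided by these neurons mitigates parameter interference and domain-specific overfitting. Across several cross-lingual translation tasks the method outperforms strong Parameter-Efficient Fine-Tuning (PEFT) baselines and achieves the best results on both seen and unseen domains.
Link: https://arxiv.org/abs/2602.05694
Authors: Shuting Jiang,Ran Song,Yuxin Huang,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2026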
Abstract:Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation still remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains evidence that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
[NLP-32] MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
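[Quick Read]: This paper addresses the difficulty of detecting and correcting errors in clinical text, especially the limited accuracy of large language models (LLMs) in multilingual medical settings. The key contribution is MedErrBench, the first multilingual benchmark for clinical error detection, localization, and correction. It covers English, Arabic, and Chinese, is built from natural clinical cases annotated and reviewed by clinical experts, and defines ten common error types. A systematic evaluation of general-purpose, language-specific, and medical-domain language models on the benchmark reveals notable performance gaps in non-English settings, underscoring the need for clinically grounded, language-aware AI systems and supporting safer and more equitable AI-assisted healthcare worldwide.
Link: https://arxiv.org/abs/2602.05692
Authors: Congbo Ma,Yichun Zhang,Yousef Al-Jazzazi,Ahamed Foisal,Laasya Sharma,Yousra Sadqi,Khaled Saleh,Jihad Mallat,Farah E. Shamout
Affiliations: New York University Abu Dhabi; New York University; University of Birmingham; Cleveland Clinic Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments: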
Abstract:Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly-available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: this https URL.
[NLP-33] Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
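[Quick Read]: This paper examines how tokenization strategy affects the ability of transformer models to represent languages with complex verbal inflectional morphology. It focuses on Turkish, whose morphological markers are transparent, and Modern Hebrew, which has non-concatenative morphology. The key findings are that for Turkish both monolingual and multilingual models succeed, whether tokenization is atomic or splits words into subword units; for Hebrew, a multilingual model with character-level tokenization fails to capture the non-concatenative structure, while a monolingual model with morpheme-aware segmentation performs well. This indicates that choosing a tokenization strategy suited to a language's morphology is central to improving a model's handling of complex inflection.
Link: https://arxiv.org/abs/2602.05648
Authors: Giuseppe Samo,Paola Merlo
Affiliations: Idiap Research Institute; University of Geneva
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 7 figures, to appear as proceedings of the SIGTURK 2026 Workshop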
Abstract:We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.
[NLP-34] Generative Ontology: When Structured Knowledge Learns to Create
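[Quick Read]: This paper addresses a pair of complementary weaknesses: traditional ontologies cannot generate novel artifacts, while large language models (LLMs) are creative but lack structural validity, often producing hallucinated outputs such as mechanisms without matching components or goals without end conditions. The key to the solution is the Generative Ontology framework, which encodes domain knowledge as executable Pydantic schemas and constrains LLM generation through DSPy signatures, so that the ontology supplies the grammar that keeps outputs consistent with domain logic while the LLM supplies the creativity. The framework adds a multi-agent pipeline in which different roles (a Mechanics Architect, a Theme Weaver, a Balance Critic) each carry a professional "anxiety" that guards against shallow output, and combines retrieval-augmented generation with iterative validation so that new designs are grounded in existing exemplars while remaining original.
Link: https://arxiv.org/abs/2602.05636
Authors: Benny Cheung
Affiliations: Dynamind Research
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 6 figures, 6 tables. Code available at this https URL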
Abstract:Traditional ontologies excel at describing domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs that lack structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework that synthesizes these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. A multi-agent pipeline assigns specialized roles to different ontology domains: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits. Each agent carries a professional “anxiety” that prevents shallow, agreeable outputs. Retrieval-augmented generation grounds novel designs in precedents from existing exemplars, while iterative validation ensures coherence between mechanisms and components. We demonstrate the framework through GameGrammar, a system for generating complete tabletop game designs. Given a thematic prompt (“bioluminescent fungi competing in a cave ecosystem”), the pipeline produces structurally complete, playable game specifications with mechanisms, components, victory conditions, and setup instructions. These outputs satisfy ontological constraints while remaining genuinely creative. The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars (music composition, software architecture, culinary arts) is a candidate for Generative Ontology. We argue that constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible.
Submission history: [v1] Thu, 5 Feb 2026 13:14:20 UTC
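The kind of structural check the abstract alludes to (no mechanisms without components, no goals without end conditions) can be sketched with plain dataclasses. This is not the paper's schema: the class and field names are hypothetical, and standard-library dataclasses with a hand-written `validate` function stand in for the Pydantic schemas and DSPy signatures the authors describe.

```python
from dataclasses import dataclass


@dataclass
class Mechanism:
    name: str
    uses_components: list


@dataclass
class GameDesign:
    components: list
    mechanisms: list
    victory_condition: str


def validate(design):
    """Return a list of structural problems; an empty list means valid."""
    errors = []
    known = set(design.components)
    for mechanism in design.mechanisms:
        missing = [c for c in mechanism.uses_components if c not in known]
        if missing:
            errors.append(
                f"mechanism {mechanism.name!r} uses undefined components: {missing}"
            )
    if not design.victory_condition.strip():
        errors.append("design has no victory condition")
    return errors
```

In a generate-then-validate loop, a non-empty error list would be fed back to the generator for another attempt; that loop is omitted here.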
[NLP-35] CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models
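[Quick Read]: This paper addresses the way large language models (LLMs) used for personalized education tend to give homogeneous responses that ignore individual differences among students, creating potential safety risks for vulnerable groups. Existing safety evaluations rely mostly on context-independent metrics (such as factual accuracy, bias, or toxicity) and cannot capture the different harms the same response may cause for students with different attributes. The key to the solution is the concept of Student-Tailored Personalized Safety and a benchmark grounded in educational theory, CASTLE, which covers 15 educational safety risks and 14 student attributes across 92,908 bilingual scenarios. Three core metrics (Risk Sensitivity, Emotional Empathy, and Student Alignment) are designed to measure model performance on personalized safety systematically.
Link: https://arxiv.org/abs/2602.05633
Authors: Rui Jia,Ruiyi Lan,Fengrui Liu,Zhongxiang Dai,Bo Jiang,Jing Shao,Jingyuan Chen,Guandong Xu,Fei Wu,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: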
Abstract:Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological characteristics, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.
[NLP-36] Rewards as Labels: Revisiting RLVR from a Classification Perspective
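【Quick read】: This paper addresses two problems in reinforcement learning with verifiable rewards (RLVR): Gradient Misassignment in Positives and Gradient Domination in Negatives, which make policy updates inefficient and suboptimal. The key to the solution is the Rewards as Labels (REAL) framework, which treats verifiable rewards as categorical labels rather than scalar weights, thereby recasting policy optimization as a classification task, and introduces anchor logits to strengthen policy learning. This design induces a monotonic and bounded gradient weighting that balances gradient allocation across rollouts and mitigates both gradient mismatches.
Link: https://arxiv.org/abs/2602.05630
Authors: Zepeng Zhai,Meilin Chen,Jiaxuan Zhao,Junlang Qian,Lei Shen,Yuan Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 5 figures, 4 tables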
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains extend to the 7B model, where REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.
[NLP-37] AI chatbots versus human healthcare professionals: a systematic review and meta-analysis of empathy in patient care
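【Quick read】: This paper addresses the debate over how AI chatbots and human healthcare professionals (HCPs) differ in empathic communication in patient care. A systematic review of 15 comparative studies published in 2023-2024, 13 of which were pooled in a meta-analysis, finds that chatbots built on large language models (ChatGPT-3.5/4) are generally rated as more empathic than human HCPs in text-based interactions, with a standardized mean difference of 0.87 (95% CI: 0.54-1.20), roughly a two-point increase on a 10-point scale. The key to the approach is rigorous study selection, risk-of-bias assessment (ROBINS-I), and random-effects synthesis that avoids double counting, yielding evidence that AI can be perceived as more empathic than humans in specific settings.
Link: https://arxiv.org/abs/2602.05628
Authors: Alastair Howcroft,Amber Bennett-Weston,Ahmad Khan,Joseff Griffiths,Simon Gay,Jeremy Howick
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Open Access Invited Review. Systematic review and meta analysis of 15 studies 2023-2024. Published 20 October 2025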
Abstract:Background: Empathy is widely recognized for improving patient outcomes, including reduced pain and anxiety and improved satisfaction, and its absence can cause harm. Meanwhile, use of artificial intelligence (AI)-based chatbots in healthcare is rapidly expanding, with one in five general practitioners using generative AI to assist with tasks such as writing letters. Some studies suggest AI chatbots can outperform human healthcare professionals (HCPs) in empathy, though findings are mixed and lack synthesis. Sources of data: We searched multiple databases for studies comparing AI chatbots using large language models with human HCPs on empathy measures. We assessed risk of bias with ROBINS-I and synthesized findings using random-effects meta-analysis where feasible, whilst avoiding double counting. Areas of agreement: We identified 15 studies (2023-2024). Thirteen studies reported statistically significantly higher empathy ratings for AI, with only two studies situated in dermatology favouring human responses. Of the 15 studies, 13 provided extractable data and were suitable for pooling. Meta-analysis of those 13 studies, all utilising ChatGPT-3.5/4, showed a standardized mean difference of 0.87 (95% CI, 0.54-1.20) favouring AI (P < .00001), roughly equivalent to a two-point increase on a 10-point scale. Areas of controversy: Studies relied on text-based assessments that overlook non-verbal cues and evaluated empathy through proxy raters. Growing points: Our findings indicate that, in text-only scenarios, AI chatbots are frequently perceived as more empathic than human HCPs. Areas timely for developing research: Future research should validate these findings with direct patient evaluations and assess whether emerging voice-enabled AI systems can deliver similar empathic advantages.
MSC classes: 62-07, 62P15. ACM classes: I.2.7; J.3; H.1.2; H.5.2.
Journal reference: British Medical Bulletin, Volume 156, Issue 1, December 2025, ldaf017. Related DOI: https://doi.org/10.1093/bmb/ldaf017
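The pooled effect reported in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 95% interval is a symmetric normal-approximation interval (the abstract does not state how it was built) and recovers the implied standard error, z statistic, and two-sided p-value. The helper names are illustrative, not taken from the paper.

```python
import math

def se_from_ci(lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2.0 * z_crit)

def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Values reported in the abstract
smd, lower, upper = 0.87, 0.54, 1.20

se = se_from_ci(lower, upper)   # implied standard error of the pooled SMD
z = smd / se                    # test statistic under the normal approximation
p = two_sided_p(z)

# Rating standard deviation (in scale points) that would turn an SMD of 0.87
# into the "two-point increase" quoted in the abstract
implied_sd = 2.0 / smd

print(f"SE = {se:.3f}, z = {z:.2f}, p = {p:.1e}, implied SD = {implied_sd:.2f}")
```

Under these assumptions the p-value falls well below .00001, and the two-point reading corresponds to a rating standard deviation of about 2.3 points on the 10-point scale.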
[NLP-38] BhashaSetu: Cross-Lingual Knowledge Transfer from High-Resource to Extreme Low-Resource Languages AACL
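【Quick read】: This paper addresses the large performance gap between low-resource and high-resource languages in NLP tasks, which stems from data scarcity and insufficient linguistic resources. Its core approach is cross-lingual knowledge transfer, which uses the corpora and model capabilities of high-resource languages to improve low-resource ones. The key innovation is GETR (Graph-Enhanced Token Representation), a graph neural network (GNN) based method that builds a structured graph of relations between tokens across languages to enrich token representations and capture cross-lingual semantic alignment more effectively. Experiments show that it significantly outperforms existing baselines, with gains of 13 percentage points on truly low-resource languages (Mizo, Khasi) for POS tagging and of 20 and 27 percentage points in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER, respectively.
Link: https://arxiv.org/abs/2602.05599
Authors: Subhadip Maji,Arnab Bhattacharya
Affiliations: Indian Institute of Technology Kanpur
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted as a long paper at IJCNLP-AACL Main Conference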
Abstract:Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performances typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation) for cross-lingual knowledge transfer along with two adopted baselines (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) across sentiment classification and NER tasks respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
[NLP-39] ArkTS-CodeSearch: A Open-Source ArkTS Dataset for Code Retrieval
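【Quick read】: This paper addresses the lack of public datasets and evaluation benchmarks for code intelligence research on ArkTS, a core programming language of OpenHarmony, which has held back tasks such as code retrieval and code understanding. The key to the solution is a large-scale ArkTS dataset built from open-source repositories on GitHub and Gitee: comment-function pairs are extracted with tree-sitter-arkts, followed by cross-platform deduplication and a statistical analysis of function types. The authors also evaluate existing open-source code embedding models on a single-search task and fine-tune on ArkTS and TypeScript training data, obtaining a high-performing ArkTS code understanding model and establishing the first systematic benchmark for ArkTS code retrieval.
Link: https://arxiv.org/abs/2602.05550
Authors: Yulong He,Artem Ermakov,Sergey Kovalchuk,Artem Aliev,Dmitry Shalymov
Affiliations: St. Petersburg State University; ITMO University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: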
Abstract:ArkTS is a core programming language in the OpenHarmony ecosystem, yet research on ArkTS code intelligence is hindered by the lack of public datasets and evaluation benchmarks. This paper presents a large-scale ArkTS dataset constructed from open-source repositories, targeting code retrieval and code evaluation tasks. We design a single-search task, where natural language comments are used to retrieve corresponding ArkTS functions. ArkTS repositories are crawled from GitHub and Gitee, and comment-function pairs are extracted using tree-sitter-arkts, followed by cross-platform deduplication and statistical analysis of ArkTS function types. We further evaluate all existing open-source code embedding models on the single-search task and perform fine-tuning using both ArkTS and TypeScript training datasets, resulting in a high-performing model for ArkTS code understanding. This work establishes the first systematic benchmark for ArkTS code retrieval. Both the dataset and our fine-tuned model will be released publicly and are available at this https URL and this https URL.
[NLP-40] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
【Quick read】: This paper addresses imbalanced performance in multi-task reinforcement learning (RL) post-training: with policy-gradient methods such as GRPO, some tasks can dominate optimization while others stagnate, and differences between tasks in how often prompts yield zero advantages further distort each task's effective contribution to the optimization signal. The key to the solution is a new Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler so that task-wise policy gradients reflect the adapted weights, giving fairer and more efficient multi-task optimization.
Link: https://arxiv.org/abs/2602.05547
Authors: Shyam Sundhar Ramesh,Xiaotong Ji,Matthieu Zimmer,Sangwoong Yoon,Zhiyong Wang,Haitham Bou Ammar,Aurelien Lucchi,Ilija Bogunovic
Affiliations: UCL Department of EEE; UCL Centre for AI; Huawei Noah’s Ark Lab; UNIST Graduate School of AI; University of Edinburgh; University of Basel
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Preprint
Abstract:RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
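The "zero advantages (and thus zero gradients)" point can be made concrete with the usual group-relative advantage used in GRPO-style training, where each rollout's reward is normalized by the mean and standard deviation of its prompt group. This is a minimal sketch under that standard formulation; the task data and the `informative_fraction` helper are hypothetical and are not the paper's sampler or weighting rule.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt group (mean 0, unit scale)."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def informative_fraction(groups):
    """Share of prompt groups whose rollouts do not all receive the same reward."""
    informative = sum(1 for g in groups if len(set(g)) > 1)
    return informative / len(groups)

# Hypothetical binary rewards: 3 prompts per task, 4 rollouts per prompt
easy_task = [[1, 1, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1]]
hard_task = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

# A group with identical rewards has all-zero advantages and adds no gradient
print(group_relative_advantages([0, 0, 0, 0]))
print(group_relative_advantages([1, 0, 0, 1]))

# Tasks differ in how many prompts still carry a learning signal
print(informative_fraction(easy_task), informative_fraction(hard_task))
```

In this toy setup only one of three prompts in the easy task contributes a gradient, against two of three in the hard task, so equal nominal task weights would not translate into equal effective contributions.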
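[NLP-41] Steering Large Reasoning Models towards Concise Reasoning via Flow Matching
【Quick read】: This paper addresses the inefficiency of Large Reasoning Models (LRMs) caused by overly verbose outputs on complex reasoning tasks. Existing methods typically apply a single global vector as a linear shift to hidden representations and, constrained by the linear representation hypothesis, struggle to provide fine-grained control. The paper proposes FlowSteer, a nonlinear steering method that uses Flow Matching to learn the full transformation between the activation distributions of verbose and concise reasoning, modeled as a velocity field, which enables precise, input-dependent control. By modeling full distributional transport with generative techniques instead of a uniform linear shift, the method yields more compact reasoning and better token efficiency while maintaining strong task performance.
Link: https://arxiv.org/abs/2602.05539
Authors: Yawei Li,Benjamin Bergner,Yinghan Zhao,Vihang Prakash Patil,Bei Chen,Cheng Wang
Affiliations: LMU Munich; Amazon
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: This paper has been accepted to Transactions on Machine Learning Research (TMLR)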
Abstract:Large Reasoning Models (LRMs) excel at complex reasoning tasks, but their efficiency is often hampered by overly verbose outputs. Prior steering methods attempt to address this issue by applying a single, global vector to hidden representations – an approach grounded in the restrictive linear representation hypothesis. In this work, we introduce FlowSteer, a nonlinear steering method that goes beyond uniform linear shifts by learning a complete transformation between the distributions associated with verbose and concise reasoning. This transformation is learned via Flow Matching as a velocity field, enabling precise, input-dependent control over the model’s reasoning process. By aligning steered representations with the distribution of concise-reasoning activations, FlowSteer yields more compact reasoning than the linear shifts. Across diverse reasoning benchmarks, FlowSteer demonstrates strong task performance and token efficiency compared to leading inference-time baselines. Our work demonstrates that modeling the full distributional transport with generative techniques offers a more effective and principled foundation for controlling LRMs.
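To illustrate what "a velocity field" means here, the sketch below shows the generic flow-matching ingredients under a straight-line interpolation path, which is a common choice and an assumption on our part (the abstract does not say which path FlowSteer uses). The closed-form toy field stands in for a trained network, and all names are illustrative.

```python
def fm_training_example(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, and its target velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def euler_integrate(x, velocity_fn, steps=10):
    """Move x along the velocity field from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for a learned field: carries any start point to one "concise" target
concise_target = [0.5, -1.0]

def toy_velocity(x, t):
    return [(c - xi) / (1.0 - t) for c, xi in zip(concise_target, x)]

verbose_a = [2.0, 3.0]
verbose_b = [-1.0, 0.0]

# The velocity depends on the input, unlike a single global steering vector
print(toy_velocity(verbose_a, 0.0), toy_velocity(verbose_b, 0.0))
print(euler_integrate(verbose_a, toy_velocity), euler_integrate(verbose_b, toy_velocity))
```

The two start points receive different velocities yet end at the same target, which is the kind of input-dependent behavior a single additive vector cannot express.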
[NLP-42] When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
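【Quick read】: This paper addresses over-counting in model merging: when tasks share aligned spectral directions, a simple linear combination accumulates those directions repeatedly, inflating the singular values and biasing the merged model toward the shared subspace, which hurts performance. The key to the solution is Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales the inflated singular values to restore a balanced spectrum. Experiments on vision and language benchmarks show that SVC consistently improves existing merging baselines and raises the performance of Task Arithmetic by 13.0%.
Link: https://arxiv.org/abs/2602.05536
Authors: Yayuan Li,Ze Peng,Jian Zhang,Jintao Guo,Yue Duan,Yinghuan Shi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: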
Abstract:Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: this https URL.
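The inflation effect described in the abstract is easy to reproduce on toy matrices. The sketch below builds rank-one "task updates" with unit singular value and compares the merged spectrum when two tasks share a singular direction against the case where their directions are orthogonal. It only demonstrates the failure mode; it does not implement SVC, whose calibration rule is not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
left, _ = np.linalg.qr(rng.standard_normal((8, 2)))    # two orthonormal output directions
right, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # two orthonormal input directions

def top_singular_value(matrix):
    """Largest singular value of a weight update."""
    return np.linalg.svd(matrix, compute_uv=False)[0]

update_a = np.outer(left[:, 0], right[:, 0])           # task A, singular value 1
update_b_shared = np.outer(left[:, 0], right[:, 0])    # task B reuses A's direction
update_b_disjoint = np.outer(left[:, 1], right[:, 1])  # task B uses an orthogonal direction

shared_merge = top_singular_value(update_a + update_b_shared)
disjoint_merge = top_singular_value(update_a + update_b_disjoint)
print(shared_merge, disjoint_merge)
```

Adding the two updates doubles the top singular value when the direction is shared and leaves it at one when the directions are orthogonal, which is the over-accumulation that a rescaling step would need to undo.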
[NLP-43] A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma
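【Quick read】: This paper addresses the limits on model training caused by the lack of structured, high-quality multimodal datasets for AI in maxillofacial pathology, in particular the limited coverage and inconsistent formats of ameloblastoma cases. The key to the solution is a new multimodal dataset focused on ameloblastoma that integrates annotated radiological, histopathological, and intraoral clinical images, with structured clinical features extracted from case reports using natural language processing (NLP). Combined with domain-specific image preprocessing and augmentation, this supports a multimodal deep learning model that accepts clinical inputs (such as presenting complaint, age, and gender), raising variant classification accuracy from 46.2% to 65.9% and the abnormal tissue detection F1-score from 43.0% to 90.3%, and enabling more precise, personalized decision support.
Link: https://arxiv.org/abs/2602.05515
Authors: Ajo Babu George,Anna Mariam John,Athul Anoop,Balu Bhasuran
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: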
Abstract:Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset specifically focused on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements; variant classification accuracy increased from 46.2 percent to 65.9 percent, and abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources like MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.
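[NLP-44] A Human-in-the-Loop LLM-Centered Architecture for Knowledge-Graph Question Answering
【Quick read】: This paper addresses the limitations of Large Language Models (LLMs) in knowledge-intensive domains caused by hallucinations, outdated information, and limited explainability, as well as the weakness of text-based retrieval-augmented generation (RAG) on multi-hop reasoning. The key to the solution is an interactive framework in which the LLM generates and explains Cypher graph queries and users refine them iteratively in natural language, improving access to complex Knowledge Graphs (KGs) while preserving factual accuracy and semantic rigor and making model output more explainable and controllable.
Link: https://arxiv.org/abs/2602.05512
Authors: Larissa Pusch,Alexandre Courtiol,Tim Conrad
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: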
Abstract:Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require a knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
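[NLP-45] Transport and Merge: Cross-Architecture Merging for Large Language Models
【Quick read】: This paper addresses how to transfer knowledge from large, high-resource Large Language Models (LLMs) to smaller target models that have a different architecture and scarce data. Existing model merging methods usually assume architecture-compatible source and target models and cannot be applied directly in heterogeneous settings. The key to the solution is a cross-architecture merging framework based on Optimal Transport (OT): it aligns activations to infer neuron correspondences between heterogeneous models, and the resulting transport plans guide direct weight-space fusion, enabling effective high-resource to low-resource transfer from only a small set of inputs.
Link: https://arxiv.org/abs/2602.05495
Authors: Chenhang Cui,Binyun Yang,Fei Shen,Yuxin Chen,Jingnan Zheng,Xiang Wang,An Zhang,Tat-Seng Chua
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: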
Abstract:Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
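One way to picture "aligning activations to infer cross-neuron correspondences" is an entropic optimal transport problem between the neurons of two layers of different width. The sketch below is an assumption-heavy illustration: the cost (one minus activation correlation), the uniform marginals, the Sinkhorn solver, and the toy data are all our choices, not details confirmed by the abstract, and the weight-fusion step is left out.

```python
import numpy as np

def zscore(x):
    """Standardize each neuron (column) across the probe inputs."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropic OT plan between uniform marginals over source and target neurons."""
    n_src, n_tgt = cost.shape
    a = np.full(n_src, 1.0 / n_src)
    b = np.full(n_tgt, 1.0 / n_tgt)
    kernel = np.exp(-cost / reg)
    u = np.ones(n_src)
    for _ in range(iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
src_acts = rng.standard_normal((200, 4))   # 200 probe inputs, 4 source neurons
perm = [2, 0, 3]
tgt_acts = src_acts[:, perm]               # toy target: 3 neurons copying source neurons

# Cost is low when a source neuron and a target neuron fire together
corr = zscore(src_acts).T @ zscore(tgt_acts) / src_acts.shape[0]
plan = sinkhorn_plan(1.0 - corr)

# For each target neuron, the source neuron that sends it the most mass
matches = plan.argmax(axis=0)
print(matches)
```

Because the two layers have different widths, the plan is a soft correspondence rather than a permutation, which is what makes it usable across architectures.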
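[NLP-46] Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision
【Quick read】: This paper addresses biased learning and unreliable predictions in multilingual multi-label emotion classification caused by emotional ambiguity and incomplete supervision. Existing methods usually assume fully observed labels and use deterministic learning objectives, which handle the missing or heterogeneous emotion annotations common in practice poorly. The key to the solution is Reasoning under Ambiguity, an uncertainty-aware framework that combines a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous instances instead of treating missing labels as negatives, together with a mask-aware objective and positive-unlabeled regularization for robust learning under partial supervision. The method improves cross-lingual emotion classification performance, training stability, and robustness to annotation sparsity.
Link: https://arxiv.org/abs/2602.05471
Authors: Md. Mithun Hossaina,Mashary N. Alrasheedy,Nirban Bhowmick,Shamim Forhad,Md. Shakil Hossain,Sudipto Chaki,Md Shafiqul Islam
Affiliations: BUET (Bangladesh University of Engineering and Technology); UOH (University of Ha’il); ScholarX Academy; Uttara University; BUET (Bangladesh University of Engineering and Technology)
Categories: Computation and Language (cs.CL)
Comments: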
Abstract:Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
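The two ideas in the abstract, skipping unobserved labels and down-weighting ambiguous instances, can be sketched for a single training example as follows. The exact weighting used in the paper is not given; here the weight is assumed to be one minus the normalized mean binary entropy of the predicted label probabilities, and the positive-unlabeled regularizer is omitted. Function names are illustrative.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def ambiguity_weight(probs):
    """Close to 1 for confident predictions, close to 0 when every label is near 0.5."""
    mean_h = sum(binary_entropy(p) for p in probs) / len(probs)
    return 1.0 - mean_h / math.log(2)

def masked_weighted_bce(probs, labels, mask):
    """Binary cross-entropy over observed labels only, scaled by the ambiguity weight."""
    observed = [(p, y) for p, y, m in zip(probs, labels, mask) if m]
    if not observed:
        return 0.0
    bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in observed)
    return ambiguity_weight(probs) * bce / len(observed)

probs = [0.9, 0.2, 0.6]   # predicted probabilities for three emotion labels
mask = [1, 0, 1]          # the second label was never annotated

# The unobserved label does not affect the loss, whatever value it is given
loss_if_negative = masked_weighted_bce(probs, [1, 0, 1], mask)
loss_if_positive = masked_weighted_bce(probs, [1, 1, 1], mask)
print(loss_if_negative, loss_if_positive, ambiguity_weight([0.5, 0.5, 0.5]))
```

The design point is that a missing annotation contributes nothing rather than counting as a negative, while an instance whose predictions all sit near 0.5 is scaled toward zero weight.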
[NLP-47] MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
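【Quick read】: This paper addresses the difficulty of achieving both a high success rate and good generalization in Visual Language Navigation (VLN). Supervised Fine-Tuning (SFT) approaches typically reach a higher Success Rate (SR), while Training-Free (TF) approaches generalize better across scenes and open-vocabulary conditions, but the two are hard to obtain together. The authors propose a Memory-Execute-Review framework built from three cooperating modules: a hierarchical memory module that provides information support, an execute module for routine decisions and actions, and a review module that handles abnormal situations and corrects behavior. The design improves SR and generalization across multiple datasets, including in the Zero-Shot (ZS) setting.
Link: https://arxiv.org/abs/2602.05467
Authors: Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu
Affiliations: Amap, Alibaba Group; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: 9 pages, 2 figures, 5 tables, conference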
Abstract:Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
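[NLP-48] Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale
【Quick read】: This paper addresses how to organize and manage the context that Large Language Model (LLM) agents consume when operating structured systems so as to improve task accuracy. The core challenge is the lack of empirical guidance on structuring that context across models of differing capability, data formats, and scales. The study runs 9,649 experiments comparing context architectures (such as file-based retrieval versus table-based embedding) and formats (YAML, Markdown, JSON, TOON) on SQL generation. It finds that model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source models, and that architecture should be chosen per model type: file-based context helps frontier models but can hurt open source models substantially. It also shows that file size is not a reliable indicator of runtime efficiency, because compact formats can consume more tokens at scale due to format-unfamiliar search patterns. The resulting evidence-based guidance is that context engineering should be tailored to model capability rather than to assumed universal best practices.
Link: https://arxiv.org/abs/2602.05447
Authors: Damon McMillan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, 10 tables, 26 references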
Abstract:Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.
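The format result (chi-squared=2.45, p=0.484) can be checked by hand if one assumes the test compares the 4 formats on a binary correct/incorrect outcome, which gives 3 degrees of freedom. That degrees-of-freedom value is our assumption, as the abstract does not state it. For 3 degrees of freedom the upper-tail probability has a closed form, so the standard library is enough:

```python
import math

def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    normal_tail = math.erfc(math.sqrt(x / 2.0))
    correction = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)
    return normal_tail + correction

p_value = chi2_sf_df3(2.45)   # statistic reported in the abstract
print(round(p_value, 3))
```

Under that assumption the computed tail probability agrees with the reported p=0.484 to three decimals, consistent with the conclusion that format has no significant aggregate effect.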
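[NLP-49] Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
【Quick read】: This paper addresses the way safety alignment mechanisms in Large Language Models (LLMs), operating as latent variables, obscure the model's inherent capabilities. The core challenge is to identify and intervene on these hidden safety constraints without degrading task performance, so as to achieve robust jailbreaking. The key to the solution is to model the safety mechanism as an unobserved confounder from a causal-inference perspective and, based on Pearl's Front-Door Criterion, to propose the Causal Front-Door Adjustment Attack (CFA²) framework: Sparse Autoencoders (SAEs) are used to strip defense-related features and isolate the core task intent, and the computationally expensive marginalization is reduced to a low-complexity deterministic intervention, yielding an efficient and interpretable jailbreak attack.
Link: https://arxiv.org/abs/2602.05444
Authors: Yao Zhou,Zeen Song,Wenwen Qiang,Fengge Wu,Shuyi Zhou,Changwen Zheng,Hui Xiong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: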
Abstract:Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model’s inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, which is a framework that leverages Pearl’s Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
[NLP-50] Once Correct Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
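[Quick read]: This paper addresses the susceptibility of Vision-Language Models (VLMs) to visually incorrect information in cross-cultural settings: a model may answer image-grounded true statements correctly yet still accept counterfactual statements that contradict the image but are culturally plausible, especially in non-Western contexts and non-English languages. The key to the solution is M2CQA, a culturally grounded multimodal benchmark of images from 17 Middle East and North Africa (MENA) countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. The paper also introduces the CounterFactual Hallucination Rate (CFHR), a metric that isolates and measures how often a model wrongly accepts the counterfactual statement given that it answered the true statement correctly, allowing a more precise assessment of VLMs' cultural robustness and hallucination behavior.
Link: https://arxiv.org/abs/2602.05437
Authors: Basel Mousi,Fahim Dalvi,Shammur Chowdhury,Firoj Alam,Nadir Durrani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: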
Abstract:Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
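The CFHR definition in the abstract above (counterfactual acceptance conditioned on correctly answering the true statement) can be illustrated with a small sketch. The data layout and function name below are hypothetical; the paper's exact computation may differ in details such as how prompting strategies or dialects are aggregated.

```python
from typing import Iterable, Tuple


def cfhr(results: Iterable[Tuple[bool, bool]]) -> float:
    """Counterfactual acceptance rate among items answered correctly on the true statement.

    Each item is (true_statement_correct, counterfactual_accepted).
    """
    conditioned = [cf_accepted for true_ok, cf_accepted in results if true_ok]
    if not conditioned:
        return float("nan")  # undefined when no true statement was answered correctly
    return sum(conditioned) / len(conditioned)


# Toy data: 4 items pass the true statement, 2 of those accept the counterfactual.
toy = [(True, True), (True, False), (True, False), (False, True), (True, True)]
print(cfhr(toy))  # 0.5
```

Conditioning on the true statement is what separates this rate from raw accuracy: the item with a wrong true-statement answer is excluded, so it cannot inflate or deflate the hallucination rate.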
[NLP-51] Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation ACL
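[Quick read]: This paper addresses the weak performance of automatic evaluation metrics for Grammatical Error Correction (GEC). Conventional reference-based similarity measures such as BERTScore work poorly for GEC because many words in the source sentence remain unchanged in both the hypothesis and the reference, which distorts the similarity computation. The key to the solution is a new representation, the edit vector, which explicitly models GEC edits as transformations in vector space, together with Unbalanced Optimal Transport (UOT) to align the edit vectors of the hypothesis and the reference, yielding the new metric UOT-ERRANT. The method improves evaluation accuracy, most notably in the +Fluency domain, and its transport plan can be read as a soft edit alignment, which helps when analyzing GEC systems.
Link: https://arxiv.org/abs/2602.05419
Authors: Takumi Goto,Yusuke Sakai,Taro Watanabe
Affiliations: Nara Institute of Science and Technology (奈良科学技术大学院大学)
Categories: Computation and Language (cs.CL)
Comments: Accepted to TACL. This is a pre-MIT Press publication version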
Abstract:Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from this https URL.
[NLP-52] SciDef: Automating Definition Extraction from Academic Literature with Large Language Models SIGIR2026
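[Quick read]: This paper addresses the lack of automation in extracting definitions from scientific literature: as the number of publications surges, manually gathering definitions relevant to a given keyword has become very difficult. The key to the solution is SciDef, a pipeline based on Large Language Models (LLMs) that improves definition extraction through multi-step prompting and DSPy-optimized prompting, and that adopts a natural language inference (NLI)-based evaluation method for more reliable metrics. Experiments show that SciDef extracts 86.4% of the definitions in the test set, but future work should focus on identifying relevant definitions rather than simply producing more, to avoid redundant over-generation.
Link: https://arxiv.org/abs/2602.05413
Authors: Filip Kučera,Christoph Mandl,Isao Echizen,Radu Timofte,Timo Spinde
Affiliations: National Institute of Informatics (NII)(日本信息研究所); University of Würzburg (JMU)(维尔茨堡大学); University of Passau(帕绍大学)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Under Review - Submitted to SIGIR 2026 Resources Track; 8 pages, 6 figures, 4 tables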
Abstract:Definitions are the foundation for any scientific work, but with a significant increase in publication numbers, gathering definitions relevant to any keyword has become challenging. We therefore introduce SciDef, an LLM-based pipeline for automated definition extraction. We test SciDef on DefExtra and DefSim, novel datasets of human-extracted definitions and definition-pairs’ similarity, respectively. Evaluating 16 language models across prompting strategies, we demonstrate that multi-step and DSPy-optimized prompting improve extraction performance. To evaluate extraction, we test various metrics and show that an NLI-based method yields the most reliable results. We show that LLMs are largely able to extract definitions from scientific literature (86.4% of definitions from our test-set); yet future work should focus not just on finding definitions, but on identifying relevant ones, as models tend to over-generate them. Code and datasets are available at this https URL.
[NLP-53] H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration
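[Quick read]: This paper addresses a limitation of current Large Language Model (LLM) applications in hospital administrative automation: existing work focuses on patient-physician interaction or isolated administrative subtasks and fails to capture the complexity of real hospital administrative workflows. The key to the solution is H-AdminSim, an end-to-end simulation framework that combines realistic data generation with multi-agent simulation to model hospital administrative workflows systematically, and that uses FHIR (Fast Healthcare Interoperability Resources) integration to enable consistent testing across heterogeneous hospital environments, providing a standardized evaluation platform for LLM-driven administrative automation.
Link: https://arxiv.org/abs/2602.05407
Authors: Jun-Min Lee,Meong Hi Son,Edward Choi
Affiliations: KAIST(韩国科学技术院); Samsung Medical Center(三星医疗中心)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: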
Abstract:Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM-based automation. However, prior work has focused primarily on patient–physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H-AdminSim, a comprehensive end-to-end simulation framework that combines realistic data generation with multi-agent-based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H-AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM-driven administrative automation.
[NLP-54] OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
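[Quick read]: This paper addresses the training-efficiency bottleneck in pre-training caused by the gradual exhaustion of high-quality public text (the "Data Wall"). The core challenge is how to dynamically select, from massive low-quality or noisy data, the samples most valuable for model optimization. The key to the solution is OPUS (Optimizer-induced Projected Utility Selection), which defines data utility in the optimizer-induced update space: candidates are scored by projecting their effective updates onto a target direction derived from a stable, in-distribution proxy, so that data selection is tightly coupled to the optimization process. The method uses the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, adds only 4.7% compute overhead, and improves pre-training efficiency and quality across corpora, quality tiers, optimizers, and model scales.
Link: https://arxiv.org/abs/2602.05400
Authors: Shaobo Wang,Xuan Ouyang,Tianyi Xu,Yuzheng Hu,Jialin Liu,Guo Chen,Tianyu Zhang,Junhao Zheng,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University (上海交通大学); Alibaba (阿里巴巴)
Categories: Computation and Language (cs.CL)
Comments: 45 pages, 7 figures, 8 tables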
Abstract:As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
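The scoring-then-sampling idea in the abstract above can be sketched on toy vectors. This is only an illustration under loud assumptions: utilities here are plain dot products between a candidate's update vector and a target direction, whereas OPUS uses optimizer-shaped updates with the Ghost technique and CountSketch; the function names, the temperature parameter, and the sampling-without-replacement loop are all hypothetical.

```python
import math
import random
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def boltzmann_probs(scores: Sequence[float], temperature: float) -> List[float]:
    """Softmax over scores / temperature, shifted by the max for numerical stability."""
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def select_batch(updates, target, k: int, temperature: float = 1.0, seed: int = 0) -> List[int]:
    """Score candidates by projection onto `target`, then sample k distinct indices."""
    probs = boltzmann_probs([dot(u, target) for u in updates], temperature)
    rng = random.Random(seed)
    remaining = list(range(len(updates)))
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(remaining, weights=[probs[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Toy candidates (hypothetical update vectors) and a toy target direction.
toy_updates = [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.2], [0.9, 0.1]]
toy_target = [1.0, 0.0]
print(select_batch(toy_updates, toy_target, k=2))
```

Sampling instead of taking the top-k keeps lower-scored candidates reachable, which is the diversity role the abstract attributes to Boltzmann sampling; a lower temperature moves the behavior toward greedy top-k selection.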
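[NLP-55] Late-to-Early Training: LET LLMs Learn Earlier So Faster and Better
[Quick read]: This paper addresses the high computational cost and low training efficiency of pretraining Large Language Models (LLMs), in particular the lack of effective methods for using existing small pretrained models to accelerate the training of larger ones. The core of the solution is the Late-to-Early Training (LET) paradigm, which transfers knowledge by using late-layer representations of a pretrained model (i.e., one in a late training phase) to guide the early layers of the target model during early training steps. The method relies on two key mechanisms, late-to-early-step learning and late-to-early-layer learning, which markedly accelerate convergence and improve language modeling and downstream task performance; experiments on 1.4B and 7B parameter models demonstrate its efficiency and effectiveness.
Link: https://arxiv.org/abs/2602.05393
Authors: Ji Zhao,Yufei Gu,Shitong Shao,Xun Zhou,Liang Xiang,Zeke Xie
Affiliations: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: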
Abstract:As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late training phase) model. We identify two key mechanisms that drive LET’s effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET’s efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
[NLP-56] Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
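[Quick read]: This paper addresses the lack of context-sensitive metrics for evaluating the quality of children's utterances. Existing proxies, such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index), are dominated by utterance length and ignore conversational context, so they miss key dimensions of response quality such as reasoning depth, topic maintenance, and discourse planning. The key to the solution is an LLM-as-a-judge evaluation framework that first classifies the type of the preceding adult utterance and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's ability to advance the discourse). Both axes correspond to core mechanisms of child language development, show age-related patterns and predictive validity in empirical tests, and detect differences tied to discourse relations, enabling quantitative assessment of how meaningfully a child's utterance contributes to the conversation.
Link: https://arxiv.org/abs/2602.05392
Authors: Jiyun Chun,Eric Fosler-Lussier,Michael White,Andrew Perrault
Affiliations: The Ohio State University (俄亥俄州立大学)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: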
Abstract:Evaluating the quality of children’s utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child’s response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child’s contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child’s speech contributes to and advances the conversation within its context.
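[NLP-57] IESR: Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models
[Quick read]: This paper addresses the performance bottlenecks of current Text-to-SQL methods on complex reasoning, domain knowledge understanding, and hypothetical queries, as well as the cost of enterprise deployment caused by large model sizes. The key to the solution is IESR (Information Enhanced Structured Reasoning), a framework for lightweight large language models: first, it uses an LLM for key information understanding and schema linking and decouples mathematical computation from SQL generation; second, it introduces a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) combined with majority voting to make reasoning more robust; third, it adds a trajectory consistency verification module that uses a discriminator model to ensure the accuracy and logical consistency of the generated SQL. Experiments show that, without fine-tuning, the framework achieves leading performance on complex reasoning benchmarks such as LogicCat and Archer.
Link: https://arxiv.org/abs/2602.05385
Authors: Tao Liu,Jiafan Lu,Bohan Yu,Pengcheng Wu,Liu Haixin,Guoyu Xu,Li Xiangheng,Lixiao Li,Jiaming Hou,Zhao Shijun,Xinglin Lyu,Kunli Zhang,Yuxiang Jia,Hongyin Zan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 16 figures, 8 tables. Hongyin Zan is corresponding author, Jiafan Lu is first co-author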
Abstract:Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose IESR (Information Enhanced Structured Reasoning), a framework for lightweight large language models that (i) leverages LLMs for key information understanding and schema linking and decouples mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at this https URL.
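The majority-voting step mentioned in the abstract above can be sketched as follows. This is a generic self-consistency vote, not the IESR implementation: voting on execution results (rather than on SQL strings) is an assumption, the MCTS search and discriminator are omitted, and all names are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Optional, Tuple


def majority_vote(candidates: List[Tuple[str, Optional[Hashable]]]) -> Optional[str]:
    """Pick the SQL whose execution result is most common among the candidates.

    Each candidate is (sql, execution_result); a result of None marks a failed execution.
    """
    valid = [(sql, result) for sql, result in candidates if result is not None]
    if not valid:
        return None
    winning_result, _ = Counter(result for _, result in valid).most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(sql for sql, result in valid if result == winning_result)


# Toy reasoning paths: two agree on the result 3, one disagrees, one fails to execute.
paths = [
    ("SELECT COUNT(*) FROM t", 3),
    ("SELECT COUNT(id) FROM t", 3),
    ("SELECT MAX(id) FROM t", 7),
    ("SELEC oops", None),
]
print(majority_vote(paths))  # SELECT COUNT(*) FROM t
```

Voting on results lets syntactically different but equivalent queries support each other, which is one common reason to combine multi-path sampling with a vote.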
[NLP-58] Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks EACL2026
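[Quick read]: This paper addresses language bias in Large Language Models (LLMs) for medical question answering: models perform well in English but degrade markedly in Arabic and other low-resource languages, for reasons that are not yet well understood. The key to the solution is a cross-lingual empirical analysis of how language differences affect model performance: it finds structural fragmentation in the tokenization of Arabic medical text and shows that model-reported confidence and explanations correlate only weakly with correctness, underscoring the need for language-aware design and evaluation strategies to improve the robustness and reliability of LLMs in multilingual medical settings.
Link: https://arxiv.org/abs/2602.05374
Authors: Chaimae Abouzahir,Congbo Ma,Nizar Habash,Farah E. Shamout
Affiliations: New York University Abu Dhabi (纽约大学阿布扎比分校)
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to HeaLing-EACL 2026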
Abstract:In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
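[NLP-59] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning
[Quick read]: This paper addresses policy collapse and diminishing returns in Iterative Direct Preference Optimization (DPO-R1) for mathematical reasoning, which result from over-reliance on large-scale Best-of-N sampling (N ≥ 8). The core challenge is that as N grows, verifier noise is amplified and induces harmful distribution shift, weakening the model's ability to identify high-quality trajectories. The key to the solution is PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy and, under a minimal sampling budget (2 < N < 3), synthesizes high-fidelity preference pairs from failed explorations, mitigating reward hacking and label noise and giving more robust and efficient alignment.
Link: https://arxiv.org/abs/2602.05370
Authors: Jun Rao,Zixiong Yu,Xuebo Liu,Guhan Chen,Jing Li,Jiansheng Wei,Xiaojun Meng,Min Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: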
Abstract:Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., N ≥ 8) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling N amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget (2 < N < 3), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 (N=16) while using only about 1/5 of the compute, demonstrating superior robustness against reward hacking and label noise.
[NLP-60] Multi-Field Tool Retrieval
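[Quick read]: This paper addresses the efficiency and accuracy of tool retrieval when Large Language Models (LLMs) call external tools, in particular the limitations of treating tool retrieval as a traditional ad-hoc retrieval task. Existing methods match user queries directly against raw tool documentation and are limited by the incompleteness and structural inconsistency of that documentation, by the semantic and granularity mismatch between user intent and technical documents, and by the multi-aspect nature of tool utility (dimensions such as functionality, input constraints, and output formats differ markedly in form and importance). The authors propose Multi-Field Tool Retrieval, a framework whose core is fine-grained multi-field modeling that aligns user intent with tool representations, yielding more accurate, robust, and generalizable tool selection.
Link: https://arxiv.org/abs/2602.05366
Authors: Yichen Tang,Weihang Su,Yiqun Liu,Qingyao Ai
Affiliations: DCST, Tsinghua University (清华大学)
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures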
Abstract:Integrating external tools enables Large Language Models (LLMs) to interact with real-world environments and solve complex tasks. Given the growing scale of available tools, effective tool retrieval is essential to mitigate constraints of LLMs’ context windows and ensure computational efficiency. Existing approaches typically treat tool retrieval as a traditional ad-hoc retrieval task, matching user queries against the entire raw tool documentation. In this paper, we identify three fundamental challenges that limit the effectiveness of this paradigm: (i) the incompleteness and structural inconsistency of tool documentation; (ii) the significant semantic and granular mismatch between user queries and technical tool documents; and, most importantly, (iii) the multi-aspect nature of tool utility, that involves distinct dimensions, such as functionality, input constraints, and output formats, varying in format and importance. To address these challenges, we introduce Multi-Field Tool Retrieval, a framework designed to align user intent with tool representations through fine-grained, multi-field modeling. Experimental results show that our framework achieves SOTA performance on five datasets and a mixed benchmark, exhibiting superior generalizability and robustness.
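[NLP-61] AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
[Quick read]: This paper addresses the black-box behavior of current generative-AI agentic systems, which makes them hard to interpret and control, especially when they run opaque internal workflows. The core challenge is how to reconstruct an interpretable, editable, explicit agent workflow using only input-output access. The key to the solution is AgentXRay, which formulates Agentic Workflow Reconstruction (AWR) as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space, and enhances Monte Carlo Tree Search with a scoring-based Red-Black Pruning mechanism to explore the vast search space efficiently. Under a fixed iteration budget it achieves higher output similarity and lower token consumption, producing a white-box workflow that approximates the behavior of the original black-box system.
Link: https://arxiv.org/abs/2602.05353
Authors: Ruijie Shi,Houbin Zhang,Yuecheng Han,Yuheng Wang,Jingru Fan,Runde Yang,Yufan Dang,Huatao Li,Dewen Liu,Yuan Cheng,Chen Qian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: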
Abstract:Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable stand-in workflow that approximates a black-box system using only input–output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.
[NLP-62] How Do Language Models Acquire Character-Level Information? EACL2026
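[Quick read]: This paper addresses how Language Models (LMs) implicitly encode character-level knowledge even though character-level information is never explicitly provided. The key to the solution is to compare LMs trained under controlled settings (for example, with a specified pre-training dataset or tokenizer) with LMs trained under standard settings, and to systematically identify and categorize the factors that affect the acquisition of character-level knowledge. The analysis finds that merge rules and orthographic constraints are the main factors arising from tokenization, while semantic associations of substrings and syntactic information are key factors independent of tokenization.
Link: https://arxiv.org/abs/2602.05347
Authors: Soma Sato,Ryohei Sasano
Affiliations: Nagoya University (名古屋大学)
Categories: Computation and Language (cs.CL)
Comments: Accepted to EACL 2026 Main Conference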
Abstract:Language models (LMs) have been reported to implicitly encode character-level information, despite not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those arising from tokenization and those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
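[NLP-63] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning
[Quick read]: This paper addresses the problem that Large Reasoning Models (LRMs) are computationally expensive at inference time and prone to redundant reasoning, while Small Language Models (SLMs) are efficient but struggle with multi-step reasoning. The core solution is MentorCollab, an inference-time collaboration mechanism in which the LRM guides the SLM sparsely and selectively: divergences between the LRM and the SLM are probed at randomly sampled token positions, and a lightweight verifier decides whether the SLM should adopt a short lookahead segment from the mentor model or continue generating on its own. This avoids the verbose reasoning caused by imitation-style collaboration; with only 18.4% of tokens generated by the expensive LRM on average, it improves SLM performance across several domains (math reasoning, general knowledge, and commonsense reasoning), with average gains of 3.0% and up to 8.0%, showing that selective inference-time guidance can recover large-model reasoning ability without substantial inference overhead.
Link: https://arxiv.org/abs/2602.05307
Authors: Haojin Wang,Yike Wang,Shangbin Feng,Hannaneh Hajishirzi,Yulia Tsvetkov
Affiliations: UIUC(伊利诺伊大学厄巴纳-香槟分校); University of Washington(华盛顿大学); Allen Institute for Artificial Intelligence(艾伦人工智能研究所)
Categories: Computation and Language (cs.CL)
Comments: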
Abstract:Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM–LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
[NLP-64] FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
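[Quick read]: This paper addresses the substantial computational overhead that block diffusion incurs in long-sequence generation from repeatedly computing attention, especially as the KV cache grows with generation steps. The core solution is FlashBlock, which identifies and exploits the cross-step redundancy of block-external attention outputs: attention outputs from tokens outside the current block remain stable across diffusion steps, whereas block-internal attention varies considerably. FlashBlock caches and reuses these stable block-external attention results, reducing attention computation and KV cache accesses without changing the diffusion process. The method is orthogonal to sparse attention and can serve as a complementary residual reuse strategy that substantially improves model accuracy under aggressive sparsification; experiments show up to 1.44× higher token throughput and up to 1.6× lower attention time, with negligible impact on generation quality.
Link: https://arxiv.org/abs/2602.05305
Authors: Zhuokun Chen,Jianfei Cai,Bohan Zhuang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: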
Abstract:Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: this https URL.
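Splitting attention into a block-external part and a block-internal part, as the abstract above describes, relies on a standard identity: softmax attention over a full KV set equals a weighted combination of the attention over each subset, with weights given by each subset's log-sum-exp. The sketch below demonstrates only that identity in plain Python. Whether FlashBlock recombines its cached output in this way is an assumption, the function names are hypothetical, and the usual 1/sqrt(d) scaling is omitted.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def partial_attention(q: Sequence[float], keys: Sequence[Vector],
                      values: Sequence[Vector]) -> Tuple[Vector, float]:
    """Softmax attention of one query over a KV subset; also returns the scores' log-sum-exp."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]
    return out, top + math.log(norm)


def merge_partials(out_a: Vector, lse_a: float, out_b: Vector, lse_b: float) -> Vector:
    """Combine two partial attention outputs into the output over the union of their KV sets."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]


# Toy data: three block-external tokens (cacheable) and two block-internal tokens.
q = [0.5, -1.0]
ext_k, ext_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
int_k, int_v = [[-1.0, 0.5], [0.2, 0.3]], [[0.0, 1.0], [2.0, 0.0]]

ext_out, ext_lse = partial_attention(q, ext_k, ext_v)   # the part a cache could hold
int_out, int_lse = partial_attention(q, int_k, int_v)   # recomputed every step
merged = merge_partials(ext_out, ext_lse, int_out, int_lse)
full, _ = partial_attention(q, ext_k + int_k, ext_v + int_v)
print(max(abs(m - f) for m, f in zip(merged, full)))
```

The merge is exact when both parts use the same query. Reusing an external part computed at an earlier diffusion step, when the query has since changed, is an approximation, and the abstract's stability observation is what would justify it.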
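[NLP-65] Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities
[Quick read]: This paper addresses the convergence to low-entropy policies of policy optimization methods for Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Model (LLM) reasoning tasks, that is, mode collapse that severely limits output diversity. The key solution is an Advantage Re-weighting Mechanism (ARM) that incorporates Prompt Perplexity and Answer Confidence to adjust the advantage estimate dynamically, attenuating gradient updates for over-confident reasoning paths while redistributing probability mass toward under-explored correct solutions, so that confidence is equalized across all correct responses. This improves generative diversity and response entropy while maintaining competitive accuracy, and on mathematical and coding benchmarks it clearly outperforms standard Group Relative Policy Optimization (GRPO).
Link: https://arxiv.org/abs/2602.05281
Authors: Pengyi Li,Elizaveta Goncharova,Andrey Kuznetsov,Ivan Oseledets
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: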
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
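[NLP-66] Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
[Quick read]: This paper addresses the performance bottleneck that the "Memory Wall" imposes on deploying Large Language Models (LLMs) on edge devices, where memory bandwidth becomes the main constraint on compute efficiency. Conventional 1.58-bit quantization sharply reduces memory footprint but typically degrades perplexity by 20-25% relative to an FP16 baseline. The authors propose the Hybrid Gated Flow (HGF) architecture, whose core innovation is a dual-stream design that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. On the TinyStories dataset, validation loss drops from BitNet's 1.0294 to 0.9306, recovering about 55% of the quality gap with only about 12-15% additional memory overhead, and training is more stable, an emergent phenomenon the paper describes as quantization as structural regularization.
Link: https://arxiv.org/abs/2602.05269
Authors: David Alejandro Trejo Pizzo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 4 figures, 6 tables. Code and models will be released at this http URL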
Abstract:The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the “Memory Wall” – a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet’s 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.
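The "approximately 55%" recovery figure can be checked from the three validation losses quoted in the abstract. The sketch below is only an arithmetic check; the function name `recovered_fraction` is a hypothetical helper introduced here, not something from the paper.

```python
# Validation losses quoted in the abstract (TinyStories).
BITNET_LOSS = 1.0294  # pure ternary baseline
HGF_LOSS = 0.9306     # HGF 5.4
FP16_LOSS = 0.8490    # full-precision baseline


def recovered_fraction(ternary: float, hybrid: float, full: float) -> float:
    """Share of the ternary-to-FP16 loss gap that the hybrid model closes."""
    return (ternary - hybrid) / (ternary - full)


fraction = recovered_fraction(BITNET_LOSS, HGF_LOSS, FP16_LOSS)
print(f"recovered share of the gap: {fraction:.1%}")
```

The gap between BitNet and FP16 is 0.1804 and HGF closes 0.0988 of it, which is consistent with the rounded figure in the abstract.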
[NLP-67] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
[Quick read]: This paper addresses the inconsistent evolution of response length when Reinforcement Learning with Verifiable Rewards (RLVR) algorithms are used to train Large Language Models (LLMs) and Vision-Language Models (VLMs). Such fluctuations interfere with steady gains in reasoning ability; in particular, mainstream algorithms such as Group Sequence Policy Optimization (GSPO) carry a length bias that leads to response length collapse. The key to the solution is Length-Unbiased Sequence Policy Optimization (LUSPO), which corrects the length-dependent term in the GSPO loss so that it is unbiased with respect to response length, yielding more stable growth in reasoning ability and better performance.
Link: https://arxiv.org/abs/2602.05261
Authors: Fanfan Liu,Youyang Yin,Peng Shi,Siqi Yang,Zhixiong Zeng,Haibo Qiu
Affiliations: Meituan
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
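[NLP-68] CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs
[Quick read]: This paper addresses two demands placed on Rotary Positional Embedding (RoPE) when the context length of Large Language Models (LLMs) is extended: handling out-of-distribution (OOD) positions, and keeping attention guided by semantic similarity. Existing methods usually follow one of two separate paradigms, OOD mitigation or semantic modeling, and struggle to satisfy both. The paper proposes a minimalist intervention, CoPE, which soft-clips the low-frequency components of RoPE; this removes OOD outliers, refines semantic signals, and avoids the spectral leakage caused by hard clipping. Experiments report significant gains at context lengths up to 256k, which the authors present as validating their theoretical analysis and establishing CoPE as a new state of the art for length generalization.
Link: https://arxiv.org/abs/2602.05258
Authors: Haoran Li,Sucheng Ren,Alan Yuille,Feng Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: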
Abstract:Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at this https URL.
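To make the contrast between hard and soft clipping concrete, here is a minimal sketch under stated assumptions. The inverse-frequency formula is the standard RoPE definition. The cosine taper, the `cutoff` and `width` parameters, and both weight functions are hypothetical illustrations: the abstract does not give CoPE's actual clipping function or say how the weights enter attention.

```python
import math


def rope_inverse_frequencies(head_dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def hard_clip_weights(num_freqs: int, cutoff: int) -> list[float]:
    """Keep components below `cutoff`, drop the rest abruptly (hypothetical)."""
    return [1.0 if i < cutoff else 0.0 for i in range(num_freqs)]


def soft_clip_weights(num_freqs: int, cutoff: int, width: int) -> list[float]:
    """Cosine taper from 1 to 0 over `width` indices from `cutoff` (hypothetical)."""
    weights = []
    for i in range(num_freqs):
        if i < cutoff:
            weights.append(1.0)
        elif i >= cutoff + width:
            weights.append(0.0)
        else:
            weights.append(0.5 * (1.0 + math.cos(math.pi * (i - cutoff) / width)))
    return weights


freqs = rope_inverse_frequencies(head_dim=16)
hard = hard_clip_weights(len(freqs), cutoff=4)
soft = soft_clip_weights(len(freqs), cutoff=4, width=4)
for i, (f, h, s) in enumerate(zip(freqs, hard, soft)):
    print(f"i={i} freq={f:.2e} hard={h:.1f} soft={s:.3f}")
```

Higher indices correspond to lower frequencies, so both schemes attenuate the slowest-rotating components; the soft variant does so gradually instead of with a step.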
[NLP-69] Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks
[Quick read]: This paper addresses the difficulty of identifying and assessing copyright risks in content generated by Large Language Models (LLMs), focusing on verbatim memorization and paraphrase-level leakage of copyrighted training data. The key to the solution is Copyright Detective, an interactive forensic system that treats copyright compliance as an evidence discovery process rather than a static classification task. It integrates content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification in a unified, extensible framework, and supports systematic auditing through interactive prompting, response collection, and iterative workflows, even with only black-box access.
Link: https://arxiv.org/abs/2602.05252
Authors: Guangwei Zhang,Jianing Zhu,Cheng Qian,Neil Gong,Rada Mihalcea,Zhaozhuo Xu,Jingrui He,Jiaqi Ma,Yun Huang,Chaowei Xiao,Bo Li,Ahmed Abbasi,Dongwon Lee,Heng Ji,Denghui Zhang
Affiliations: Pine AI; The University of Texas at Austin; University of Illinois Urbana-Champaign; Duke University; University of Michigan; Stevens Institute of Technology; Johns Hopkins University; University of Notre Dame; The Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.
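[NLP-70] FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters
[Quick read]: This paper addresses the tension between privacy protection and efficient collaboration in federated Retrieval-Augmented Generation (FedRAG), where knowledge bases are distributed across silos. Conventional RAG relies on a centralized corpus and does not suit privacy-sensitive settings, while using parametric adapters in a federated setting raises two challenges: per-document adapters incur high storage and communication costs, and indiscriminately merging adapters causes conflicts and degraded performance. FedMosaic responds with two ideas. First, it clusters semantically related documents into shared multi-document adapters and adds document-specific masks to preserve fine-grained information, cutting storage and communication costs. Second, it aggregates adapters selectively, combining only relevance-aligned, non-conflicting ones to avoid destructive aggregation. Experiments report 10.9% higher average accuracy across four task categories, storage costs reduced by 78.8% to 86.3%, and communication costs reduced by 91.4%, without sharing raw documents.
Link: https://arxiv.org/abs/2602.05235
Authors: Zhilin Liang,Yuxiang Wang,Zimu Zhou,Hainan Zhang,Boyi Liu,Yongxin Tong
Affiliations: Beihang University; City University of Hong Kong; Beijing Advanced Innovation Center
Categories: Computation and Language (cs.CL)
Comments: 11 pages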
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, nonconflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
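[NLP-71] Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions ICLR2026
[Quick read]: This paper addresses the tendency of intervention-based model steering methods to overfit, underperform, and sometimes produce unnatural outputs because they borrow strong optimization objectives from fine-tuning. The authors argue that effective steering requires faithfully identifying internal model mechanisms rather than enforcing external preferences. Their method, Concept DAS (CDAS), builds on the distributed interchange intervention (DII) mechanism of distributed alignment search (DAS) and adds a distribution matching objective tailored to steering: interventions are learned by aligning the intervened output distribution with a counterfactual distribution. Compared with prior work, CDAS learns interventions through weakly supervised distribution matching rather than probability maximization, and uses DII to enable bi-directional steering with steering factors derived from data, which reduces hyperparameter tuning and improves the faithfulness and stability of control.
Link: https://arxiv.org/abs/2602.05234
Authors: Yuntai Bao,Xuhong Zhang,Jintao Chen,Ge Su,Yuxiang Cai,Hao Peng,Bing Sun,Haiqin Weng,Liu Yan,Jianwei Yin
Affiliations: Zhejiang University; Ningbo Global Innovation Center, Zhejiang University; Zhejiang Key Laboratory of Digital-Intelligence Service Technology; Zhejiang Normal University; National Certification Technology (Hangzhou) Co., Ltd; Ant Group
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 55 pages, 25 figures; accepted for ICLR 2026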
Abstract:Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak-supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering. Our code is available at this https URL.
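[NLP-72] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
[Quick read]: This paper addresses the reliance of current audio foundation models on task-specific supervision, which handles isolated factors of audio rather than understanding it as a whole. The key to the solution is Bagpiper, an 8B-parameter audio foundation model that uses rich captions, i.e., comprehensive natural language descriptions covering concepts such as transcription and audio events, as a bridge between raw audio and a high-level conceptual space, establishing a bidirectional mapping between the two. The model is pre-trained on 600B tokens and fine-tuned with a caption-then-process workflow that simulates an intermediate cognitive reasoning step, so it can handle diverse audio understanding and generation tasks without task-specific priors. It is reported to outperform Qwen-2.5-Omni on MMAU and AIRBench and to surpass CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects.
Link: https://arxiv.org/abs/2602.05220
Authors: Jinchuan Tian,Haoran Wang,Bo-Hao Su,Chien-yu Huang,Qingzheng Wang,Jiatong Shi,William Chen,Xun Gong,Siddhant Arora,Chin-Jou Li,Masao Someki,Takashi Maekaku,Yusuke Shinohara,Jin Sakuma,Chao-Han Huck Yang,Shinji Watanabe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments: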
Abstract:Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding generation for general audio. Model, data, and code are available at Bagpiper Home Page.
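[NLP-73] Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective
[Quick read]: This paper addresses the fact that existing analyses of knowledge proximity between academia and industry rely on macro indicators, such as counts of collaborative papers or patents, and do not characterize fine-grained knowledge units in the literature, leaving the mechanisms of knowledge interaction poorly understood. The key to the solution is a two-dimensional quantification over fine-grained knowledge entities and semantic space. First, entities are extracted with pre-trained models, sequence overlap is measured with cosine similarity, and topological features are analyzed with complex network analysis. Second, unsupervised contrastive learning quantifies cross-institutional textual similarity in semantic space to measure semantic convergence. Finally, citation distribution patterns are used to examine the correlation between bidirectional knowledge flows and similarity. The approach traces the co-evolution of academia and industry under technological change, including shifts in knowledge dominance.
Link: https://arxiv.org/abs/2602.05211
Authors: Hongye Zhao,Yi Zhao,Chengzhi Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments: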
Abstract:The academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy the limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. In the entity measurement part, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia’s knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at this https URL.
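As an illustration of the entity-level measurement, the sketch below computes cosine similarity between two entity-frequency profiles. The entity names and counts are made up for the example, and representing each side as a bag of entity counts is an assumption; the abstract does not specify the exact representation.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical entity counts extracted from each side's papers.
academia = Counter({"transformer": 3, "bert": 1})
industry = Counter({"transformer": 3, "gpu": 1})

similarity = cosine_similarity(academia, industry)
print(f"entity-level proximity: {similarity:.2f}")
```

Only the shared entity contributes to the dot product, so the score falls as each side accumulates entities the other does not use.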
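[NLP-74] Aligning Large Language Model Behavior with Human Citation Preferences
[Quick read]: This paper addresses the marked gap between how Large Language Models (LLMs) select content to cite and what human users prefer. It studies how LLMs judge cite-worthiness and how well that behavior aligns with human citation needs. The key to the solution is a structured dataset that categorizes web-derived texts into eight citation-motivation types and captures fine-grained contrasts through pairwise preference evaluation across type combinations. On this basis, Direct Preference Optimization (DPO) is used to calibrate model behavior toward human citation preferences, with the aim of making outputs more credible and useful.
Link: https://arxiv.org/abs/2602.05205
Authors: Kenichiro Ando,Tatsuya Harada
Affiliations: RIKEN AIP; The University of Tokyo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Work In Progress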
Abstract:Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by -22.6% relative to humans) and sentences containing personal names (by -20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
[NLP-75] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
[Quick read]: This paper addresses the growing need to detect and moderate harmful content on social media, and in particular asks whether open-weight Large Language Models (LLMs) can match proprietary models at zero-shot moderation. The authors evaluate seven state-of-the-art models (four proprietary, three open-weight) on real Bluesky posts with human annotations, comparing sensitivity and specificity. Open-weight LLMs come close to or match proprietary ones (sensitivity 81%–97% vs. 72%–98%; specificity 91%–100% vs. 93%–99%), and detection characteristics differ across categories such as rudeness, intolerance, and threats. The results suggest that open-weight LLMs can support privacy-preserving moderation and inform moderation systems that balance community values with individual user preferences.
Link: https://arxiv.org/abs/2602.05189
Authors: Hsuan-Yu Chou,Wajiha Naveed,Shuyan Zhou,Xiaowei Yang
Affiliations: Duke University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:
Abstract:As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those (72%–98%, and 93%–99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
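For readers less familiar with the two metrics compared above, the sketch below computes sensitivity (true positive rate) and specificity (true negative rate) from binary moderation labels. The labels and predictions are made-up toy data, not results from the paper.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels, 1 meaning harmful."""
    tp = fn = tn = fp = 0
    for label, prediction in zip(labels, predictions):
        if label == 1 and prediction == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif prediction == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data: 4 harmful posts and 6 benign posts.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(labels, predictions)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

A moderator that flags more aggressively raises sensitivity at the expense of specificity, which is why the abstract reports the two together.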
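[NLP-76] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
[Quick read]: This paper addresses the inefficiency and cost of model collaboration systems, where combining the strengths of multiple language models (LMs) requires loading and running all of them. The key to the solution is the single-multi evolution loop: multiple models collaborate to produce outputs, each model then distills from those collaborative outputs and thereby absorbs the collaboration pattern, and the distilled models collaborate again, forming a self-evolving ecosystem. Individual models improve by 8.0% on average at the inference cost of a single model, and the collaboration system improves by 14.9% on average over the initial system without evolution.
Link: https://arxiv.org/abs/2602.05182
Authors: Shangbin Feng,Kishan Panaganti,Yulia Tsvetkov,Wenhao Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Code at this https URL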
Abstract:Model collaboration – systems where multiple language models (LMs) collaborate – combines the strengths of diverse models with cost in loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles to.
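[NLP-77] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
[Quick read]: This paper addresses the safety risks that arise when some models in a multi-LLM collaboration system are malicious or compromised. In decentralized collaboration settings such as routing systems, multi-agent debate, and model merging, malicious models substantially degrade performance, lowering it by 7.12% and 7.94% on average in the reasoning and safety domains, respectively. The key to the solution is to employ external supervisors that oversee the collaboration and disable or mask out malicious components. These strategies recover 95.31% of the initial performance on average, though making collaboration systems fully resistant to malicious models remains an open question.
Link: https://arxiv.org/abs/2602.05176
Authors: Ziyuan Yang,Wenxuan Ding,Shangbin Feng,Yulia Tsvetkov
Affiliations: University of Washington; New York University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 15 tables, 4 figures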
Abstract:Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plug them into four types of popular model collaboration systems, and evaluate the compromised system across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
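[NLP-78] EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
[Quick read]: This paper addresses stability problems in policy optimization for Reinforcement Learning with Verifiable Rewards (RLVR), specifically the high estimator variance of Group Relative Policy Optimization (GRPO) at small group sizes and its vanishing gradient signal in failure regimes. The key to the solution is Empirical Bayes Policy Optimization (EBPO), which regularizes local group baselines by borrowing strength from the policy's accumulated global statistics: a shrinkage estimator dynamically balances local group statistics against a global prior updated with Welford's online algorithm. The authors claim lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals, along with improved training stability and performance.
Link: https://arxiv.org/abs/2602.05165
Authors: Kevin Han,Yuhang Zhou,Mingze Gao,Gedi Zhou,Serena Li,Abhishek Kumar,Xiangjun Fan,Weiwei Li,Lizhu Zhang
Affiliations: Meta AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: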
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy’s accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford’s online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
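The idea of shrinking a group baseline toward a running global prior can be sketched as follows. Welford's update is the standard online algorithm; the pseudo-count weighting and the `prior_strength` parameter are assumptions made for illustration and are not taken from the paper, whose exact estimator is not given in the abstract. The group-relative advantage here is mean-centered only, with standard deviation normalization omitted.

```python
class Welford:
    """Running mean and variance via Welford's online algorithm."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def shrunk_baseline(group_rewards, prior, prior_strength=4.0):
    """Blend the group mean with the global mean (hypothetical weighting)."""
    n = len(group_rewards)
    group_mean = sum(group_rewards) / n
    weight = n / (n + prior_strength)  # larger groups trust local data more
    return weight * group_mean + (1.0 - weight) * prior.mean


# Global prior built from rewards seen earlier in training (toy values).
prior = Welford()
for reward in [1, 0, 1, 1, 0, 0, 1, 0]:
    prior.update(reward)

# A saturated failure group: every response gets zero reward.
failed_group = [0.0, 0.0, 0.0, 0.0]
group_mean = sum(failed_group) / len(failed_group)
local_advantages = [r - group_mean for r in failed_group]
baseline = shrunk_baseline(failed_group, prior)
shrunk_advantages = [r - baseline for r in failed_group]

print("group-mean baseline:", local_advantages)
print("shrunk baseline:    ", shrunk_advantages)
```

With a purely local baseline the all-zero group yields zero advantages and hence no learning signal, while the shrunk baseline leaves a negative advantage for each failed response, which matches the "non-vanishing penalty signals" described in the abstract.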
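[NLP-79] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
[Quick read]: This paper addresses the lack of reliable, native-sourced benchmarks for evaluating Large Language Models (LLMs) in Greek. Existing datasets are mostly machine-translated from English and fail to reflect Greek linguistic and cultural characteristics. The key to the solution is GreekMMLU, a benchmark of 21,805 multiple-choice questions across 45 subject areas, all sourced from Greek academic, professional, and governmental exams, annotated with educational difficulty levels and organized under a new subject taxonomy. The authors publicly release 16,857 samples and reserve 4,948 for a private leaderboard to enable robust, contamination-resistant evaluation. The results reveal substantial performance gaps between frontier and open-weight models, and between Greek-adapted and general multilingual models.
Link: https://arxiv.org/abs/2602.05150
Authors: Yang Zhang,Mersin Konomi,Christos Xypolopoulos,Konstantinos Divriotis,Konstantinos Skianis,Giannis Nikolentzos,Giorgos Stamou,Guokan Shang,Michalis Vazirgiannis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: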
Abstract:Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance, including model scale, adaptation, and prompting, and derive insights for improving LLM capabilities in Greek.
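[NLP-80] SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers
[Quick read]: This paper addresses the reliance of current social intelligence evaluations for Large Language Models (LLMs) on idealized communication: existing benchmarks do not simulate the communication barriers that cognitive differences cause in the real world, which limits diagnosis of whether LLMs can maintain and repair interactions in imperfect settings. The key to the solution is SocialVeil, a social learning environment grounded in a systematic literature review of communication challenges in human interaction. It introduces three representative types of disruption (semantic vagueness, sociocultural mismatch, and emotional interference) and two barrier-aware metrics (unresolved confusion and mutual understanding) for evaluating interaction quality under impaired communication. Experiments show that the barriers substantially impair model performance and that adaptation strategies such as repair instruction and interactive learning have only a modest effect.
Link: https://arxiv.org/abs/2602.05115
Authors: Keyang Xuan,Pengda Wang,Chongrui Ye,Haofei Yu,Tal August,Jiaxuan You
Affiliations: Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Department of Psychological Sciences, Rice University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages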
Abstract:Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present SocialVeil, a social learning environment that can simulate social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, SocialVeil introduces three representative types of such disruption: semantic vagueness, sociocultural mismatch, and emotional interference. We also introduce two barrier-aware evaluation metrics, unresolved confusion and mutual understanding, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45% on average, and confusion elevated by nearly 50%. Human evaluations validate the fidelity of these simulated barriers (ICC ≈ 0.78, Pearson r ≈ 0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive learning) only have a modest effect far from barrier-free performance. This work takes a step toward bringing social interaction environments closer to real-world communication, opening opportunities for exploring the social intelligence of LLM agents.
[NLP-81] Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text
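[Quick read]: This paper tackles implicit discourse relation classification across languages and modalities, that is, how to use multilingual and multimodal information to infer the logical relations in discourse when no explicit connective is present. The key to the solution is to build a multilingual, multimodal dataset covering English, French, and Spanish, and to propose a Qwen2-Audio-based fusion method that jointly models textual and acoustic features for implicit discourse relation recognition across languages and modalities. Experiments show that text-based models outperform audio-based models, that combining both modalities can further improve performance, and that cross-lingual transfer brings substantial gains for low-resource languages.
Link: https://arxiv.org/abs/2602.05107
Authors: Ahmed Ruby,Christian Hardmeier,Sara Stymne
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: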
Abstract:Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.
[NLP-82] Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models
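[Quick read]: This paper addresses the performance limits that scarce labeled data places on training generative AI models, focusing on the practical difficulty of using Transformer-based large language models (LLMs) to ease data scarcity through synthetic data generation. Current practice relies on empirical adjustments such as the temperature parameter and lacks predictability and theoretical guarantees for synthetic data quality. The proposed solution is Data Kernel Perspective Space (DKPS), a mathematical framework that provides provable statistical guarantees on the quality of a Transformer model's synthetic outputs and maps those guarantees onto the performance of downstream tasks, such as neural machine translation or LLMs trained with Contrastive Preference Optimization (CPO), giving synthetic data generation a theoretical basis and a quantifiable means of evaluation.
Link: https://arxiv.org/abs/2602.05106
Authors: Michael Browder,Kevin Duh,J. David Harris,Vince Lyzinski,Paul McNamee,Youngser Park,Carey E. Priebe,Peter Viechnicki
Affiliations: University of Maryland, College Park; Johns Hopkins University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: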
Abstract:Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models – particularly LLMs – are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to ‘fiddle’ with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
[NLP-83] Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories
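[Quick read]: This paper addresses the continual learning challenge that large language models (LLMs) face in test-time training, in particular how to store historical information efficiently while avoiding catastrophic forgetting. The core of the solution is Locas (Locally-Supported parametric memory), a new parametric memory mechanism modeled on the feed-forward network (FFN) blocks of modern Transformers that can be flexibly offloaded from or merged into the model parameters, enabling parameter-efficient and computation-efficient continual learning. The key innovation is a low-rank, sideway-FFN-style memory module that is initialized in a principled way by reusing model parameters, activations, and/or gradients, which speeds up convergence, improves generalization, and helps prevent forgetting. Experiments show that, with as little as 0.02% additional parameters, Locas-GLU can store past information while keeping a much smaller context window; the mechanism is validated on PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering.
Link: https://arxiv.org/abs/2602.05085
Authors: Sidi Lu,Zhenwen Liang,Dongyang Ma,Yan Wang,Haitao Mi,Dong Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Tencent AI Lab Technical Report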
Abstract:In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other one shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories – performed in a principled way by reusing model parameters, activations and/or gradients – is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model’s general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model’s existing internal knowledge.
[NLP-84] StagePilot: A Deep Reinforcement Learning Agent for Stage-Controlled Cybergrooming Simulation
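[Quick read]: This paper addresses the persistent threat that cybergrooming poses to young people and proposes proactive educational intervention to strengthen their ability to protect themselves. The key to the solution is StagePilot, a dialogue agent based on offline reinforcement learning (offline RL) that simulates the stage-wise progression of grooming behavior for use in prevention training. Its core design is a composite reward that accounts for both user sentiment and goal proximity, with state transitions restricted to adjacent stages to keep dialogues realistic and interpretable. Experiments show that the method generates coherent conversations consistent with grooming dynamics; the IQL+AWAC agent strikes the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more often than baselines while maintaining over 70% sentiment alignment.
Link: https://arxiv.org/abs/2602.05060
Authors: Heajun An,Qi Zhang,Minqian Liu,Xinyi Zhang,Sang Won Lee,Lifu Huang,Pamela J. Wisniewski,Jin-Hee Cho
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: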
Abstract:Cybergrooming is an evolving threat to youth, necessitating proactive educational interventions. We propose StagePilot, an offline RL-based dialogue agent that simulates the stage-wise progression of grooming behaviors for prevention training. StagePilot selects conversational stages using a composite reward that balances user sentiment and goal proximity, with transitions constrained to adjacent stages for realism and interpretability. We evaluate StagePilot through LLM-based simulations, measuring stage completion, dialogue efficiency, and emotional engagement. Results show that StagePilot generates realistic and coherent conversations aligned with grooming dynamics. Among tested methods, the IQL+AWAC agent achieves the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.
[NLP-85] VEXA: Evidence-Grounded and Persona-Adaptive Explanations for Scam Risk Sensemaking
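[Quick read]: This paper addresses the challenge that increasingly sophisticated online scams (fraud via email, SMS, and social media), now aided by generative-AI-driven deception, pose to ordinary users' risk judgment, with particular attention to the fact that existing Transformer-based detectors predict well but explain poorly. The key to the solution is the VEXA framework, which combines GradientSHAP attribution (a gradient-based Shapley value method) with theory-driven vulnerability personas to produce explanations that are evidence-grounded and persona-adaptive. The study shows that grounding explanations in detector-derived evidence improves their semantic reliability without increasing linguistic complexity, while persona adaptation introduces interpretable stylistic variation without sacrificing faithfulness. The core design insight is that evidential grounding determines semantic correctness, whereas persona adaptation governs presentation.
Link: https://arxiv.org/abs/2602.05056
Authors: Heajun An,Connor Ng,Sandesh Sharma Dulal,Junghwan Kim,Jin-Hee Cho
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: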
Abstract:Online scams across email, short message services, and social media increasingly challenge everyday risk assessment, particularly as generative AI enables more fluent and context-aware deception. Although transformer-based detectors achieve strong predictive performance, their explanations are often opaque to non-experts or misaligned with model decisions. We propose VEXA, an evidence-grounded and persona-adaptive framework for generating learner-facing scam explanations by integrating GradientSHAP-based attribution with theory-informed vulnerability personas. Evaluation across multi-channel datasets shows that grounding explanations in detector-derived evidence improves semantic reliability without increasing linguistic complexity, while persona conditioning introduces interpretable stylistic variation without disrupting evidential alignment. These results reveal a key design insight: evidential grounding governs semantic correctness, whereas persona-based adaptation operates at the level of presentation under constraints of faithfulness. Together, VEXA demonstrates the feasibility of persona-adaptive, evidence-grounded explanations and provides design guidance for trustworthy, learner-facing security explanations in non-formal contexts.
[NLP-86] Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation
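[Quick read]: This paper investigates why multilingual language models (LMs) underperform monolingual models on lexical disambiguation, aiming to uncover the capacity limitations behind this "multilingual penalty". The approach uses controlled datasets of human relatedness judgments to quantify performance on ambiguous words in English and Spanish, and systematically evaluates three possible capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations, and these factors statistically account for the performance variance previously attributed to multilingual status, suggesting that the penalty stems from several quantifiable capacity bottlenecks.
Link: https://arxiv.org/abs/2602.05035
Authors: Sean Trott,Pamela D. Rivière
Affiliations: Rutgers University - Newark
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures, conference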
Abstract:Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation (a task requiring precise semantic representations and contextualization mechanisms) using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.
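[NLP-87] DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search
[Quick read]: This paper addresses the limited efficiency and accuracy of current LLM-based retrieval-augmented generation (RAG) systems on long documents: existing agentic search frameworks treat a long document as a flat collection of text chunks and fail to exploit priors such as the document's hierarchical structure and discourse order. The key to the solution is DeepRead, a structure-aware, multi-turn document reading agent. It uses an LLM-based optical character recognition (OCR) model to convert PDFs into structured Markdown that preserves heading and paragraph boundaries, and tags each paragraph with coordinate-style metadata recording its section identity and in-section order. On top of this representation DeepRead provides two complementary tools: a Retrieve tool that locates relevant paragraphs and exposes their structural coordinates (with lightweight scanning context), and a ReadSection tool that reads paragraphs contiguously and in order within a specified range. Together they enable fine-grained, human-like "locate then read" reasoning and significantly improve long-document question answering.
Link: https://arxiv.org/abs/2602.05014
Authors: Zhanli Li,Huiwen Tian,Lvzhou Luo,Yixuan Cao,Ping Luo
Affiliations: Chinese Academy of Sciences; Institute of Computing Technology, CAS; University of Chinese Academy of Sciences; State Key Lab of AI Safety
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: working in progress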
Abstract:With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like "locate then read" behavior.
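The paragraph-level indexing and the two tools described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeepRead's implementation: the `Paragraph` record, the function names `index_markdown`, `retrieve`, and `read_section`, and the word-overlap scoring are all hypothetical, and the input is assumed to be Markdown whose headings and paragraphs are separated by blank lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    section: str  # section identity, taken from the nearest preceding heading
    order: int    # zero-based position of the paragraph inside its section
    text: str


def index_markdown(markdown):
    """Split Markdown into paragraphs tagged with (section, order) coordinates."""
    paragraphs, section, order = [], "ROOT", 0
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # A heading opens a new section and resets the in-section counter.
            section, order = block.lstrip("#").strip(), 0
            continue
        paragraphs.append(Paragraph(section, order, block))
        order += 1
    return paragraphs


def retrieve(paragraphs, query, top_k=2):
    """Toy lexical retriever (assumption): rank by number of shared words."""
    terms = set(query.lower().split())
    ranked = sorted(
        paragraphs,
        key=lambda p: len(terms & set(p.text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def read_section(paragraphs, section, start, end):
    """Return paragraphs start..end (inclusive) of one section, in document order."""
    return [p for p in paragraphs if p.section == section and start <= p.order <= end]
```

A caller would first use `retrieve` to obtain a paragraph's `(section, order)` coordinate and then pass a range around it to `read_section`, which mirrors the locate-then-read pattern the abstract describes.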
[NLP-88] Enhanced QKNorm normalization for neural transformers with the Lp norm
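[Quick read]: This paper addresses the normalization of query and key vectors in the Transformer architecture, which keeps learning stable as the scale of these vectors changes. Conventional approaches normalize with the L2 (Euclidean) norm; this paper proposes a generalized QKNorm scheme based on the Lp norm, which permits non-Euclidean norms and makes the normalization strategy more flexible. The key contribution is extending the normalization from one specific norm to any Lp norm; experiments indicate that the method is suitable for a simple problem.
Link: https://arxiv.org/abs/2602.05006
Authors: Ezequiel Lopez-Rubio,Javier Montes-Perez,Esteban Jose Palomo
Affiliations: University of Málaga; ITIS Software
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: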
Abstract:The normalization of query and key vectors is an essential part of the Transformer architecture. It ensures that learning is stable regardless of the scale of these vectors. Some normalization approaches are available. In this preliminary work, a generalization of the QKNorm normalization scheme is proposed. The approach is based on the Lp norm, allowing non-Euclidean norms to be employed. Experimental results demonstrate the suitability of the method for a simple problem.
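To make the idea concrete, here is a minimal pure-Python sketch of attention weights computed from Lp-normalized queries and keys. It assumes the usual QKNorm recipe (normalize each query and key, then scale the dot products before the softmax) with the L2 norm swapped for a general Lp norm; the function names and the fixed `scale` constant, which stands in for a learnable temperature, are illustrative assumptions rather than the paper's formulation.

```python
import math


def lp_normalize(vec, p=2.0, eps=1e-12):
    """Rescale vec so that its Lp norm is (approximately) 1, for p >= 1."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / (norm + eps) for x in vec]


def qk_lp_attention_weights(queries, keys, p=2.0, scale=10.0):
    """Softmax attention weights from Lp-normalized queries and keys.

    `scale` is a fixed constant here (assumption); QKNorm-style schemes
    typically learn it.
    """
    q_hat = [lp_normalize(q, p) for q in queries]
    k_hat = [lp_normalize(k, p) for k in keys]
    weights = []
    for q in q_hat:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Because both operands are normalized, multiplying a query or key by a positive constant leaves the attention weights unchanged, which is the scale-independence property the abstract refers to; setting `p=2.0` recovers the Euclidean case.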
[NLP-89] CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System
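[Quick read]: This paper addresses two simultaneous challenges that language-conditioned agents face on highly cooperative tasks in interactive environments: sub-second real-time coordination to stay responsive, and sustained multi-episode adaptation under a strict online token budget. Existing methods either rely on frequent in-episode reasoning, which introduces latency and timing jitter, or make post-episode improvements through unstructured text that is hard to turn into reliable, low-cost execution. The key to the solution is CoWork-X, a framework that casts peer collaboration as a closed-loop optimization problem across episodes and is inspired by fast-slow memory separation. It has two core components: a Skill-Agent that retrieves and executes actions from a structured, interpretable, and compositional skill library via a hierarchical task network (HTN), and a post-episode Co-Optimizer that consolidates skills in a patch style under explicit budget constraints and drift regularization. Experiments show that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
Link: https://arxiv.org/abs/2602.05004
Authors: Zexin Lin,Jiachen Yu,Haoyang Zhang,Yuzhao Li,Zhonghang Li,Yujiu Yang,Junjie Wang,Xiaoqiang Ji
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: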
Abstract:Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast–slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like realtime collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
[NLP-90] EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models
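[Quick read]: This paper addresses the poor performance of reward guidance when adapting discrete diffusion language models at test time. Because discrete tokens are not differentiable, existing methods either replace the discrete outputs with continuous relaxations or use the straight-through estimator. The former degrades gradient feedback because the reward model was never trained on continuous inputs; the latter introduces an optimization bias because gradients evaluated at discrete tokens are used to update continuous logits. The key innovation is EntRGi (Entropy-aware Reward Guidance), a mechanism that dynamically regulates the gradients from the reward model by using the model's confidence to modulate the continuous relaxation, substantially improving reward guidance while still giving the reward model reliable inputs.
Link: https://arxiv.org/abs/2602.05000
Authors: Atula Tejaswi,Litu Rout,Constantine Caramanis,Sanjay Shakkottai,Sujay Sanghavi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint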
Abstract:Reward guidance has been applied to great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both these methods. The former degrades gradient feedback because the reward model has never been trained with continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model’s confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
[NLP-91] Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
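[Quick read]: This paper examines whether the performance differences reported among Low-Rank Adaptation (LoRA) methods are reproducible: improved LoRA variants do well under particular hyperparameter settings, but it is unclear whether their advantage generalizes. Through a systematic hyperparameter search, the paper re-evaluates four representative LoRA variants against vanilla LoRA across model scales and tasks (mathematical reasoning and code generation). The key finding is that the optimal learning-rate range differs markedly between methods, but once the learning rate is properly tuned all methods reach similar peak performance (within 1-2%), with only subtle rank-dependent differences. This suggests that vanilla LoRA remains a strong baseline and that previously reported gains may stem from specific training configurations rather than inherent methodological advantages. A second-order analysis further attributes the differing optimal learning rates mainly to variation in the largest Hessian eigenvalue, in line with classical learning theory.
Link: https://arxiv.org/abs/2602.04998
Authors: Yu-Ang Lee,Ching-Yun Ko,Pin-Yu Chen,Mi-Yen Yeh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: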
Abstract:Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies and architectural modifications, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate four representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches. Across mathematical and code generation tasks on diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
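The link between the largest Hessian eigenvalue and the usable learning rate can be illustrated on a toy quadratic, where the classical result is that plain gradient descent on f(x) = 0.5 xᵀHx converges only when the learning rate is below 2/λ_max. The sketch below is an illustration of that textbook bound only, not the paper's second-order analysis; the helper names are hypothetical, and the power iteration assumes a symmetric positive semi-definite matrix whose top eigenvector is not orthogonal to the all-ones start vector.

```python
import math


def matvec(matrix, vec):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]


def largest_eigenvalue(hessian, iters=200):
    """Power iteration for the top eigenvalue of a symmetric PSD matrix."""
    n = len(hessian)
    v = [1.0 / math.sqrt(n)] * n  # unit-norm start vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(hessian, v)
        lam = math.sqrt(sum(x * x for x in w))  # ||H v|| converges to lambda_max
        v = [x / lam for x in w]
    return lam


def gd_final_norm(hessian, lr, steps=100):
    """Run gradient descent on f(x) = 0.5 * x^T H x starting from x = (1, ..., 1)."""
    x = [1.0] * len(hessian)
    for _ in range(steps):
        grad = matvec(hessian, x)  # the gradient of the quadratic is H x
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return math.sqrt(sum(xi * xi for xi in x))
```

For `H = [[4.0, 0.0], [0.0, 1.0]]` the estimated top eigenvalue is about 4, so the stability threshold is about 0.5: a learning rate just under it shrinks the iterate, while one just over it makes the iterate grow. A method whose loss surface has a larger top eigenvalue would, by the same argument, need a smaller learning rate.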
[NLP-92] BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations
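[Quick read]: This paper addresses the difficulty of evaluating the quality of answers, and of the citations supporting them, that generative AI produces for biomedical questions. Because the biomedical domain demands expert knowledge and terminology, conventional automated evaluation struggles to verify that generated content is consistent with the scientific literature, so a quantifiable, multi-dimensional evaluation framework is needed. The key to the solution is the BioACE framework, which automatically assesses answer completeness, correctness, precision, and recall against ground-truth nuggets, and systematically evaluates the quality of the cited evidence using techniques such as natural language inference (NLI) and pre-trained language models, giving a comprehensive and objective evaluation of biomedical question answering output.
Link: https://arxiv.org/abs/2602.04982
Authors: Deepak Gupta,Davis Bartels,Dina Demner-Fushman
Affiliations: National Library of Medicine
Categories: Computation and Language (cs.CL)
Comments: Work in progress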
Abstract:With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, due to the requirements of expert assessment to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI) and pre-trained language models and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as a part of BioACE (this https URL) evaluation package.
[NLP-93] Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
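[Quick read]: This paper addresses Data Mixture Optimization (DMO) in supervised fine-tuning (SFT) of multimodal large language models (MLLMs): how to efficiently determine the best mixture weights across several domain-specific datasets. Because the search space is combinatorial and every training run is expensive, conventional methods struggle to explore mixtures effectively. The key idea is to use model merging as a proxy for estimating the effect of different data mixtures: train one domain-specific expert per domain, combine the experts with weights in parameter space, and use the merged model to quickly estimate the performance of the corresponding data mixture. Experiments on 14 multimodal benchmarks show that the merged proxy models have a high rank correlation with models actually trained on the mixtures, which greatly reduces resource consumption and makes the search over mixture weights efficient.
Link: https://arxiv.org/abs/2602.04937
Authors: Davide Berasi,Matteo Farina,Massimiliano Mancini,Elisa Ricci
Affiliations: University of Trento; Fondazione Bruno Kessler
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint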
Abstract:Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at this https URL.
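The proxy idea, weighted parameter-space combinations of domain experts standing in for models trained on the corresponding data mixtures, can be sketched as below. This is a simplified illustration under stated assumptions: parameters are plain dictionaries of float lists, the merge is a normalized linear interpolation, and `evaluate` is a caller-supplied scoring function; none of these names come from the paper's code.

```python
def merge_experts(experts, weights):
    """Linearly interpolate expert parameters using normalized mixture weights."""
    if len(experts) != len(weights):
        raise ValueError("need exactly one weight per expert")
    total = sum(weights)
    norm = [w / total for w in weights]  # weights are rescaled to sum to 1
    merged = {}
    for name in experts[0]:
        size = len(experts[0][name])
        merged[name] = [
            sum(w * expert[name][i] for w, expert in zip(norm, experts))
            for i in range(size)
        ]
    return merged


def rank_mixtures(experts, candidate_weights, evaluate):
    """Score each candidate mixture via its merged proxy model, best first."""
    scored = [(evaluate(merge_experts(experts, w)), w) for w in candidate_weights]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Each expert is trained once, after which every candidate weight vector costs only one merge and one evaluation; the top-ranked weights would then be the ones worth validating with a real training run on the corresponding data mixture.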
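[NLP-94] Pruning Minimal Reasoning Graphs for Efficient Retrieval-Augmented Generation
[Quick read]: This paper addresses the inefficiency of current retrieval-augmented generation (RAG) systems on knowledge-intensive large language model (LLM) tasks: most systems treat every query as independent, repeatedly retrieving long passages and reasoning from scratch, which inflates token consumption, latency, and cost. The key to the solution is AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier queries and extends it incrementally for later ones. It stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, so retrieval and prompting operate over symbolic structure rather than raw text. It also applies a two-layer consolidation policy (fast ANN/KNN alias detection, plus selective k-means once a memory threshold is reached) and prunes low-value structure, sharply reducing token usage while preserving reasoning accuracy. On complex reasoning tasks it reaches state-of-the-art accuracy while using up to two orders of magnitude fewer tokens than graph-heavy baselines.
Link: https://arxiv.org/abs/2602.04926
Authors: Ning Wang,Kuanyan Zhu,Daniel Yuehwoon Yee,Yitang Gao,Shiying Huang,Zirun Xu,Sainyam Galhotra
Affiliations: Cornell University; University of Cambridge; The University of Hong Kong; HKUST; University of British Columbia
Categories: Databases (cs.DB); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: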
Abstract:Retrieval-augmented generation (RAG) is now standard for knowledge-intensive LLM tasks, but most systems still treat every query as fresh, repeatedly re-retrieving long passages and re-reasoning from scratch, inflating tokens, latency, and cost. We present AutoPrunedRetriever, a graph-style RAG system that persists the minimal reasoning subgraph built for earlier questions and incrementally extends it for later ones. AutoPrunedRetriever stores entities and relations in a compact, ID-indexed codebook and represents questions, facts, and answers as edge sequences, enabling retrieval and prompting over symbolic structure instead of raw text. To keep the graph compact, we apply a two-layer consolidation policy (fast ANN/KNN alias detection plus selective k-means once a memory threshold is reached) and prune low-value structure, while prompts retain only overlap representatives and genuinely new evidence. We instantiate two front ends: AutoPrunedRetriever-REBEL, which uses REBEL as a triplet parser, and AutoPrunedRetriever-llm, which swaps in an LLM extractor. On GraphRAG-Benchmark (Medical and Novel), both variants achieve state-of-the-art complex reasoning accuracy, improving over HippoRAG2 by roughly 9–11 points, and remain competitive on contextual summarize and generation. On our harder STEM and TV benchmarks, AutoPrunedRetriever again ranks first, while using up to two orders of magnitude fewer tokens than graph-heavy baselines, making it a practical substrate for long-running sessions, evolving corpora, and multi-agent pipelines.
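[NLP-95] Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
[Quick read]: This paper addresses the failure of existing activation steering methods on complex reasoning tasks, where their reliance on static control vectors prevents them from adapting to the non-stationary evolution of hidden states. The key to the solution is STIR (Self-Distilled Tools for Internal Reasoning), a framework that recasts reasoning enhancement as a dynamic latent trajectory control problem and implements it as a three-stage pipeline: (1) differential intrinsic action induction extracts steering primitives from latent reasoning successes; (2) sparse control basis construction builds a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses through anchor-based gating. This internalizes the reasoning process with high fidelity without explicitly generating a chain of thought.
Link: https://arxiv.org/abs/2602.04925
Authors: Zhenning Shi,Yijia Zhu,Junhan Shi,Xun Zhang,Lei Wang,Congcong Miao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: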
Abstract:The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at this https URL.
[NLP-96] Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution
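[Quick read]: This paper investigates the mechanism behind sycophancy or compliance, the tendency of large language models (LLMs) to adopt conflicting in-context information over their own pre-trained knowledge. An earlier hypothesis, "Manifold Dilution", attributes the behavior to weakened signal magnitude in the residual stream. Through a layer-wise geometric analysis, the paper finds that the behavior is not universally explained by norm decay (two of the three architectures kept stable residual norms) but by a directional geometric perturbation: the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, rotating the hidden-state representation so that it bypasses the correct unembedding vector and produces surface-level acceptance of the incorrect information. The key contribution is identifying this "Orthogonal Interference" mechanism, which shows that the model does not delete its existing knowledge but complies through a directional shift in vector space; detecting hallucinations therefore calls for vector-level monitoring rather than conventional scalar confidence metrics.
Link: https://arxiv.org/abs/2602.04918
Authors: Long Zhang,Fangwei Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: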
Abstract:Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure, specifically how the model resolves these knowledge conflicts through compliance, and whether this suppression arises from signal magnitude dilution or directional geometric alteration within the residual stream. To resolve this, we conducted a layer-wise geometric analysis across Qwen-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the “Manifold Dilution” hypothesis, as two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by “Orthogonal Interference,” where the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, effectively rotating the hidden state representation. This suggests that models do not “unlearn” or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement to bypass the correct unembedding vector, effectively simulating adoption while preserving the original structural magnitude. These findings challenge scalar confidence metrics for detecting hallucinations and underscore the necessity of vectorial monitoring to distinguish between genuine knowledge integration and superficial in-context mimicry.
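The radial versus angular decomposition mentioned in this abstract can be illustrated on toy vectors. The sketch below compares a hidden state without and with a conflicting context and reports a norm ratio (radial change), the cosine between the two states (angular change), and the cosine between the update and a ground-truth direction. The function and key names are hypothetical and the three-dimensional vectors are made up for illustration; the paper's exact per-layer metrics are not given in this listing.

```python
import math


def norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def decompose_update(h_clean, h_conflict, truth_dir):
    """Summarize the shift h_clean -> h_conflict in radial and angular terms."""
    delta = [b - a for a, b in zip(h_clean, h_conflict)]
    return {
        "norm_ratio": norm(h_conflict) / norm(h_clean),    # radial component
        "cos_state": cosine(h_clean, h_conflict),          # rotation of the state
        "cos_delta_truth": cosine(delta, truth_dir),       # update vs. truth direction
    }


# Toy example: the update leaves the norm unchanged and is orthogonal to the
# ground-truth direction, yet the hidden state still rotates by 60 degrees.
report = decompose_update(
    h_clean=[1.0, 1.0, 0.0],
    h_conflict=[1.0, 0.0, 1.0],
    truth_dir=[1.0, 0.0, 0.0],
)
```

In this toy case `norm_ratio` is 1 while `cos_state` is 0.5 and `cos_delta_truth` is 0, the pattern the abstract labels orthogonal interference: a magnitude-only monitor would see nothing change even though the representation has moved.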
[NLP-97] Atomic Information Flow: A Network Flow Model for Tool Attributions in RAG Systems
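[Quick read]: This paper addresses the lack of precise attribution in tool-based Retrieval Augmented Generation (RAG) systems, where, especially in complex multi-agent architectures, it is hard to trace a final response back to specific tool components. The core of the solution is Atomic Information Flow (AIF), a graph-based network flow model that decomposes tool outputs and large language model (LLM) calls into indivisible information atoms and builds a directed flow network from tool and LLM nodes to a response super-sink, enabling fine-grained tracing and attribution of information flow. Building on the max-flow min-cut theorem, the authors train a lightweight Gemma3-4B language model as a context compressor that approximates the minimum cut using flow signals computed offline by AIF, improving accuracy while compressing context and substantially narrowing the gap to the larger Gemma3-27B model.
Link: https://arxiv.org/abs/2602.04912
Authors: James Gao,Josh Zhou,Qi Sun,Ryan Huang,Steven Yoo
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: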
Abstract:Many tool-based Retrieval Augmented Generation (RAG) systems lack precise mechanisms for tracing final responses back to specific tool components – a critical gap as systems scale to complex multi-agent architectures. We present Atomic Information Flow (AIF), a graph-based network flow model that decomposes tool outputs and LLM calls into atoms: indivisible, self-contained units of information. By modeling LLM orchestration as a directed flow of atoms from tool and LLM nodes to a response super-sink, AIF enables granular attribution metrics for AI explainability. Motivated by the max-flow min-cut theorem in network flow theory, we train a lightweight Gemma3 (4B parameter) language model as a context compressor to approximate the minimum cut of tool atoms using flow signals computed offline by AIF. We note that the base Gemma3-4B model struggles to identify critical information with 54.7% accuracy on HotpotQA, barely outperforming lexical baselines (BM25). However, post-training on AIF signals boosts accuracy to 82.71% (+28.01 points) while achieving 87.52% (+1.85%) context token compression – bridging the gap with the Gemma3-27B variant, a model nearly 7× larger.
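The max-flow min-cut theorem that motivates the compressor can be shown on a small graph. The sketch below is a generic Edmonds-Karp implementation, not AIF itself; the node names (`src`, `tool_a`, `tool_b`, `llm`, `response`) and the capacities are hypothetical.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """Edmonds-Karp max flow; also returns the saturated edges crossing the min cut."""
    # Residual capacities, with reverse edges initialised to 0.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual.setdefault(v, {}).setdefault(u, 0)

    def bfs_parents():
        # Breadth-first search over edges that still have residual capacity.
        parents = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    flow = 0
    while True:
        parents = bfs_parents()
        if sink not in parents:
            break
        # Find the bottleneck along the augmenting path, then push flow.
        bottleneck, v = float("inf"), sink
        while parents[v] is not None:
            bottleneck = min(bottleneck, residual[parents[v]][v])
            v = parents[v]
        v = sink
        while parents[v] is not None:
            u = parents[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from the source form the source side of the cut.
    reachable = set(parents)
    cut = sorted((u, v) for u, edges in capacity.items() for v in edges
                 if u in reachable and v not in reachable)
    return flow, cut

# Hypothetical orchestration graph: two tools feed one LLM call, which feeds the response.
capacity = {
    "src": {"tool_a": 3, "tool_b": 2},
    "tool_a": {"llm": 2},
    "tool_b": {"llm": 2},
    "llm": {"response": 3},
}
flow, cut = max_flow_min_cut(capacity, "src", "response")
print(flow, cut)
```

The total capacity of the returned cut edges equals the maximum flow, which is the property the theorem guarantees.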
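[NLP-98] AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design ICLR2026
[Quick Read]: This paper addresses the limited ability of Large Language Models (LLMs) to understand and design antibodies through natural language. Although LLMs have made significant progress in protein representation learning, they struggle to interpret antibody sequences or to generate sequences with specified functions from natural-language instructions. The key to the solution is AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. It has two core components: antibody understanding (inferring functional attributes directly from sequences) and antibody design (de novo sequence generation under functional constraints). Together they provide explicit alignment between antibody sequences and functional descriptions and support antibody design driven by natural-language instructions, significantly improving model performance across a range of antibody-related tasks.
Link: https://arxiv.org/abs/2602.04916
Authors: Ling Luo,Wenbin Jiang,Xushi Zhang,Hongyuan Chang,Xinkang Wang,Yueting Xiong,Mengsha Tong,Rongshan Yu
Affiliations: National Institute for Data Science in Health and Medicine, Xiamen University; Institute of Artificial Intelligence, Xiamen University; School of Life Sciences, Xiamen University; School of Informatics, Xiamen University; State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University; Xiang An Biomedicine Laboratory
Categories: Quantitative Methods (q-bio.QM); Computation and Language (cs.CL)
Comments: Accepted by ICLR 2026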
Abstract:Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.
Computer Vision
[CV-0] Shared LoRA Subspaces for almost Strict Continual Learning
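[Quick Read]: This paper addresses the continual learning challenges that large pretrained models face in real-world deployment, in particular the inefficiency caused by catastrophic forgetting and the high cost of retraining. Existing parameter-efficient fine-tuning methods such as Low Rank Adaptation (LoRA) reduce computational cost but lack mechanisms for strict continual learning and knowledge integration, and they typically rely on data replay or multiple adapters. The key to the solution is Share, a method that learns and dynamically updates a single shared low-rank subspace, extracts core knowledge from past tasks, and incrementally integrates new-task information by identifying essential subspace directions, achieving forward knowledge transfer while minimizing interference. This lets the model adapt across tasks and modalities without data replay or multiple adapters, with up to 100x fewer parameters and 281x memory savings compared with traditional LoRA, performance close to jointly trained models, and support for scalable, asynchronous continual learning.
Link: https://arxiv.org/abs/2602.06043
Authors: Prakhar Kaushik,Ankit Vaidya,Shravan Chaudhari,Rama Chellappa,Alan Yuille
Affiliations: Johns Hopkins University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: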
Abstract:Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration, without relying on data replay, or multiple adapters. We propose Share, a novel approach to parameter efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer, while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
[CV-1] Pseudo-Invertible Neural Networks
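[Quick Read]: This paper addresses the lack of an effective, tractable pseudo-inverse (PInv) for non-linear inverse problems, in particular for zero-shot inversion of complex non-linear degradations within a neural network framework. The core of the solution is Surjective Pseudo-invertible Neural Networks (SPNN), a class of architectures explicitly designed to admit a tractable non-linear pseudo-inverse, together with Non-Linear Back-Projection (NLBP), which moves a sample to the closest consistent state $x'$ satisfying the constraint $f(x') = y$, keeping outputs physically or semantically plausible. The key innovation is generalizing null-space projection from the linear case to the non-linear setting and realizing NLBP in closed form through SPNN, which extends diffusion-based zero-shot inverse problem solving to non-linear degradations and enables precise semantic control without retraining the generative prior.
Link: https://arxiv.org/abs/2602.06042
Authors: Yamit Ehrlich,Nimrod Berman,Assaf Shocher
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: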
Abstract:The Moore-Penrose Pseudo-inverse (PInv) serves as the fundamental solution for linear systems. In this paper, we propose a natural generalization of PInv to the nonlinear regime in general and to neural networks in particular. We introduce Surjective Pseudo-invertible Neural Networks (SPNN), a class of architectures explicitly designed to admit a tractable non-linear PInv. The proposed non-linear PInv and its implementation in SPNN satisfy fundamental geometric properties. One such property is null-space projection or “Back-Projection”, $x' = x + A^\dagger (y - Ax)$, which moves a sample $x$ to its closest consistent state $x'$ satisfying $Ax' = y$. We formalize Non-Linear Back-Projection (NLBP), a method that guarantees the same consistency constraint for non-linear mappings $f(x) = y$ via our defined PInv. We leverage SPNNs to expand the scope of zero-shot inverse problems. Diffusion-based null-space projection has revolutionized zero-shot solving for linear inverse problems by exploiting closed-form back-projection. We extend this method to non-linear degradations. Here, “degradation” is broadly generalized to include any non-linear loss of information, spanning from optical distortions to semantic abstractions like classification. This approach enables zero-shot inversion of complex degradations and allows precise semantic control over generative outputs without retraining the diffusion prior.
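The linear back-projection step that the paper generalizes can be checked numerically. The following is a minimal sketch of the linear case only, using NumPy's `np.linalg.pinv`; the matrix sizes and random data are arbitrary assumptions, and nothing here reproduces SPNN or NLBP.

```python
import numpy as np

def back_project(A, x, y):
    """Linear back-projection: x' = x + A^+ (y - A x)."""
    return x + np.linalg.pinv(A) @ (y - A @ x)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))   # wide matrix: many x are consistent with a given y
x = rng.standard_normal(6)        # arbitrary starting sample
y = rng.standard_normal(3)        # measurement to be matched

x_proj = back_project(A, x, y)
residual = np.linalg.norm(A @ x_proj - y)
print(residual)
```

Because the correction `x_proj - x` lies in the row space of `A`, the component of `x` in the null space of `A` is left unchanged, which is why `x_proj` is the closest consistent point to `x`.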
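[CV-2] Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning
[Quick Read]: This paper addresses the challenge of multi-image spatial reasoning in current Multimodal Large Language Models (MLLMs), specifically perspective taking: building a coherent 3D scene understanding from multiple views and reasoning from a new viewpoint. The core difficulty is turning 2D observations from different views into a 3D structure that is semantically aligned across views, and then reasoning accurately from a target viewpoint described in natural language. The key to the solution is CAMCUE, a framework that uses camera pose as an explicit geometric anchor: it fuses each image's visual tokens with the corresponding pose to align information across views, maps natural-language viewpoint descriptions to a target camera pose, and generates a pose-conditioned imagined view to support question answering. This avoids the costly test-time search-and-match used by prior approaches, cutting inference time from 256.6 s to 1.45 s per example, and achieves accurate pose prediction (over 90% rotation accuracy within 20°, and high translation accuracy within a 0.5 error threshold).
Link: https://arxiv.org/abs/2602.06041
Authors: Xuejun Zhang,Aditi Tiwari,Zhenhailong Wang,Heng Ji
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: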
Abstract:Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
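[CV-3] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
[Quick Read]: This paper addresses the limited performance of current Multimodal Large Language Models (MLLMs) on vision-intensive tasks caused by reliance on a fixed textual Chain-of-Thought (CoT), while avoiding the drawback of existing methods that inject a fixed number of continuous hidden states as “visual thoughts”: they improve visual understanding but degrade text-based logical reasoning. The key to the solution is SwimBird, a reasoning-switchable MLLM that dynamically selects among three reasoning modes based on the input: text-only reasoning, vision-only reasoning (continuous hidden states as visual thoughts), and interleaved vision-text reasoning. To this end, the authors design a hybrid autoregressive framework that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and build SwimBird-SFT-92K, a supervised fine-tuning dataset covering all three modes. The result is a reasoning strategy that adapts to task demands, preserving strong textual logic while substantially improving performance on vision-dense tasks.
Link: https://arxiv.org/abs/2602.06040
Authors: Jintao Tong,Shilin Yan,Hongwei Xue,Xiaojun Tang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou
Affiliations: Huazhong University of Science and Technology; Accio Team, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL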
Abstract:Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as “visual thoughts” into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
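[CV-4] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
[Quick Read]: This paper addresses the semantic-geometry misalignment and redundant signals that arise in spatial reasoning with Multimodal Large Language Models (MLLMs) when geometric priors are fused passively. Existing methods typically treat geometric information from 3D encoders as a global stream and fuse it indiscriminately, with no selective perception of task-relevant spatial structure. The key to the solution is GeoThinker, a framework that shifts from passive fusion to active perception: Spatial-Grounded Fusion applied at specific layers of the vision-language model (VLM) lets the model selectively retrieve geometric evidence according to its internal reasoning demands, and Importance Gating biases frame-level cross-attention toward task-relevant structures, giving more precise integration of spatial information. This strategy significantly improves spatial intelligence, reaching a peak score of 72.6 on VSI-Bench, and generalizes well to complex downstream scenarios such as embodied referring and autonomous driving.
Link: https://arxiv.org/abs/2602.06037
Authors: Haoyuan Li,Qihang Cao,Tao Tang,Kun Xiang,Zihan Guo,Jianhua Han,Hang Xu,Xiaodan Liang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: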
Abstract:Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at this https URL.
[CV-5] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
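[Quick Read]: This paper addresses how humanoid robots can acquire generalizable whole-body loco-manipulation skills in complex human-object interaction scenarios. Existing methods struggle to learn physically coherent generative controllers from large-scale human demonstrations and do not generalize reliably to unseen goals or initial states. The key to the solution is the InterPrior framework: it first uses large-scale imitation learning to distill a full-reference expert policy into a goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent; it then applies data augmentation with physical perturbations and reinforcement learning fine-tuning so that the skill distribution in latent space consolidates into a valid manifold, significantly improving generalization to unseen tasks and environments and supporting extended behaviors such as interaction with new objects.
Link: https://arxiv.org/abs/2602.06035
Authors: Sirui Xu,Samuel Schulter,Morteza Ziyadi,Xialin He,Xiaohan Fei,Yu-Xiong Wang,Liangyan Gui
Affiliations: University of Illinois Urbana-Champaign; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Comments: Webpage: this https URL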
Abstract:Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
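[CV-6] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
[Quick Read]: This paper addresses the reliance of current Multimodal Large Language Models (MLLMs) on static visual encodings in universal multimodal retrieval and their inability to actively verify fine-grained visual evidence, which leads to speculative reasoning in visually ambiguous cases. The key to the solution is V-Retrver, a framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection: the MLLM selectively acquires visual evidence through external visual tools during reasoning and performs multimodal interleaved reasoning that alternates between hypothesis generation and targeted visual verification. The authors train this evidence-gathering retrieval agent with a curriculum-based strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective, improving retrieval accuracy (by 23.0% on average), the reliability of perception-driven reasoning, and generalization.
Link: https://arxiv.org/abs/2602.06034
Authors: Dongyang Chen,Chaoyang Wang,Dezhao SU,Xi Xiao,Zeyu Zhang,Jing Xiong,Qing Li,Yuzhang Shang,Shichao Ka
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: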
Abstract:Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual this http URL train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
[CV-7] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation ICLR2026
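[Quick Read]: This paper addresses the lack of 3D awareness that Vision Foundation Models (VFMs) commonly exhibit when applied to 2D downstream tasks. Existing methods usually rely on slow per-scene optimization, which models geometric consistency poorly and introduces feature-averaging artifacts. The key to the solution is the “Splat and Distill” framework: a feed-forward 3D reconstruction pipeline first lifts the teacher model's 2D features into an explicit 3D Gaussian representation; these 3D features are then splatted onto novel viewpoints to produce new 2D feature maps that supervise the student model, distilling geometrically grounded knowledge. By dropping the slow optimization pipeline, the method lets the teacher's geometric consistency improve alongside the student's, delivers gains on monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation, and also enriches the semantics of 2D features.
Link: https://arxiv.org/abs/2602.06032
Authors: David Shavin,Sagie Benaim
Affiliations: The Hebrew University of Jerusalem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026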
Abstract:Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feedforward manner. These 3D features are then “splatted” onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, “distilling” geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher’s consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at this https URL
[CV-8] Context Forcing: Consistent Autoregressive Video Generation with Long Context
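[Quick Read]: This paper addresses the poor long-term consistency in real-time long video generation caused by a structural mismatch between teacher and student. Existing methods typically supervise a long-context student with a short-context (memoryless) teacher; because the teacher cannot access global temporal dependencies, the student's context length and generation stability are limited. The key to the solution is Context Forcing, a framework that introduces a long-context teacher aware of the full generation history, removing the mismatch between the supervision signal and the student's needs and enabling more stable training for long video generation. To handle the computational cost of extreme durations (e.g., 2 minutes), the authors further design a Slow-Fast Memory architecture that turns the linearly growing context into an efficiently managed one and significantly reduces visual redundancy, so the model supports effective context lengths exceeding 20 seconds while maintaining high consistency and outperforms state-of-the-art methods such as LongLive and Infinite-RoPE.
Link: https://arxiv.org/abs/2602.06028
Authors: Shuo Chen,Cong Wei,Sun Sun,Ping Nie,Kai Zhou,Ge Zhang,Ming-Hsuan Yang,Wenhu Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: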
Abstract:Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher’s inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student’s context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds – 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
[CV-9] MambaVF: State Space Model for Efficient Video Fusion
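[Quick Read]: This paper addresses the high computational overhead and poor scalability of existing video fusion methods, which rely heavily on optical flow estimation and feature warping. The key to the solution is MambaVF, an efficient video fusion framework based on state space models (SSMs). By modeling video fusion as a sequential state update process, it captures long-range temporal dependencies with linear complexity and without explicit motion estimation. It also introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism to aggregate information across frames, reducing parameters by up to 92.25% and FLOPs by up to 88.79% and delivering a 2.1x speedup.
Link: https://arxiv.org/abs/2602.06017
Authors: Zixiang Zhao,Yukun Cui,Lilun Deng,Haowen Bai,Haotong Qin,Tao Feng,Konrad Schindler
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: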
Abstract:Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: this https URL
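The idea of a sequential state update with linear cost, scanned in both temporal directions, can be sketched with a scalar linear state space recurrence. This is a generic illustration, not MambaVF's module: the scalar parameters `a`, `b`, `c`, the 1-D "frame" signal, and the choice to sum the two scan directions are all assumptions.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t, output y_t = c * h_t.
    One pass over the sequence, so the cost is linear in its length."""
    h = 0.0
    ys = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t
        ys[t] = c * h
    return ys

def bidirectional_scan(x, a, b, c):
    """Sum of a forward scan and a backward scan over the same sequence."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return forward + backward

# Hypothetical per-frame feature: an impulse at the first frame.
frames = np.array([1.0, 0.0, 0.0])
out = bidirectional_scan(frames, a=0.5, b=1.0, c=1.0)
print(out)
```

The forward pass carries the impulse to later frames with geometric decay, while the backward pass lets each frame also see later frames, so every output position aggregates information from the whole sequence without any explicit alignment step.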
[CV-10] GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
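[Quick Read]: This paper addresses the limitations of the absolute pointwise scoring standard commonly used to evaluate visual generation models, namely stochastic inconsistency and poor alignment with human perception. The key to the solution is GenArena, a unified evaluation framework built on a pairwise comparison paradigm that makes evaluation markedly more stable and better aligned with human perception. Experiments show that simply adopting this pairwise protocol lets open-source models outperform top-tier proprietary models in evaluation performance, improves evaluation accuracy by over 20% relative to the conventional approach, and reaches a Spearman correlation of 0.86, far above the 0.36 of pointwise methods.
Link: https://arxiv.org/abs/2602.06013
Authors: Ruihang Li,Leigang Qu,Jingxu Zhang,Dongnan Gui,Mengde Xu,Xiaosong Zhang,Han Hu,Wenjie Wang,Jiaqi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL , Code: this https URL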
Abstract:The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
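How a set of pairwise judgments turns into a ranking, and how that ranking is compared with a reference leaderboard via Spearman correlation, can be sketched as follows. This is a minimal illustration under stated assumptions: the win counts and reference scores are made-up numbers (not GenArena or LMArena data), ranking by raw win rate is a simplification, and the Spearman formula used assumes no ties.

```python
import numpy as np

def ranks_descending(scores):
    """Rank 1 = highest score. Assumes no ties, to keep the sketch short."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def spearman(scores_a, scores_b):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    d = ranks_descending(scores_a) - ranks_descending(scores_b)
    n = len(d)
    return 1.0 - 6.0 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

# wins[i, j] = how often model i was preferred over model j (10 comparisons per pair).
wins = np.array([
    [0, 7, 8, 9],
    [3, 0, 6, 8],
    [2, 4, 0, 4],
    [1, 2, 6, 0],
])
win_rate = wins.sum(axis=1) / (wins + wins.T).sum(axis=1)

# Hypothetical reference leaderboard scores for the same four models.
reference = np.array([1300.0, 1250.0, 1180.0, 1200.0])

rho = spearman(win_rate, reference)
print(win_rate, rho)
```

Here the pairwise ranking disagrees with the reference only on the order of the last two models, which gives a correlation of 0.8.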
[CV-11] VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation
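[Quick Read]: This paper addresses the insufficient code quality of current multimodal large language models on screenshot-to-code generation: models map screenshots directly to code without any feedback on the visual output, so the results deviate from the target layout and style. The key to the solution is the VisRefiner training framework, which builds difference-aligned supervision that links visual differences between rendered predictions and reference designs to the corresponding code edits, so the model learns how visual changes arise from code changes. On top of this, a reinforcement learning stage enables self-refinement: the model compares the rendered result with the target design, identifies visual differences, and iteratively improves the code, substantially improving single-step generation quality and layout fidelity and giving the model the ability to keep improving its output.
Link: https://arxiv.org/abs/2602.05998
Authors: Jie Deng,Kaichun Yao,Libo Zhang
Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: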
Abstract:Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
[CV-12] RISE-Video: Can Video Generators Decode Implicit World Rules?
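[Quick Read]: This paper addresses the limited ability of current generative video models to internalize and reason over implicit world rules: despite high visual fidelity, the models show marked deficiencies in understanding and simulating complex physical laws, commonsense logic, and temporal dynamics. The key to the solution is RISE-Video, the first reasoning-oriented benchmark framework, which comprises: (1) a multi-dimensional dataset of 467 carefully human-annotated samples covering eight challenging reasoning categories; (2) a four-dimensional evaluation protocol covering reasoning alignment, temporal consistency, physical rationality, and visual quality; and (3) an automated evaluation pipeline based on Large Multimodal Models (LMMs) for efficient, scalable assessment that tracks human judgment. The benchmark provides a tool and reference point for systematically evaluating generative video models and pushing them toward stronger world simulation.
Link: https://arxiv.org/abs/2602.05986
Authors: Mingxin Liu,Shuran Ma,Shibei Meng,Xiangyu Zhao,Zicheng Zhang,Shaofeng Zhang,Zhihang Zhong,Peixian Chen,Haoyu Cao,Xing Sun,Haodong Duan,Xue Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 38 pages, 16 figures, 3 tables; Code: this https URL HuggingFace: this https URL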
Abstract:While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
[CV-13] LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation
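[Quick Read]: This paper addresses the dependence of current controllable video generation methods on external control signals at inference time when synthesizing autonomous driving scenarios, which limits their potential as scalable, general-purpose data engines. The key to the solution is the Localized Semantic Alignment (LSA) framework, which fine-tunes a pre-trained video generation model and improves temporal consistency by aligning semantic features of ground-truth and generated videos in local regions around dynamic objects. Specifically, an off-the-shelf feature extraction model compares semantic features of the ground-truth and generated videos around dynamic objects to form a semantic feature consistency loss, which is combined with the standard diffusion loss for fine-tuning. Experiments show that a single epoch of fine-tuning significantly improves video generation quality, with no additional control signals or computational overhead at inference time.
Link: https://arxiv.org/abs/2602.05966
Authors: Mirlan Karimov,Teodora Spasojevic,Markus Braun,Julian Wiederer,Vasileios Belagiannis,Marc Pollefeys
Affiliations: 1. University of Oxford; 2. ETH Zurich; 3. Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE IV 2026. 8 pages, 3 figures. Code available at this https URL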
Abstract:Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips, localized around dynamic objects, which induces a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any computational overhead.
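The way the LSA abstract combines its two losses can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: features are a single-channel 2D grid, object regions are axis-aligned boxes, the gap is a squared difference of box-pooled means, and `weight` is a hypothetical balancing coefficient.

```python
def box_mean(feature_map, box):
    """Average a 2D feature map (list of rows) inside box = (r0, c0, r1, c1), end-exclusive."""
    r0, c0, r1, c1 = box
    values = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(values) / len(values)


def semantic_consistency_loss(real_feats, gen_feats, boxes):
    """Mean squared gap between real and generated features pooled inside each object box."""
    if not boxes:
        return 0.0
    gaps = [(box_mean(real_feats, b) - box_mean(gen_feats, b)) ** 2 for b in boxes]
    return sum(gaps) / len(gaps)


def total_loss(diffusion_loss, real_feats, gen_feats, boxes, weight=0.1):
    """Standard diffusion loss plus the weighted, localized consistency term.

    `weight` is a hypothetical hyperparameter; the abstract does not state one.
    """
    return diffusion_loss + weight * semantic_consistency_loss(real_feats, gen_feats, boxes)
```

Only regions covered by a box contribute to the extra term, which is the "localized" part of the idea; frames without dynamic objects fall back to the plain diffusion loss.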
[CV-14] Better Source Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching
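[Quick Read]: This paper addresses poorly designed source distributions in conditional generative models based on flow matching, which lead to unstable training and performance bottlenecks. Existing methods generally reuse the standard Gaussian distribution from diffusion models as the source and overlook how conditioning information could be used in the source space, which limits generation quality and convergence speed. The key to the solution is a mechanism for learning a condition-dependent source distribution that makes better use of rich conditioning signals under the flow matching objective. Variance regularization and source-target directional alignment mitigate the distributional collapse and training instability caused by injecting conditioning directly, and a systematic analysis examines how the choice of target representation space affects structured source design, yielding more stable and efficient text-to-image generation.
Link: https://arxiv.org/abs/2602.05951
Authors: Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL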
Abstract:Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
[CV-15] Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation
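[Quick Read]: This paper addresses the difficulty of deploying pre-trained semantic segmentation models across domains when medical images from different clinical centers exhibit distribution shift, focusing on the error accumulation and catastrophic forgetting caused by incremental parameter updates in Continual Test-Time Adaptation (CTTA). The key to the solution is Multi-scale Global-Instance Prompt Tuning (MGIPT), which introduces two complementary components: the Adaptive-scale Instance Prompt (AIP) and the Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight, instance-specific prompts to select the optimal scale and mitigate error accumulation; MGP captures domain-level knowledge at different scales to strengthen resistance to forgetting. The two are fused through a weighted ensemble strategy, giving robust dual-level adaptation that combines global and local information and improving performance on continually changing target domains.
Link: https://arxiv.org/abs/2602.05937
Authors: Lingrui Li, Yanfeng Zhou, Nan Pu, Xin Chen, Zhun Zhong
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, BIBM2025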
Abstract:Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance the scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
[CV-16] CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression
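[Quick Read]: This paper addresses the difficulty of deploying Contrastive Language-Image Pre-training (CLIP) models in resource-constrained settings. The core challenge is that existing compression methods rely on select-based weight inheritance, which tends to degrade feature representation ability at extreme compression ratios. The key to the solution is CLIP-Map, a mapping-based CLIP compression framework that uses learnable matrices to map and combine pre-trained weights through Full-Mapping with Kronecker Factorization, preserving as much of the original weight information as possible. A Diagonal Inheritance Initialization strategy mitigates the distribution shift introduced by the learnable mapping, enabling efficient and stable compression learning.
Link: https://arxiv.org/abs/2602.05909
Authors: Kangjie Zhang, Wenxuan Huang, Xin Zhou, Boxiang Zhou, Dejia Song, Yuan Xie, Baochang Zhang, Lizhuang Ma, Nemo Chen, Xu Tang, Yao Hu, Shaohui Lin
Institutions: East China Normal University; Xiaohongshu Inc.; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: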
Abstract:Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation cost, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining via mask optimization or important weight measurement. However, such select-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shift problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
[CV-17] Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views
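[Quick Read]: This paper addresses the reconstruction of complete 3D cardiac structures from sparse planes in 2D transthoracic echocardiography (TTE), enabling quantitative analysis of the cardiac chambers and left-ventricle myocardium. The key to the solution is a neural implicit function that learns shape priors from 3D segmentations in CT angiography (CTA) and, at test time, jointly optimizes the latent code and the rigid transforms that map planes mimicking the standard apical TTE views into 3D space, thereby reconstructing the 3D cardiac shape. The method achieves lower volume estimation errors than the clinically used Simpson's biplane rule for the left ventricle and left atrium, suggesting that more accurate 3D quantification from 2D TTE is feasible.
Link: https://arxiv.org/abs/2602.05884
Authors: Gino E. Jansen, Carolina Brás, R. Nils Planken, Mark J. Schuuring, Berto J. Bouma, Ivana Išgum
Institutions: Amsterdam UMC; University of Amsterdam; Mayo Clinic; University of Twente
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: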
Abstract:Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 ± 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson’s biplane rule: 4.88 ± 4.26 mL vs. 8.14 ± 6.04 mL, respectively, for the left ventricle; and 6.40 ± 7.37 mL vs. 37.76 ± 22.96 mL, respectively, for the left atrium. This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.
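The two quantities this abstract reports, the Dice coefficient and the Simpson's biplane volume used as the clinical baseline, can be sketched as follows. This uses the textbook definitions rather than the authors' evaluation code; representing a segmentation as a set of voxel indices and the function names are assumptions made for illustration.

```python
import math


def dice(a, b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two collections of voxel indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(a & b) / (len(a) + len(b))


def simpson_biplane_volume(diameters_view1, diameters_view2, long_axis_length):
    """Method of disks: V = (pi / 4) * sum(a_i * b_i) * L / N.

    a_i and b_i are the chamber diameters of disk i measured in two roughly
    orthogonal apical views. With lengths in cm the result is in mL.
    """
    n = len(diameters_view1)
    if n == 0 or n != len(diameters_view2):
        raise ValueError("both views need the same non-zero number of disks")
    disk_sum = sum(a * b for a, b in zip(diameters_view1, diameters_view2))
    return math.pi / 4 * disk_sum * long_axis_length / n
```

The biplane rule assumes each disk is an ellipse whose axes are visible in the two views, which is one reason a full 3D reconstruction can give lower volume errors for irregular chambers.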
[CV-18] EoCD: Encoder only Remote Sensing Change Detection
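[Quick Read]: This paper addresses the high model complexity, heavy computational cost, and redundant decoder design of existing change detection methods. Conventional methods mostly use a Siamese encoder to extract temporal features separately before fusing them (late fusion) and then rely on sophisticated decoders to improve performance, which makes the overall architecture bloated and inefficient. Early fusion schemes reduce encoder overhead but usually underperform late fusion methods. The authors therefore propose Encoder-only Change Detection (EoCD), whose core innovation is to fuse the temporal images early and replace the conventional decoder with a parameter-free multiscale feature fusion module, which substantially lowers model complexity while maintaining strong detection performance. Experiments show that EoCD offers the best balance between performance and inference speed across a variety of encoder architectures, and that model performance depends predominantly on the encoder, with the decoder acting only as an auxiliary component.
Link: https://arxiv.org/abs/2602.05882
Authors: Mubashir Noman, Mustansar Fiaz, Hiyam Debary, Abdul Hannan, Shah Nawaz, Fahad Shahbaz Khan, Salman Khan
Institutions: MBZUAI, UAE; IBM Research, UAE; University of Trento, Italy; Johannes Kepler University Linz, Austria
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: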
Abstract:Being a cornerstone of temporal analysis, change detection has been playing a pivotal role in modern earth observation. Existing change detection methods rely on a Siamese encoder to individually extract temporal features followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve the change detection performance without taking into consideration the complexity of the model. These issues increase the overall computational cost as well as the network’s complexity, which is undesirable. Alternatively, a few methods utilize the early fusion scheme to combine the temporal images. These methods avoid the extra overhead of a Siamese encoder; however, they also rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance compared to late-fusion-based methods. To bridge these gaps, we introduce encoder-only change detection (EoCD), a simple and effective method for the change detection task. The proposed method performs early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module, thereby significantly reducing the overall complexity of the model. EoCD demonstrates the optimal balance between change detection performance and prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrates that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.
[CV-19] Contour Refinement using Discrete Diffusion in Low Data Regime
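[Quick Read]: This paper addresses boundary detection for irregular and translucent objects, particularly when labeled data are scarce and local computational resources are limited, a setting in which conventional image segmentation methods pay too little attention to boundaries. The key to the solution is a lightweight discrete diffusion contour refinement pipeline built around a convolutional neural network (CNN) with self-attention, which conditions on a segmentation mask and iteratively denoises a sparse contour representation to produce a dense, isolated boundary. With a simplified diffusion process, a customized model architecture, and minimal post-processing, the method achieves robust boundary detection in the low-data regime with a dataset of only 500 training images, outperforms several SOTA baselines on the KVASIR medical imaging dataset, and improves inference frame rate by 3.5x.
Link: https://arxiv.org/abs/2602.05880
Authors: Fei Yu Guan, Ian Keefe, Sophie Wilkinson, Daniel D.B. Perrakis, Steven Waslander
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CRV 2026, 8 pages, 6 figures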
Abstract:Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour given a dataset of 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5x.
[CV-20] Pathwise Test-Time Correction for Autoregressive Long Video Generation
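[Quick Read]: This paper addresses error accumulation in distilled autoregressive diffusion models during long-sequence video generation, in particular the failure of existing Test-Time Optimization (TTO) methods to mitigate drift over long generations because of unstable reward landscapes and hypersensitive distilled parameters. The key to the solution is Test-Time Correction (TTC), a training-free method that uses the initial frame as a stable reference anchor to calibrate the intermediate stochastic states along the sampling trajectory, extending generation length with negligible computational overhead while matching the video quality of resource-intensive training-based methods.
Link: https://arxiv.org/abs/2602.05871
Authors: Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: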
Abstract:Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
[CV-21] Self-Supervised Learning with a Multi-Task Latent Space Objective
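[Quick Read]: This paper addresses the training instability that the multi-crop strategy causes in predictor-based self-supervised learning (SSL) frameworks such as BYOL, SimSiam, and MoCo v3. The key to the solution is to abandon the single predictor shared across all views and instead assign a separate predictor to each view type (for example global, local, and cutout views), which stabilizes multi-crop training. The authors further treat each spatial transformation as a distinct alignment task and add cutout views, which mask part of the image before encoding to strengthen representation learning. The result is a unified, stable asymmetric Siamese SSL framework that applies to backbones such as ResNet and ViT and consistently improves performance on ImageNet.
Link: https://arxiv.org/abs/2602.05845
Authors: Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans
Institutions: ESAT-PSI, KU Leuven, Belgium; VIB.AI, KU Leuven, Belgium; CVL, ETH Zürich, Switzerland; INSAIT, Sofia University, Bulgaria; TRACE vzw
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: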
Abstract:Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which incorporates small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
[CV-22] UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
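[Quick Read]: This paper addresses two challenges that online reinforcement learning (RL) faces in graphical user interface (GUI) agent tasks: inefficient credit assignment in long-horizon tasks, and repeated errors across tasks caused by the lack of experience transfer. The key to the solution is the UI-Mem framework, which introduces a Hierarchical Experience Memory that stores high-level workflows, subtask skills, and failure patterns in structured form as parameterized templates, enabling cross-task and cross-application transfer. A Stratified Group Sampling mechanism injects different levels of guidance across the trajectories in each rollout group, which maintains outcome diversity and drives the unguided policy to internalize guided behaviors, and a Self-Evolving Loop continuously abstracts new strategies and errors so that the memory evolves together with the agent's policy.
Link: https://arxiv.org/abs/2602.05832
Authors: Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, Hongsheng Li
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 16 figures. Project page: this https URL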
Abstract:Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent’s evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: this https URL
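[CV-23] Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning
[Quick Read]: This paper addresses the performance bottleneck in video reasoning caused by models' limited perceptual ability and by the representational mismatch of text-centric Chain-of-Thought methods. The key to the solution is Weaver, an end-to-end trainable multimodal reasoning agentic system whose policy model dynamically invokes diverse tools during reasoning to progressively acquire crucial visual cues and build authentic multimodal reasoning trajectories. A reinforcement learning algorithm lets the system freely explore how to use and combine the tools on trajectory-free data, improving performance on complex video reasoning tasks, especially those involving long videos.
Link: https://arxiv.org/abs/2602.05829
Authors: Yudi Shi, Shangzhe Di, Qirui Chen, Qinian Wang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: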
Abstract:Video reasoning constitutes a comprehensive assessment of a model’s capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
[CV-24] Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
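[Quick Read]: This paper addresses the dependence of Vision-Language Navigation (VLN) on detailed and verbose language instructions. The core challenge is Beyond-the-View Navigation (BVN), in which an agent must locate distant, unseen targets from high-level intent alone, without dense step-by-step guidance. Existing methods based on large language models (LLMs) struggle with long-range planning because of short-horizon supervision, and simply extending the supervision horizon destabilizes training. The key insight is that video generation models inherently benefit from long-horizon supervision to align with language instructions, and the paper introduces them to BVN for the first time. It further proposes SparseVideoNav, which achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, a 27x speed-up over the unoptimized counterpart. In real-world zero-shot tests it reaches 2.5x the success rate of state-of-the-art LLM baselines and realizes BVN in challenging night scenes for the first time.
Link: https://arxiv.org/abs/2602.05827
Authors: Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: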
Abstract:Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
[CV-25] NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects
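[Quick Read]: This paper addresses the challenging problem of novel view synthesis (NVS) of handheld objects in real-world environments using only RGB inputs, with a focus on the performance bottleneck under unconstrained handheld conditions. The key to the solution is NVS-HO, the first benchmark built specifically for this setting, which contains two complementary RGB sequences per object: a handheld sequence, used to learn the object's full appearance, and a board sequence, in which a ChArUco board provides accurate camera poses and the ground-truth images for evaluation. Using a classical Structure-from-Motion (SfM) pipeline and a state-of-the-art pre-trained neural network (VGGT) as pose estimators, and training NVS models based on NeRF and Gaussian Splatting, the experiments reveal significant performance gaps in current methods under unconstrained handheld conditions and highlight the need for more robust algorithms.
Link: https://arxiv.org/abs/2602.05822
Authors: Musawar Ali, Manuel Carranza-García, Nicola Fioraio, Samuele Salti, Luigi Di Stefano
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: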
Abstract:We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
[CV-26] Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
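[Quick Read]: This paper addresses the high latency and memory footprint caused by the large number of visual tokens that Vision-Language Models (VLMs) generate at inference, where balancing local evidence and global context under training-free compression remains challenging. The key to the solution is Focus-Scan-Refine (FSR), a pruning framework inspired by human cognition. It first focuses on key evidence by combining visual importance with instruction relevance, avoiding a bias toward regions that are visually salient but irrelevant to the query. It then scans for complementary context conditioned on the focused set, selecting the tokens that differ most from the focused evidence. Finally, it refines the scanned context by aggregating nearby information into the scan anchors through similarity-based assignment and score-weighted merging, without increasing the token budget.
Link: https://arxiv.org/abs/2602.05809
Authors: Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
Institutions: Harbin Institute of Technology; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: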
Abstract:Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at this https URL
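The three FSR stages can be sketched on toy token vectors. This is an illustrative simplification, not the authors' implementation: the score is assumed to be the product of importance and relevance, dissimilarity is measured by cosine similarity, and all leftover tokens (not only nearby ones) are merged into their most similar scan anchor.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def focus_scan_refine(tokens, importance, relevance, n_focus, n_scan):
    """Return n_focus kept tokens followed by n_scan merged anchor tokens."""
    n = len(tokens)
    score = [importance[i] * relevance[i] for i in range(n)]  # assumed combination

    # Focus: keep the tokens that are both salient and relevant to the query.
    focus = sorted(range(n), key=lambda i: -score[i])[:n_focus]
    rest = [i for i in range(n) if i not in focus]

    # Scan: pick the remaining tokens least similar to anything already focused.
    def max_sim_to_focus(i):
        return max((cosine(tokens[i], tokens[f]) for f in focus), default=0.0)

    scan = sorted(rest, key=max_sim_to_focus)[:n_scan]
    leftovers = [i for i in rest if i not in scan]

    # Refine: assign each leftover to its most similar anchor, then merge by score.
    groups = {a: [a] for a in scan}
    if scan:
        for i in leftovers:
            best = max(scan, key=lambda a: cosine(tokens[i], tokens[a]))
            groups[best].append(i)

    merged = []
    for a in scan:
        members = groups[a]
        weights = [score[i] for i in members]
        if sum(weights) == 0.0:
            weights = [1.0] * len(members)  # fall back to a plain average
        total = sum(weights)
        dim = len(tokens[a])
        merged.append([
            sum(w * tokens[i][d] for w, i in zip(weights, members)) / total
            for d in range(dim)
        ])
    return [tokens[i] for i in focus] + merged
```

The output always has `n_focus + n_scan` tokens (when enough input tokens exist), which mirrors the claim that refinement adds information without increasing the token budget.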
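[CV-27] Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
[Quick Read]: This paper addresses the brittleness of Vision-Language Models (VLMs) on target-centric allocentric spatial queries, which require shifting from the observed viewpoint to a target-centric reference frame, something existing VLMs handle poorly through implicit reasoning. The key to the solution is Allocentric Perceiver, a training-free strategy that uses off-the-shelf geometric experts to recover metric 3D states from one or more images and then instantiates a query-conditioned allocentric reference frame aligned with the semantic intent of the instruction. By deterministically transforming the reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, it turns the implicit mental rotation inside the model into explicit computation. This yields gains of about 10% on allocentric tasks across multiple benchmarks while maintaining strong egocentric performance.
Link: https://arxiv.org/abs/2602.05789
Authors: Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
Institution: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: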
Abstract:With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction’s semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains (~10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
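The "explicit computation" that replaces mental rotation is, at its simplest, a rigid change of coordinates into the target's frame. The sketch below is a top-down 2D illustration under assumed conventions (yaw measured counter-clockwise from the world x-axis, target frame axes "forward" and "left"); it is not the paper's pipeline, which works with reconstructed metric 3D states.

```python
import math


def to_target_frame(point, target_pos, target_yaw):
    """Express a world-frame 2D point in the target's own frame.

    Returns (forward, left): how far the point lies ahead of the target and
    how far to its left. Negative values mean behind or to the right.
    """
    dx = point[0] - target_pos[0]
    dy = point[1] - target_pos[1]
    c, s = math.cos(target_yaw), math.sin(target_yaw)
    forward = dx * c + dy * s   # projection onto the facing direction
    left = -dx * s + dy * c     # projection onto the left-hand direction
    return forward, left


def describe(point, target_pos, target_yaw, eps=1e-9):
    """Turn the transformed coordinates into a coarse textual relation."""
    forward, left = to_target_frame(point, target_pos, target_yaw)
    side = "left" if left > eps else "right" if left < -eps else "center"
    depth = "front" if forward > eps else "behind" if forward < -eps else "level"
    return side, depth
```

For example, a point at `(1, 0)` is to the right of a target at the origin that faces along the world y-axis, even though from a camera looking along the y-axis from behind both, the same relation could flip for a target facing the camera. A structured relation like this could then be placed in the prompt in place of asking the model to imagine the rotation.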
[CV-28] ReText: Text Boosts Generalization in Image-Based Person Re-identification
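[Quick Read]: This paper addresses the limited generalization of cross-domain image-based person re-identification (Re-ID), that is, how to recognize individuals accurately in unseen scenes or camera deployments without retraining. The key to the solution is ReText, a method trained on a mixture of multi-camera and single-camera data in which textual descriptions enrich the semantic information of the single-camera data. During training, ReText jointly optimizes three tasks: Re-ID on multi-camera data, image-text matching, and text-guided image reconstruction on single-camera data. This multimodal joint learning improves cross-domain Re-ID performance.
Link: https://arxiv.org/abs/2602.05785
Authors: Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin
Institutions: Tevian; Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: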
Abstract:Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
[CV-29] FMPose3D: monocular 3D pose estimation via flow matching
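[Quick Read]: This paper addresses depth ambiguity and occlusion in monocular 3D pose estimation, which make it hard for conventional methods to produce accurate and diverse 3D pose predictions. The authors propose FMPose3D, a generative framework based on Flow Matching (FM) whose core idea is to cast 3D pose estimation as a conditional distribution transport problem: by solving an Ordinary Differential Equation (ODE), samples are continuously transported from a standard Gaussian prior to the distribution of plausible 3D poses given the 2D input. Unlike diffusion models, whose denoising process needs many iterative steps, FM generates multiple 3D pose hypotheses efficiently with only a few integration steps, and different noise seeds naturally yield diversity. A Reprojection-based Posterior Expectation Aggregation (RPEA) module then approximates the Bayesian posterior expectation over the hypotheses to output a single accurate prediction. The method achieves leading performance on the human 3D pose datasets Human3.6M and MPI-INF-3DHP and on the animal 3D pose datasets Animal3D and CtrlAni3D, demonstrating cross-domain generalization.
Link: https://arxiv.org/abs/2602.05755
Authors: Ti Wang,Xiaohang Yu,Mackenzie Weygandt Mathis
Affiliations: École Polytechnique Fédérale de Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: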
Abstract:Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at this https URL.
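The sampling procedure described here (integrate a velocity field for a few steps from Gaussian noise, once per seed, then combine the hypotheses) can be sketched as follows. This is only an illustration under stated assumptions: `toy_velocity` is a hand-written stand-in for the learned network, and the softmax-style weighting in `aggregate` is a hypothetical substitute for the paper's RPEA module, whose exact form is not given in the abstract.

```python
import math
import random

def euler_sample(velocity, x0, cond, steps=4):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity field: a straight-line flow toward `cond`.
    (Assumption for illustration; a real model would be a neural network.)"""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]

def aggregate(hypotheses, errors, temperature=1.0):
    """Weighted mean of hypotheses; lower (e.g. reprojection) error gets more weight.
    Hypothetical weighting, not the paper's RPEA formula."""
    weights = [math.exp(-e / temperature) for e in errors]
    total = sum(weights)
    dim = len(hypotheses[0])
    return [sum(w * h[d] for w, h in zip(weights, hypotheses)) / total
            for d in range(dim)]

rng = random.Random(0)
target = [1.0, 2.0, 3.0]
# Different noise seeds give different starting points for the same ODE.
hyps = [euler_sample(toy_velocity, [rng.gauss(0, 1) for _ in target], target)
        for _ in range(3)]
print(aggregate(hyps, errors=[0.1, 0.2, 0.3]))
```

With this toy field every seed lands on the same target, so the hypotheses coincide; a learned conditional field would instead spread samples over the plausible poses for the given 2D input.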
[CV-30] Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing
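[Quick Read]: This paper addresses the limited biological realism of conventional reservoir computing (RC), which relies on artificial recurrent models to approximate neural dynamics. The core challenge is how to use a real biological neural network as the computational substrate for efficient, biologically grounded pattern recognition. The key to the solution is a biological reservoir computing (BRC) system in which a network of in vitro cultured cortical neurons serves as the physical reservoir; a high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout of neural activity, yielding a high-dimensional biological feature representation. A linear readout layer (single-layer perceptron) is then trained to perform static visual pattern recognition. Experiments show that, despite variability in biological responses arising from noise, spontaneous activity, and inter-session differences, the system consistently produces feature representations that support accurate classification, demonstrating that living neural networks are effective for static visual recognition.
Link: https://arxiv.org/abs/2602.05737
Authors: Luca Ciampi,Ludovico Iannello,Fabrizio Tonelli,Gabriele Lagani,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato
Affiliations: ISTI-CNR; Scuola Normale Superiore
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: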
Abstract:In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses (arising from noise, spontaneous activity, and inter-session differences), the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
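The division of labour in reservoir computing (a fixed, untrained nonlinear system produces states, and only a linear readout is trained) can be shown with a tiny software stand-in. This sketch assumes a fixed tanh projection `W_RES` in place of the living neurons and HD-MEA recordings; all names and values are hypothetical and chosen only for illustration.

```python
import math

# Fixed "reservoir" weights: never trained, standing in for the biological network.
W_RES = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]

def reservoir_state(x):
    """Map an input pattern to a higher-dimensional nonlinear response."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_RES]

def train_readout(states, labels, epochs=10):
    """Train a single-layer perceptron (labels in {+1, -1}) on reservoir states."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for s, y in zip(states, labels):
            score = sum(wi * si for wi, si in zip(w, s)) + b
            pred = 1 if score > 0 else -1
            if pred != y:  # perceptron rule: update only on mistakes
                w = [wi + y * si for wi, si in zip(w, s)]
                b += y
    return w, b

def predict(w, b, state):
    return 1 if sum(wi * si for wi, si in zip(w, state)) + b > 0 else -1

inputs = [[1.0, 0.0], [0.0, 1.0]]
labels = [1, -1]
states = [reservoir_state(x) for x in inputs]
w, b = train_readout(states, labels)
print([predict(w, b, s) for s in states])
```

Only `w` and `b` are learned; the reservoir itself is treated as a black box, which is what lets a physical or biological system take its place.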
[CV-31] Depth as Prior Knowledge for Object Detection
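[Quick Read]: This paper addresses the performance drop of object detectors on small and distant objects caused by scale variation, low resolution, and background clutter; in safety-critical settings, reliable detection of these objects is essential for safe planning. The key to the solution is DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature. It combines Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training with Depth-Aware Confidence Thresholding (DCT) during inference to systematically mitigate depth-induced degradation. Without modifying detector architectures, adding sensors, or incurring performance costs, it improves small-object detection by up to +9% mAP_S and +7% mAR_S.
Link: https://arxiv.org/abs/2602.05730
Authors: Moussa Kassem Sbeyti,Nadja Klein
Affiliations: Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication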
Abstract:Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP_S and +7% mAR_S for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at this https URL.
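One way to picture depth used as a prior rather than as an input feature is to let depth rescale per-object losses during training and relax the confidence threshold for distant detections at inference. The sketch below assumes simple linear schedules; the formulas, parameter names (`alpha`, `base_thr`, `min_thr`), and function names are hypothetical and are not the paper's DLW or DCT definitions.

```python
def normalized_depth(depth, max_depth):
    """Clamp depth / max_depth into [0, 1]."""
    return min(max(depth / max_depth, 0.0), 1.0)

def depth_weight(depth, max_depth, alpha=1.0):
    """Loss weight that grows linearly with depth (assumed schedule)."""
    return 1.0 + alpha * normalized_depth(depth, max_depth)

def weighted_loss(losses, depths, max_depth, alpha=1.0):
    """Depth-weighted mean of per-object losses."""
    weights = [depth_weight(d, max_depth, alpha) for d in depths]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

def depth_aware_filter(detections, max_depth, base_thr=0.5, min_thr=0.2):
    """Keep (score, depth) detections whose score clears a threshold that
    drops linearly from base_thr (near) to min_thr (far)."""
    kept = []
    for score, depth in detections:
        d = normalized_depth(depth, max_depth)
        threshold = base_thr - (base_thr - min_thr) * d
        if score >= threshold:
            kept.append((score, depth))
    return kept

dets = [(0.3, 90.0), (0.3, 5.0), (0.6, 5.0)]
print(depth_aware_filter(dets, max_depth=100.0))
```

In this toy run the low-confidence detection at 90 m survives while an equally confident one at 5 m is dropped, and neither function touches the detector architecture.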
[CV-32] Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification
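[Quick Read]: This paper addresses the difficulty current multimodal embedding models have in capturing both global semantics and fine-grained perceptual information in complex scenarios, in particular the limitation that embedding methods based on Multimodal Large Language Models (MLLMs) capture only global semantic features. The key to the solution is Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), which prompts the MLLM to generate multiple embeddings focused on different dimensions of semantic information and aggregates them adaptively and smoothly. Combined with the Explicit Gradient Amplification (EGA) technique, it strengthens learning from in-batch hard negatives without fine-grained editing of the dataset, significantly improving performance on both general and fine-grained understanding tasks.
Link: https://arxiv.org/abs/2602.05729
Authors: Lexiang Hu,Youze Xue,Dian Li,Gang Liu,Zhouchen Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: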
Abstract:Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations – CLIP-based and MLLM-based embedding models – both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
[CV-33] Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
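[Quick Read]: This paper addresses the difficulty models have in understanding the temporal relationships within an action in Point-supervised Temporal Action Localization (PTAL), where only a single frame per action instance is annotated. Existing methods typically rely only on snippet-level classification and lack explicit modeling of an action's temporal structure, which limits their ability to identify complete action intervals. The key to the solution is a multi-task learning framework with three self-supervised temporal understanding tasks (Action Completion, Action Order Understanding, and Action Regularity Understanding) that fully exploits point annotations to strengthen the model's perception of the temporal consistency of actions across videos, thereby improving localization accuracy.
Link: https://arxiv.org/abs/2602.05718
Authors: Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang
Affiliations: University of Chinese Academy of Sciences; Macquarie University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: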
Abstract:Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicitly modeling the temporal relationships among frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model’s temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
[CV-34] Poster: Camera Tampering Detection for Outdoor IoT Systems
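[Quick Read]: This paper addresses monitoring failures of outdoor smart cameras that capture still images, caused by deliberate vandalism or harsh environmental conditions; camera tampering is particularly hard to detect without a continuous sequence of video frames. The key to the solution is two complementary approaches: a rule-based detection method, suited to resource-constrained scenarios where a prolonged calibration phase is impractical, and a deep-learning-based method that achieves higher accuracy in real-world applications. The authors also release benchmark datasets with normal, blurred, and rotated images to support the development and evaluation of algorithms in this area.
Link: https://arxiv.org/abs/2602.05706
Authors: Shadi Attarha,Kanaga Shanmugi,Anna Förster
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 2024 INTERNATIONAL CONFERENCE ON EMBEDDED WIRELESS SYSTEMS AND NETWORKS (EWSN)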
Abstract:Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.
[CV-35] ShapeUP: Scalable Image-Conditioned 3D Editing
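[Quick Read]: This paper addresses the difficulty current 3D editing frameworks have in balancing visual controllability, geometric consistency, and scalability. Optimization-based methods are slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation is bound by frozen priors and cannot benefit from scaling. The key to the solution is ShapeUP, a scalable, image-prompted 3D editing framework that formulates editing as supervised latent-to-latent translation within a native 3D representation, building on the strong generative prior of a pretrained 3D foundation model and adapting it to editing through supervised training. Specifically, ShapeUP trains a 3D Diffusion Transformer (DiT) on triplets of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, enabling fine-grained control over local and global edits, preserving structural consistency with the original asset, and achieving implicit localization without explicit masks.
Link: https://arxiv.org/abs/2602.05676
Authors: Inbar Gat,Dana Cohen-Bar,Guy Levy,Elad Richardson,Daniel Cohen-Or
Affiliations: Aigency.ai (United States); Tel Aviv University; Runway (United States)
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: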
Abstract:Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
[CV-36] Enhancing Personality Recognition by Comparing the Predictive Power of Traits Facets and Nuances
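[Quick Read]: This paper addresses the poor generalization of personality recognition models that use broad trait scores as labels, given that behavior is diverse and context-dependent. The key to the solution is to use the finer-grained hierarchical levels of the Big-Five personality model (traits, facets, and nuances) and to train a transformer-based model with cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms on the UDIVA v0.5 dataset. Results show that models trained on nuance-level labels clearly outperform facet-level and trait-level models, reducing mean squared error by up to 74%.
Link: https://arxiv.org/abs/2602.05650
Authors: Amir Ansari,Jana Subirana,Bruna Silva,Sergio Escalera,David Gallardo-Pujol,Cristina Palmero
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the 2025 13th International Conference on Affective Computing and Intelligent Interaction (Late Breaking Results)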
Abstract:Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context-dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.
[CV-37] UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
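[Quick Read]: This paper addresses the problem that current foundation-model approaches to surgical video analysis rely heavily on pixel-level reconstruction objectives, so that model capacity is spent on low-level visual details (such as smoke, specular reflections, and fluid motion) rather than on the semantic structures essential for surgical understanding. The core of the solution is UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Its key technical innovations are: 1) motion-guided latent prediction, which prioritizes semantically meaningful regions; 2) spatiotemporal affinity self-distillation, which enforces relational consistency; and 3) feature diversity regularization, which prevents representation collapse in texture-sparse surgical scenes. The method significantly improves performance on multiple surgical video understanding tasks, establishing a new standard for universal, motion-oriented surgical video understanding.
Link: https://arxiv.org/abs/2602.05638
Authors: Jinlin Wu,Felix Holm,Chuxi Chen,An Wang,Yaxin Hu,Xiaofan Ye,Zelin Zang,Miao Xu,Lihua Zhou,Huai Liao,Danny T. M. Chan,Ming Feng,Wai S. Poon,Hongliang Ren,Dong Yi,Nassir Navab,Gaofeng Meng,Jiebo Luo,Hongbin Liu,Zhen Lei
Affiliations: Chinese Academy of Sciences; Hong Kong Institute of Science and Innovation; University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems; The Chinese University of Hong Kong; Technical University of Munich; Hong Kong University Shenzhen Hospital; Sun Yat-sen University; Peking Union Medical College Hospital; Department of Neurosurgery, Chinese Academy of Medical Sciences and Peking Union Medical College
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: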
Abstract:While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details (such as smoke, specular reflections, and fluid motion) rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
[CV-38] ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing
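[Quick Read]: This paper addresses two challenges in Automated Driving System (ADS) testing: the difficulty of generating complex, high-risk scenarios that violate traffic laws, and the failure of existing methods to fully account for multi-vehicle interactions and legal compliance in critical situations. The key to the solution is ROMAN, a new scenario generation approach that combines a multi-head attention network with a traffic law weighting mechanism. The multi-head attention network models the complex interactions among vehicles, traffic signals, and other factors, while the traffic law weighting mechanism uses a risk weighting module based on a Large Language Model (LLM) to quantify the risk of violations along two dimensions, severity and occurrence. This allows ROMAN to systematically generate high-risk violation scenarios covering every clause of the traffic laws, making ADS testing more thorough and targeted.
Link: https://arxiv.org/abs/2602.05629
Authors: Jianlei Chi,Yuzhen Wu,Jiaxuan Hou,Xiaodong Zhang,Ming Fan,Suhui Sun,Weijun Dai,Bo Li,Jianguo Sun,Jun Sun
Affiliations: Hangzhou Institute of Technology, Xidian University; Qingdao Port International Co., Ltd.; Shandong Port Qingdao Port Group Co., Ltd.; University of Science and Technology of China; Xi’an Jiaotong University; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The manuscript includes 13 pages, 8 tables, and 7 figures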
Abstract:The Automated Driving System (ADS) acts as the brain of autonomous vehicles, responsible for their safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: limited ability to generate complex and high-risk law-breaking scenarios, and failure to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.
[CV-39] Unified Sensor Simulation for Autonomous Driving
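[Quick Read]: This paper addresses the accuracy and flexibility of sensor simulation for autonomous driving, in particular the geometric distortion and appearance inconsistency that rendering frameworks based on 3D Gaussian representations exhibit in complex dynamic environments and with special sensors such as spherical LiDAR. The key solutions are as follows. First, a generalized rolling-shutter model tailored to autonomous driving accurately simulates time-dependent sensor distortions in dynamic environments. Second, because cyclic projection and time discontinuities at azimuth boundaries cause incorrect particle projection for spherical cameras such as LiDARs, a phase modeling mechanism explicitly handles the temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. Finally, an extended 3D Gaussian representation with two distinct opacity parameters resolves mismatches between geometry and color distributions, improving geometric consistency and visual realism.
Link: https://arxiv.org/abs/2602.05617
Authors: Nikolay Patakin,Arsenii Shirokov,Anton Konushin,Dmitry Senushkin
Affiliations: Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: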
Abstract:In this work, we introduce XSIM, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with a generalized rolling-shutter model tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts for temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at this https URL.
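The azimuth-boundary problem mentioned in this abstract comes from the angle wrapping around at ±π: two nearby 3D points on either side of the seam get azimuths that differ by almost 2π. The following minimal sketch illustrates only that wrap-around effect; it is not XSIM's phase modeling mechanism, and the helper names are hypothetical.

```python
import math

def azimuth(x, y):
    """Azimuth of a point in the sensor's horizontal plane, in (-pi, pi]."""
    return math.atan2(y, x)

def wrapped_diff(a, b):
    """Signed angular difference a - b mapped into [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def crosses_seam(a, b):
    """True when the naive difference is larger than the wrapped one allows."""
    return abs(a - b) > math.pi

# Two points just behind the sensor, 2 cm apart, on opposite sides of the seam.
a = azimuth(-1.0, 0.01)
b = azimuth(-1.0, -0.01)
print("naive difference  :", a - b)
print("wrapped difference:", wrapped_diff(a, b))
print("crosses seam      :", crosses_seam(a, b))
```

A projected Gaussian whose sigma points straddle the seam would look almost 2π wide under the naive difference, which is one way incorrect particle projection can arise in a cyclic range image.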
[CV-40] Shiva-DiT: Residual-Based Differentiable Top-k Selection for Efficient Diffusion Transformers
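[Quick Read]: This paper addresses the high computational cost of Diffusion Transformers (DiTs) caused by the quadratic complexity of self-attention, as well as the inability of existing pruning methods to satisfy differentiability, efficiency, and static hardware budget constraints at the same time. The key to the solution is Shiva-DiT, whose core innovation is Residual-Based Differentiable Top-k Selection: a residual-aware straight-through estimator enforces deterministic token counts for static compilation while preserving end-to-end learnability. A Context-Aware Router and an Adaptive Ratio Policy further learn an adaptive pruning schedule autonomously. On mainstream models such as SD3.5, the method achieves a 1.54× wall-clock speedup, clearly outperforms existing baselines, and effectively eliminates ragged tensor overheads.
Link: https://arxiv.org/abs/2602.05605
Authors: Jiaji Zhang,Hailiang Zhao,Guoxuan Zhu,Ruichao Sun,Jiaju Wu,Xinkui Zhao,Hanlin Tang,Weiyi Lu,Kan Liu,Tao Lan,Lin Qu,Shuiguang Deng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: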
Abstract:Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing pruning methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required for hardware overhead. To address this, we propose Shiva-DiT, which effectively reconciles these conflicting requirements via Residual-Based Differentiable Top-k Selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and Adaptive Ratio Policy to autonomously learn an adaptive pruning schedule. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54× wall-clock speedup with superior fidelity compared to existing baselines, effectively eliminating ragged tensor overheads.
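The tension between a fixed token budget and differentiability can be illustrated with a plain straight-through estimator: the forward pass uses a hard top-k mask with exactly k ones, while the backward pass pretends the mask was a smooth function of the scores. This sketch is the generic technique, not the paper's residual-aware variant; the sigmoid surrogate and the function names are assumptions.

```python
import math

def topk_mask(scores, k):
    """Hard 0/1 mask that keeps exactly the k highest-scoring tokens."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def straight_through(scores, k, upstream_grad):
    """Forward: hard top-k mask (static count k).
    Backward: gradient of a sigmoid surrogate, so every score gets a signal."""
    mask = topk_mask(scores, k)
    soft = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    grad_scores = [g * p * (1.0 - p) for g, p in zip(upstream_grad, soft)]
    return mask, grad_scores

scores = [0.1, 2.0, -1.0, 0.7]
mask, grads = straight_through(scores, k=2, upstream_grad=[1.0, 1.0, 1.0, 1.0])
print(mask)   # always exactly k ones, so downstream tensor shapes are static
print(grads)  # non-zero even for dropped tokens
```

Because the number of kept tokens never changes, the pruned sequence has a fixed length, which is the property the abstract ties to static compilation and to avoiding ragged tensors.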
[CV-41] Multi-instance robust fitting for non-classical geometric models
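[Quick Read]: This paper addresses the robust fitting of multiple instances of non-classical models (such as spiral curves, procedural character models, and free-form surfaces) from noisy data; existing methods are mostly limited to single-instance reconstruction and handle outliers poorly. The key to the solution is a new estimator based on the model-to-data error, which handles outliers without a predefined error threshold. Because this estimator is non-differentiable with respect to the model parameters, a meta-heuristic algorithm is used as the optimizer to search for the global optimum, enabling robust multi-instance fitting of non-classical models.
Link: https://arxiv.org/abs/2602.05602
Authors: Zongliang Zhang,Shuxiang Li,Xingwang Huang,Zongyue Wang
Affiliations: Jimei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: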
Abstract:Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method is demonstrated through experimental results on various non-classical models. The code is available at this https URL.
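To see why measuring error from the model to the data can tolerate outliers without a threshold, compare it with the usual data-to-model direction. The sketch below uses a circle as the model for brevity and a simple mean nearest-neighbour distance; both choices, and all function names, are assumptions for illustration rather than the paper's estimator.

```python
import math

def sample_circle(cx, cy, r, n):
    """Sample n points on a circle (stand-in for sampling any parametric model)."""
    return [(cx + r * math.cos(2 * math.pi * i / n),
             cy + r * math.sin(2 * math.pi * i / n)) for i in range(n)]

def mean_nearest_distance(sources, targets):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(s, t) for t in targets) for s in sources) / len(sources)

def model_to_data_error(params, data, n_samples=72):
    """Each model sample looks for nearby data; far-away outliers are never matched."""
    return mean_nearest_distance(sample_circle(*params, n_samples), data)

def data_to_model_error(params, data, n_samples=72):
    """Each data point, outliers included, is charged its distance to the model."""
    return mean_nearest_distance(data, sample_circle(*params, n_samples))

data = sample_circle(0.0, 0.0, 1.0, 36) + [(100.0, 100.0)]  # inliers + one gross outlier
print("model-to-data, true model :", model_to_data_error((0.0, 0.0, 1.0), data))
print("model-to-data, wrong model:", model_to_data_error((3.0, 0.0, 1.0), data))
print("data-to-model, true model :", data_to_model_error((0.0, 0.0, 1.0), data))
```

The true circle scores well in the model-to-data direction even with the outlier present, while the data-to-model score is dominated by it. The nearest-neighbour minimum is also non-differentiable in the model parameters, consistent with the abstract's choice of a meta-heuristic optimizer.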
[CV-42] CAViT – Channel-Aware Vision Transformer for Dynamic Feature Fusion CVPR25
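[Quick Read]: This paper addresses the static nature of channel-wise feature mixing in Vision Transformers (ViTs): conventional ViTs rely on fixed multilayer perceptrons (MLPs) for channel interaction and cannot adapt to input content. The key to the solution is CAViT, a dual-attention architecture that introduces channel-wise self-attention in each Transformer block in place of the static MLP, dynamically recalibrating feature representations based on global image context. This content-aware token mixing strategy increases expressiveness without adding depth or computational complexity; across several natural and medical image benchmarks it improves accuracy by up to +3.6% while reducing parameter count and floating-point operations (FLOPs) by over 30%.
Link: https://arxiv.org/abs/2602.05598
Authors: Aon Safdar,Mohamed Saadeldin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 (CVPR 25) in the 4th Workshop on Transformers for Visions - T4V ( this https URL ) Accepted for Publication at 33rd International Conference on Artificial Intelligence and Cognitive Science (AICS 2025), where it was shortlisted for Best Paper Award. ( this https URL )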
Abstract:Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce ‘CAViT’, a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
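Channel-wise self-attention treats each feature channel, rather than each spatial token, as the unit that attends to the others, so the channel mixing weights depend on the input. The sketch below is a single-head version with identity query, key, and value projections, which is an assumption made to keep it short; CAViT's actual block layout and projections are not specified in the abstract.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def channel_self_attention(x):
    """x: N tokens x C channels (list of lists). Returns (output N x C, attention C x C)."""
    n, c = len(x), len(x[0])
    cols = [[x[i][j] for i in range(n)] for j in range(c)]  # C x N: one row per channel
    scale = 1.0 / math.sqrt(n)
    # Channel-to-channel similarity, computed from the current input.
    attn = [softmax([scale * sum(a * b for a, b in zip(cols[p], cols[q]))
                     for q in range(c)]) for p in range(c)]
    # Each output channel is an input-dependent mixture of all channels.
    mixed = [[sum(attn[p][q] * cols[q][i] for q in range(c)) for i in range(n)]
             for p in range(c)]
    out = [[mixed[j][i] for j in range(c)] for i in range(n)]  # back to N x C
    return out, attn

tokens = [[1.0, 0.0, 2.0],
          [0.5, 1.0, -1.0]]  # 2 tokens, 3 channels
out, attn = channel_self_attention(tokens)
print(len(out), len(out[0]), len(attn), len(attn[0]))
```

The attention matrix is C x C instead of N x N, so its size depends on the channel count rather than on the number of tokens.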
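[CV-43] EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality
[Quick Read]: This paper addresses accurate, temporally coherent full-body pose tracking for immersive Virtual Reality (VR) applications, in particular the temporal instability, inaccurate lower-body estimation, and lack of real-time performance that egocentric pose estimation methods based on head-mounted cameras exhibit on VR Head-Mounted Displays (HMDs). The key to the solution is EgoPoseVR, an end-to-end framework that integrates headset motion signals with egocentric RGB-D observations through a dual-modality fusion pipeline: a spatiotemporal encoder extracts frame-level and joint-level representations, which are fused via cross-attention to exploit complementary motion cues across modalities; a kinematic optimization module then imposes constraints from the HMD signals to improve the accuracy and stability of pose estimation. The method achieves robust full-body pose tracking without additional body-worn sensors or room-scale tracking systems and outperforms state-of-the-art egocentric pose estimation models.
Link: https://arxiv.org/abs/2602.05590
Authors: Haojie Cheng,Shaun Jing Heng Ong,Shaoyu Cai,Aiden Tat Yang Koh,Fuxi Ouyang,Eng Tat Khoo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)
Comments: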
Abstract:Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
[CV-44] A Mixed Reality System for Robust Manikin Localization in Childbirth Training
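[Quick read]: This paper addresses the shrinking opportunities for medical students to gain hands-on experience with vaginal births, which are limited by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To reduce clinicians' instructional burden and improve trainees' learning efficiency, it proposes a mixed reality (MR) childbirth training system. The key is to extend the passthrough capability of a commercial head-mounted display (HMD) by spatially calibrating an external RGB-D camera, so that physical training objects are integrated visually in real time. A coarse-to-fine localization pipeline first aligns the maternal manikin using fiducial markers to define a delivery region, then registers the pre-scanned neonatal head within that region, so that virtual guiding hands can be overlaid accurately near the manikin. Trainees keep authentic haptic feedback while practising independently, without continuous on-site expert supervision. Experiments show accurate and stable manikin localization on a standalone headset without external computing resources. In a user study with 83 fourth-year medical students, MR training scored significantly higher than VR training in delivery, post-delivery, and overall task performance, and trainees preferred it.
Link: https://arxiv.org/abs/2602.05588
Authors: Haojie Cheng,Chang Liu,Abhiram Kanneganti,Mahesh Arjandas Choolani,Arundhati Tushar Gosavi,Eng Tat Khoo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Graphics (cs.GR)
Comments: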
Abstract:Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians’ instructional burden and enhance trainees’ learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
[CV-45] Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity Weak Observability and Dynamic Effects in SE(3) Pose Estimation
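[Quick read]: This paper addresses how to quantify the influence of individual image features on camera pose estimates. Classical conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds do not explain why dynamic or inconsistent observations can disproportionately distort modern SLAM (Simultaneous Localization and Mapping) and structure-from-motion (SfM) systems. The key is to extend influence function theory to the matrix Lie group SE(3), derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3), and define the Geometric Observability Index (GOI). GOI quantifies the contribution of a single measurement through the curvature operator and the Lie-algebraic structure of the observable subspace. Its spectral decomposition reveals a direct correspondence between weak observability and amplified sensitivity, and in the population regime it coincides with the Fisher information geometry on SE(3). GOI thereby unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability, and it provides lightweight, training-free diagnostic signals for identifying dynamic features and detecting weakly observable configurations without modifying existing SLAM architectures.
Link: https://arxiv.org/abs/2602.05582
Authors: Joe-Mei Feng,Sheng-Wei Yu
Affiliations: Tamkang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: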
Abstract:We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures. 
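As background for the link between weak observability and amplified sensitivity, the classical Euclidean influence function of an M-estimator can be written out. This is textbook material rather than the paper's SE(3) operator, and the symbols $\rho$, $H$, $\lambda_k$, $v_k$ are notation assumed here for illustration only.

```latex
% Euclidean M-estimator: \hat\theta = \arg\min_\theta \sum_i \rho(z_i, \theta)
% Influence of a single measurement z at the population solution \theta_0:
\mathrm{IF}(z) = -H^{-1}\,\nabla_\theta \rho(z, \theta_0),
\qquad
H = \mathbb{E}\!\left[\nabla_\theta^2 \rho(z, \theta_0)\right].

% With the eigendecomposition H = \sum_k \lambda_k v_k v_k^\top (H symmetric positive definite):
\mathrm{IF}(z) = -\sum_k \frac{1}{\lambda_k}\, v_k v_k^\top \nabla_\theta \rho(z, \theta_0).
```

Directions with small curvature $\lambda_k$ scale a measurement's gradient by $1/\lambda_k$, which is the Euclidean analogue of the amplification along weak curvature directions that the abstract describes.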
[CV-46] LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
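[Quick read]: This paper addresses imprecise spatial alignment, object hallucination, and missed detections in open-vocabulary semantic segmentation (OVSS) that stem from image-level pretraining. Existing methods built on vision-language models (VLMs) such as CLIP often lack strong object priors and region-level constraints, which leads to mismatched segmentations in cluttered or ambiguous scenes. The key is LoGoSeg, an efficient single-stage framework with three core components: (i) an object existence prior that dynamically weights relevant categories using global image-text similarity to reduce hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that combines local structural information with global semantic context. The method needs no external mask proposals, additional backbones, or extra datasets, and reports competitive accuracy and strong generalization on six benchmarks.
Link: https://arxiv.org/abs/2602.05578
Authors: Junyang Chen,Xiangbo Lv,Zhiqiang Kou,Xingdong Sheng,Ning Xu,Yiguo Qiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: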
Abstract:Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. However, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
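The object existence prior in (i) can be pictured with a small sketch. The abstract does not give the actual weighting rule, so the softmax weighting, the `temperature` value, and the function names below are assumptions made purely for illustration.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def apply_existence_prior(pixel_scores, global_sims, temperature=0.1):
    """Reweight per-pixel category scores by an image-level prior.

    pixel_scores: one list of category scores per pixel.
    global_sims: global image-text similarity per category (hypothetical values).
    """
    prior = softmax(global_sims, temperature)
    return [[s * w for s, w in zip(row, prior)] for row in pixel_scores]

# Toy example with categories [cat, dog, car]: the image as a whole barely
# matches "car", yet one pixel scores highest for "car".
global_sims = [0.30, 0.28, 0.05]
pixel_scores = [[0.4, 0.3, 0.5]]
weighted = apply_existence_prior(pixel_scores, global_sims)
best_before = pixel_scores[0].index(max(pixel_scores[0]))
best_after = weighted[0].index(max(weighted[0]))
```

In this toy case the unweighted pixel picks "car", while the reweighted scores pick "cat", which is the kind of hallucination suppression the prior is meant to provide.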
[CV-47] LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization
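[Quick read]: This paper addresses the failure of current image manipulation localization methods on instruction-based image editing. Existing methods mainly target inpainting-based forgeries and struggle with the semantic-level edits produced by recent generative AI. The key is LocateEdit-Bench, a dataset of 231K images generated by four cutting-edge editing models and covering three common edit types, together with two multi-metric evaluation protocols for systematically assessing existing localization methods. It provides a benchmark and a research basis for forgery localization under new image editing paradigms.
Link: https://arxiv.org/abs/2602.05577
Authors: Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures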
Abstract:Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising 231 K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.
[CV-48] A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features
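[Quick read]: This paper addresses the misdiagnosis of Atypical Parkinsonian Disorders (APD) in early stages, when clinical features overlap heavily with Parkinson's disease (PD). It focuses on using imaging to separate the APD subtypes, progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), from PD and from each other. The key is a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) that integrates multi-modal inputs: T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their volumetric measurements. By using both spatial image features and quantitative volume information, the approach reaches AUC scores of up to 0.95, supporting imaging biomarkers for early differential diagnosis.
Link: https://arxiv.org/abs/2602.05574
Authors: Mengyu Li,Ingibjörg Kristjánsdóttir,Thilo van Eimeren,Kathrin Giehl,Lotta M. Ellingsen,the ASAP Neuroimaging Initiative
Affiliations: University of Iceland, Faculty of Electrical and Computer Engineering, Reykjavik, Iceland; University of Iceland, Faculty of Medicine, Reykjavik, Iceland; University of Cologne, Faculty of Medicine, Cologne, Germany; University Hospital Cologne, Dept. of Nuclear Medicine and Dept. of Neurology, Cologne, Germany; the ASAP Neuroimaging Initiative
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To be published in Proceedings of SPIE Medical Imaging 2026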
Abstract:Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson’s disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
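For readers less familiar with the reported metric, the AUC of a binary classifier (for example PSP vs. PD) can be computed from ranks alone. The sketch below uses made-up labels and scores, not data from the paper, and the function name `roc_auc` is a placeholder.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a
    random negative, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: label 1 = PSP, label 0 = PD.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 0.95 therefore means that about 95% of such positive/negative pairs are ranked correctly.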
[CV-49] Visual Implicit Geometry Transformer for Autonomous Driving
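[Quick read]: This paper addresses the scalability and generality of geometric perception models for autonomous driving: how to build a model that estimates a continuous 3D occupancy field, adapts to varied sensor configurations, needs no manual annotation, and handles multi-view geometry in a unified way. The key is the Visual Implicit Geometry Transformer (ViGT), whose main ideas are: 1) a calibration-free architecture that lets a single model adapt to different camera rigs; 2) direct estimation of a continuous 3D occupancy field in bird's-eye-view (BEV) space to meet driving-specific requirements; and 3) self-supervised training on synchronized image-LiDAR pairs, which avoids costly manual annotation. Trained on a mixture of five datasets, the model achieves state-of-the-art pointmap estimation and performance comparable to supervised methods on Occ3D-nuScenes.
Link: https://arxiv.org/abs/2602.05573
Authors: Arsenii Shirokov,Mikhail Kuznetsov,Danila Stepochkin,Egor Evdokimov,Daniil Glazkov,Nikolay Patakin,Anton Konushin,Dmitry Senushkin
Affiliations: Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: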
Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a birds-eye-view (BEV) addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at this https URL.
[CV-50] ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors
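[Quick read]: This paper addresses high-fidelity, robust 4D human reconstruction from monocular video, where existing methods either fail under high-deformation motion or depend on templates that propagate errors. Generic reconstruction methods such as 4DGS lack strong vision priors and struggle to capture complex human dynamics, while template-based methods such as HUGS can produce photorealistic results but are sensitive to pose estimation errors and prone to unrealistic artifacts. The key is ShapeGaussian, which follows a two-step pipeline: it first learns a coarse, deformable geometry using pretrained models that estimate data-driven priors, and then refines this geometry with a neural deformation model to capture fine-grained dynamic details. The method relies on 2D vision priors to mitigate the impact of pose estimation errors seen in template-based methods, and it uses multiple reference frames to resolve keypoint invisibility in a template-free manner, yielding more accurate and visually better reconstructions across diverse human motions.
Link: https://arxiv.org/abs/2602.05572
Authors: Zhenxiao Liang,Ning Zhang,Youbao Tang,Ruei-Sung Lin,Qixing Huang,Peng Chang,Jing Xiao
Affiliations: The University of Texas at Austin; PAII Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: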
Abstract:We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
[CV-51] PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds ICRA
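[Quick read]: This paper addresses end-to-end 3D detection of parametric objects in point clouds for robotics, in particular how to obtain both geometric localization and task-relevant attribute estimates under occlusion. The key is the PIRATR framework, which jointly estimates multi-class 6-DoF poses and class-specific parametric attributes (such as a gripper's opening) directly from occlusion-affected point cloud data. It uses modular, class-specific prediction heads, so new object types can be added without redesigning the pipeline. Trained entirely in a synthetic environment, it transfers to real outdoor LiDAR scans without additional fine-tuning and reaches a detection mAP of 0.919, bridging low-level geometric reasoning and actionable world models.
Link: https://arxiv.org/abs/2602.05557
Authors: Michael Schwingshackl,Fabio F. Oberweger,Mario Niedermeyer,Huemer Johannes,Markus Murschitz
Affiliations: AIT Austrian Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 Pages, 11 Figures, Accepted at 2026 IEEE International Conference on Robotics and Automation (ICRA) Vienna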
Abstract:We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper’s opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at this https URL.
[CV-52] IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools ICRA2026
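[Quick read]: This paper addresses the shortage of evaluation data for deploying 6D pose estimation in real manufacturing environments, in particular the lack of challenging, realistic datasets for instance-level and novel-object pose estimation methods. The key is IndustryShapes, a new RGB-D benchmark dataset with five new types of challenging industrial tools and components, captured in realistic industrial assembly settings and ranging from simple scenes to complex multi-object, multi-instance scenes. It is organized into a classic set (4.6k images and 6k annotated poses) and an extended set that supports the evaluation of model-free and sequence-based methods. To the authors' knowledge it is the first dataset to offer RGB-D static onboarding sequences. Evaluations of representative state-of-the-art methods show that there is still room for improvement.
Link: https://arxiv.org/abs/2602.05555
Authors: Panagiotis Sapoutzoglou,Orestis Vaggelis,Athina Zacharia,Evangelos Sartinas,Maria Pateraki
Affiliations: National Technical University of Athens; Stellantis - Centro Ricerche FIAT (CRF)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: To appear in ICRA 2026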
Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products or use synthetic, clean tabletop datasets, or objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, including also object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at this https URL.
[CV-53] VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator
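[Quick read]: This paper addresses highly autonomous, semantically driven navigation for indoor drones (UAVs) in GPS-denied environments, in particular how a drone can complete complex tasks from natural language instructions without extensive manual programming or geometric path planning. The key is the VLN-Pilot framework, which uses the multimodal reasoning of a large vision-language model (VLLM) to ground free-form language instructions in visual observations and produce context-aware flight trajectories. It supports obstacle avoidance and reaction to dynamic events, reduces task-specific engineering, and enables end-to-end instruction-following autonomous flight.
Link: https://arxiv.org/abs/2602.05552
Authors: Bessie Dominguez-Dager,Sergio Suescun-Ferrandiz,Felix Escalona,Francisco Gomez-Donoso,Miguel Cazorla
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: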
Abstract:This paper introduces VLN-Pilot, a novel framework in which a large Vision-and-Language Model (VLLM) assumes the role of a human pilot for indoor drone navigation. By leveraging the multimodal reasoning abilities of VLLMs, VLN-Pilot interprets free-form natural language instructions and grounds them in visual observations to plan and execute drone trajectories in GPS-denied indoor environments. Unlike traditional rule-based or geometric path-planning approaches, our framework integrates language-driven semantic understanding with visual perception, enabling context-aware, high-level flight behaviors with minimal task-specific engineering. VLN-Pilot supports fully autonomous instruction-following for drones by reasoning about spatial relationships, obstacle avoidance, and dynamic reactivity to unforeseen events. We validate our framework on a custom photorealistic indoor simulation benchmark and demonstrate the ability of the VLLM-driven agent to achieve high success rates on complex instruction-following tasks, including long-horizon navigation with multiple semantic targets. Experimental results highlight the promise of replacing remote drone pilots with a language-guided autonomous agent, opening avenues for scalable, human-friendly control of indoor UAVs in tasks such as inspection, search-and-rescue, and facility monitoring. Our results suggest that VLLM-based pilots may dramatically reduce operator workload while improving safety and mission flexibility in constrained indoor environments.
[CV-54] FastVMT: Eliminating Redundancy in Video Motion Transfer FAST ICLR2026
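[Quick read]: This paper addresses the low computational efficiency of video motion transfer (VMT) built on the Diffusion Transformer (DiT) architecture. Existing methods try to accelerate DiT inference but do not remove redundant computation at the structural level. The paper identifies two kinds of redundancy: motion redundancy, because DiT does not exploit the fact that frame-to-frame motion is small and smooth; and gradient redundancy, because gradients change slowly along the diffusion trajectory. For motion redundancy, it masks the attention layers to a local neighborhood so that attention weights are not computed for needlessly distant regions. For gradient redundancy, it proposes an optimization scheme that reuses gradients from previous diffusion steps and skips unnecessary gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or temporal consistency of the generated videos.
Link: https://arxiv.org/abs/2602.05551
Authors: Yue Ma,Zhikai Wang,Tianhao Ren,Mingzhe Zheng,Hongyu Liu,Jiayi Guo,Mark Fong,Yuxuan Xue,Zixiang Zhao,Konrad Schindler,Qifeng Chen,Linfeng Zhang
Affiliations: EPIC Lab, SJTU; HKUST; THU; Meta; ETH Zürich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR2026, Project page: this http URL , Code: this https URL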
Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
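The idea of restricting attention to a local neighborhood can be sketched as a boolean mask over a grid of tokens. The abstract does not specify how FastVMT defines its neighborhood, so the Chebyshev radius, the purely spatial grid, and the function name here are assumptions for illustration.

```python
def local_attention_mask(height, width, radius):
    """Return mask[i][j] == True when tokens i and j lie within `radius`
    grid cells of each other (Chebyshev distance) on a height x width grid."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in coords]
        for (r1, c1) in coords
    ]

# Toy 4x4 token grid with a one-cell neighborhood.
mask = local_attention_mask(4, 4, 1)
kept = sum(sum(row) for row in mask)   # attention pairs that are still computed
total = len(mask) ** 2                 # pairs computed by full attention
```

On this toy grid only 100 of the 256 query-key pairs survive the mask, and the saving grows with grid size because the neighborhood stays fixed while full attention scales quadratically.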
[CV-55] A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments
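[Quick read]: This paper addresses accurate 3D person detection in complex indoor and outdoor scenes, in particular robustness under varying occlusion, distance, and sensor degradation. The key is a systematic comparison of camera-only, LiDAR-only, and camera-LiDAR fusion approaches, with an empirical analysis on the JRDB dataset that exposes the strengths and limits of fusion. The fusion-based model (DAL) consistently outperforms the single-modality models and is more robust under occlusion and at long range, but it remains sensitive to sensor misalignment and certain LiDAR corruptions. The findings underline the value of sensor fusion for reliable 3D person detection and point to vulnerabilities that still need work.
Link: https://arxiv.org/abs/2602.05538
Authors: Malaz Tamim,Andrea Matic-Flierl,Karsten Roscher
Affiliations: Fraunhofer Institute for Cognitive Systems IKS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for VISAPP 2026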
Abstract:Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
[CV-56] SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation ICLR2026
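[Quick read]: This paper addresses hierarchy drift in visual autoregressive (VAR) models at inference time: limited capacity and accumulated error make it hard for the model to keep the coarse-to-fine structure it had during training, which hurts image quality and global coherence. The key is Scaled Spatial Guidance (SSG), a training-free, inference-time guidance mechanism. From an information-theoretic perspective, it ensures that each scale contributes a high-frequency semantic residual not explained by earlier scales, and it uses a frequency-domain procedure, Discrete Spatial Enhancement (DSE), to isolate and sharpen that residual. The result is improved fidelity and diversity while preserving low latency.
Link: https://arxiv.org/abs/2602.05534
Authors: Youngwoo Shin,Jiwan Hur,Junmo Kim
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICLR 2026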
Abstract:Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at this https URL.
[CV-57] Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains
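[Quick read]: This paper addresses the difficulty of learning robust feature representations from small microscopy datasets. The key is to use self-supervised learning (SSL) to pretrain on large unlabeled datasets, obtain transferable representations, and evaluate how well they generalize across microscopy domains with different staining protocols and channel configurations. The study finds that a DINO Vision Transformer pretrained on microscopy data (the Human Protein Atlas) performs best on the OpenCell dataset, indicating that domain-relevant SSL representations transfer effectively to related but distinct microscopy tasks and give strong downstream performance even when labeled data for the target task are limited.
Link: https://arxiv.org/abs/2602.05527
Authors: Ben Isselmann,Dilara Göksu,Andreas Weinmann
Affiliations: Hochschule Darmstadt; Technische Hochschule Würzburg-Schweinfurt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AMEE Conference Proceeding 2025, 11 pages, 2 figures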
Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = $0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
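The macro F1 score quoted above averages per-class F1 values with equal weight, so rare localization classes count as much as common ones. A minimal sketch follows; the class names and predictions are invented for illustration and are not OpenCell data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical localization labels for six cells.
y_true = ["nucleus", "nucleus", "cytosol", "cytosol", "membrane", "membrane"]
y_pred = ["nucleus", "cytosol", "cytosol", "cytosol", "membrane", "nucleus"]
score = macro_f1(y_true, y_pred)
```

Here the per-class scores are 0.8 (cytosol), 2/3 (membrane), and 0.5 (nucleus), and the macro score is their plain average.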
[CV-58] Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification
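[Quick read]: This paper addresses the limited robustness of 3D point cloud classifiers to corruptions such as noise and transformations, which is usually tackled with larger backbones or specialized augmentation. The key is to use the Mapper algorithm for a topology-inspired structural abstraction of the point cloud: a PCA lens, a cubical cover, and density-based clustering partition the cloud into overlapping regions, a region graph is built from their overlaps, and a Graph Isomorphism Network (GIN) performs graph-level classification. With only 0.5M parameters, this lightweight region-graph abstraction achieves competitive and stable accuracy under Noise and Transformation corruptions on ModelNet40-C, suggesting that structural abstraction alone offers an efficient and interpretable source of robustness.
Link: https://arxiv.org/abs/2602.05522
Authors: Jeongbin You,Donggun Kim,Sejun Park,Seungsang Oh
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Geometric Topology (math.GT)
Comments: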
Abstract:Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, and followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
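The Mapper-style decomposition can be sketched in a few lines of standard-library Python. This is not the paper's implementation: the lens is supplied by the caller (a coordinate projection in the toy example instead of PCA), the cover is one-dimensional, and connected components of an eps-neighborhood graph stand in for density-based clustering. All names and parameter values are hypothetical.

```python
from itertools import combinations

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def interval_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal intervals, each padded by `overlap` * length."""
    length = (hi - lo) / n_intervals
    pad = length * overlap
    return [(lo + i * length - pad, lo + (i + 1) * length + pad)
            for i in range(n_intervals)]

def eps_components(points, indices, eps):
    """Connected components of the eps-neighborhood graph on `indices`
    (a simple stand-in for density-based clustering)."""
    unvisited, clusters = set(indices), []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if squared_distance(points[i], points[j]) <= eps * eps]
            for j in near:
                unvisited.remove(j)
                cluster.add(j)
                frontier.append(j)
        clusters.append(frozenset(cluster))
    return clusters

def mapper_graph(points, lens, n_intervals=3, overlap=0.25, eps=1.5):
    """Return (nodes, edges): nodes are point-index sets, edges link nodes
    that share at least one point."""
    values = [lens(p) for p in points]
    nodes = []
    for lo, hi in interval_cover(min(values), max(values), n_intervals, overlap):
        members = [i for i, v in enumerate(values) if lo <= v <= hi]
        nodes.extend(eps_components(points, members, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Toy cloud: ten points on a line, with the x-coordinate as the lens.
points = [(float(x), 0.0, 0.0) for x in range(10)]
nodes, edges = mapper_graph(points, lens=lambda p: p[0])
```

For the toy line the result is a three-node path graph; in the paper's pipeline a graph of this kind is what the GIN classifies.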
[CV-59] VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency
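[Quick read]: This paper addresses the severe scale drift of calibration-free monocular SLAM (Simultaneous Localization and Mapping) on long trajectories. Motion-agnostic submap partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. The solution has three key parts. First, an optical-flow-guided, motion-aware submap construction mechanism partitions the sequence adaptively, prunes static redundancy, and encapsulates turns to stabilize local geometry. Second, an anchor-driven direct Sim(3) registration strategy uses context-balanced anchors to achieve search-free, pixel-wise dense alignment and efficient loop closure. Third, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, yielding efficient and robust global consistency over kilometer-scale trajectories.
Link: https://arxiv.org/abs/2602.05508
Authors: Zhuang Xiong,Chen Zhang,Qingshan Xu,Wenbing Tao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: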
Abstract:Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
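[CV-60] XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning
[Quick read]: This paper addresses the weak cue-level perception and reasoning of existing Explainable Multimodal Emotion Recognition methods. The main challenges are: 1) general-purpose modality encoders are pretrained for global structure and semantics and are therefore insensitive to fine-grained emotional cues; 2) existing datasets trade annotation quality against scale, which leaves emotional cues under-supervised and limits cue-level reasoning; and 3) existing evaluation metrics cannot adequately measure cue-level reasoning. The key to the solution is a new framework, XEmoGPT, built around two dedicated modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which use purpose-designed tasks to sharpen the video and audio encoders' perception of fine-grained emotional cues. The authors also construct a large-scale annotated dataset, EmoCue, to train cue-level reasoning, and introduce the automated metric EmoCue-360 and the expert-annotated benchmark EmoCue-Eval for evaluation.
Link: https://arxiv.org/abs/2602.05496
Authors: Hanwen Zhang,Yao Liu,Peiyuan Jiang,Lang Junjie,Xie Jun,Yihui He,Yajiao Deng,Siyu Du,Qiao Liu
Affiliations: University of Electronic Science and Technology of China; China Electronics Technology Group Corporation
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: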
Abstract:Explainable Multimodal Emotion Recognition plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
[CV-61] Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence – A report on experiments done in 2014
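[Quick read]: This report addresses the choice of feature detector and descriptor for fisheye images, in the context of self-calibration with car-mounted cameras aimed at the zenith, to support fisheye visual odometry and stereovision in urban scenes. The core difficulty is a chicken-and-egg problem: without an accurate camera projection model it is hard to detect and describe features optimally, and without good features it is impossible to calibrate accurately (that is, to obtain the projection model). The key to the approach is a systematic experimental evaluation of standard feature algorithms on fisheye images, in order to select the detector and descriptor combination best suited to this non-perspective imaging setting and to provide a reliable basis for later geometric reconstruction and localization.
Link: https://arxiv.org/abs/2602.05487
Authors: Julien Moreau(Heudiasyc),S. Ambellouis,Yassine Ruichek(CIAD)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: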
Abstract:What is this report: This is a scientific report, contributing a detailed bibliography, a dataset which we will now call PFSeq for “Photorealistic Fisheye Sequence” and make available at this https URL. 57745/DYIVVU, and comprehensive experiments. This work should be considered a draft, and was done during my PhD thesis “Construction of 3D models from fisheye video data-Application to the localisation in urban area” in 2014 [Mor16]. These results have never been published. The aim was to find the best feature detector and descriptor for fisheye images, in the context of self-calibration, with cameras mounted on the top of a car and aiming at the zenith (to then perform fisheye visual odometry and stereovision in urban scenes). We face a chicken-and-egg problem, because we cannot take advantage of an accurate projection model for optimal feature detection and description, and we need good features to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute a new feature algorithm. It does not compare standard feature algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced, but the experiments have not been run again and the report has not been updated according to the evolution of the state of the art (read this as a 2014 report).
[CV-62] SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing
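[Quick read]: This paper addresses the limitations of current benchmark datasets for multimodal remote sensing image processing, such as a single spatial resolution, insufficient scale, and low alignment accuracy, which restrict the training and generalization of multi-scale foundation models. The key to the solution is SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset of over 1.3 million georeferenced image pairs (512×512 pixels) with global multi-scale coverage from 0.5 m to 10 m and 12 typical land cover categories to ensure scene diversity. To reach high alignment accuracy, the authors design a coarse-to-fine image matching framework that handles multimodal projection deformation and large-scale registration. Experiments show that supervised training on SOMA-1M significantly improves performance on four tasks (image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation), with multimodal image matching reaching the current state of the art, which supports its value as a resource for multimodal remote sensing algorithms and foundation models.
Link: https://arxiv.org/abs/2602.05480
Authors: Peihao Wu,Yongxiang Yao,Yi Wan,Wenfei Zhang,Ruipeng Zhao,Jiayuan Li,Yongjun Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: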
Abstract:Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: this https URL.
[CV-63] Refine and Purify: Orthogonal Basis Optimization with Null-Space Denoising for Conditional Representation Learning
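[Quick read]: This paper addresses two key problems in Conditional Representation Learning: existing methods are sensitive to the choice of subspace basis, which makes performance unstable, and they are vulnerable to interference from non-target semantic subspaces (inter-subspace interference), which contaminates the representation. The core of the solution is the OD-CRL framework, with two main components: 1) Adaptive Orthogonal Basis Optimization (AOBO), which builds an orthogonal semantic basis via singular value decomposition with curvature-based truncation to make the basis more robust; and 2) Null-Space Denoising Projection (NSDP), which projects embeddings onto the null space of irrelevant subspaces to suppress non-target semantic interference, improving the accuracy and generalization of the conditional representations.
Link: https://arxiv.org/abs/2602.05464
Authors: Jiaquan Wang,Yan Lyu,Chen Li,Yuheng Jia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: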
Abstract:Conditional representation learning aims to extract criterion-specific features for customized tasks. Recent studies project universal features onto the conditional feature subspace spanned by an LLM-generated text basis to obtain conditional representations. However, such methods face two key limitations: sensitivity to subspace basis and vulnerability to inter-subspace interference. To address these challenges, we propose OD-CRL, a novel framework integrating Adaptive Orthogonal Basis Optimization (AOBO) and Null-Space Denoising Projection (NSDP). Specifically, AOBO constructs orthogonal semantic bases via singular value decomposition with a curvature-based truncation. NSDP suppresses non-target semantic interference by projecting embeddings onto the null space of irrelevant subspaces. Extensive experiments conducted across customized clustering, customized classification, and customized retrieval tasks demonstrate that OD-CRL achieves a new state-of-the-art performance with superior generalization.
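The two geometric operations named in the abstract, building an orthogonal basis and projecting onto the null space of an irrelevant subspace, can be illustrated with a small plain-Python sketch. It is only an illustration under stated assumptions: Gram-Schmidt orthonormalization is used here in place of the paper's SVD with curvature-based truncation, and the vectors are toy data. The helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # drop (near-)dependent directions
            basis.append([wi / norm for wi in w])
    return basis


def project_onto(x, basis):
    """Orthogonal projection of x onto the span of an orthonormal basis."""
    out = [0.0] * len(x)
    for b in basis:
        c = dot(x, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


def null_space_project(x, irrelevant_basis):
    """Remove the component of x lying in the irrelevant subspace."""
    p = project_onto(x, irrelevant_basis)
    return [xi - pi for xi, pi in zip(x, p)]


# Toy example: the third axis plays the role of a non-target semantic direction.
embedding = [1.0, 2.0, 3.0]
irrelevant = orthonormalize([[0.0, 0.0, 2.0]])
target = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

denoised = null_space_project(embedding, irrelevant)   # third component removed
conditional = project_onto(denoised, target)           # coordinates in the target subspace
```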
[CV-64] Attention Retention for Continual Learning with Vision Transformers AAAI-2026
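[Quick read]: This paper addresses catastrophic forgetting in Continual Learning (CL), where learning a new task severely disrupts or erases knowledge of earlier tasks. The authors identify attention drift in Vision Transformers as a key cause: after a new task is learned, the model's attention to visual concepts from earlier tasks shifts significantly. The core of the solution is an attention-retaining framework that suppresses attention drift through a two-step gradient masking mechanism. First, a layer-wise rollout extracts attention maps for the previous task and produces instance-adaptive binary masks. Second, while training on the new task, these masks are applied during backpropagation to block gradient updates associated with earlier attention regions, protecting the visual concepts already learned. For compatibility with modern optimizers, parameter updates are additionally scaled proportionally to preserve their relative magnitudes. Experiments show that the method mitigates forgetting across a range of continual learning scenarios and reaches state-of-the-art performance.
Link: https://arxiv.org/abs/2602.05454
Authors: Yue Lu,Xiangyu Zhou,Shizhou Zhang,Yinghui Xing,Guoqiang Liang,Wencong Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI-2026 Camera Ready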
Abstract:Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
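The two-step masking idea in the abstract can be sketched on a toy per-patch vector. This is a loose illustration, not the paper's procedure: the abstract does not say how the binary mask is derived from the attention map, so the top-fraction threshold below (`keep_ratio`) is an assumption, and the proportional rescaling of optimizer updates is omitted. Function names are hypothetical.

```python
def binary_mask(attention, keep_ratio=0.5):
    """Mark (with 1) the patches that received the most attention on the previous task.

    Assumption: the top `keep_ratio` fraction of patches is treated as 'previously attended'.
    """
    k = max(1, round(len(attention) * keep_ratio))
    threshold = sorted(attention, reverse=True)[k - 1]
    return [1 if a >= threshold else 0 for a in attention]


def mask_gradient(grad, mask):
    """Zero out gradient entries that fall on previously attended patches."""
    return [0.0 if m else g for g, m in zip(grad, mask)]


# Attention over four patches from the old task, and a new-task gradient per patch.
old_attention = [0.5, 0.1, 0.3, 0.1]
new_grad = [0.2, -0.4, 0.6, 0.8]

mask = binary_mask(old_attention)          # patches 0 and 2 are protected
protected_grad = mask_gradient(new_grad, mask)
```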
[CV-65] DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching CVPR2026
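[Quick read]: This paper addresses the inference bottleneck that high computational cost creates for diffusion models in video generation, and in particular the loss of semantic information and quality that arises with acceleration methods such as Feature Caching and Step-Distillation. The key to the solution is a Distillation-Compatible Learnable Feature Caching mechanism that replaces conventional training-free heuristics with a lightweight neural predictor, capturing the evolution of high-dimensional features more accurately. The authors also introduce a conservative Restricted MeanFlow approach to handle the instability of highly compressed distillation on large-scale video models, reaching an 11.8× speedup while preserving generation quality.
Link: https://arxiv.org/abs/2602.05449
Authors: Chang Zou,Changlin Li,Yang Li,Patrol Li,Jianbing Wu,Xiao He,Songtao Liu,Zhao Zhong,Kailin Huang,Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Tencent Hunyuan; Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 7 figures; CVPR 2026 submission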
Abstract:While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to 11.8× while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
[CV-66] Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations
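[Quick read]: This paper addresses industrial defect detection's dependence on large, high-quality annotated datasets, especially in non-destructive testing (NDT), where real defect samples are scarce and hard to obtain. The key to the solution is a digital twin built on parametric modeling. Rule-based methods first generate 3D mesh models of several typical defect types (such as those common in metal casting), which are embedded into the object geometry to form synthetic defective objects. A physically based Monte Carlo simulation then produces synthetic data that closely resembles real inspection data, with pixel-perfect annotation generated in parallel. The procedure can be extended to other NDT techniques and can include rare defects in sufficient quantity, which benefits the training and generalization of machine learning models for automated defect detection.
Link: https://arxiv.org/abs/2602.05440
Authors: Natascha Jeziorski,Petra Gospodnetić,Claudia Redenbach
Affiliations: RPTU University Kaiserslautern-Landau; Fraunhofer Institute for Industrial Mathematics ITWM, Kaiserslautern
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: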
Abstract:In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.
[CV-67] Stable Velocity: A Variance Perspective on Flow Matching
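[Quick read]: This paper addresses the high variance that arises in Flow Matching training from its reliance on single-sample conditional velocities, which destabilizes optimization and slows convergence. The key to the solution is an explicit analysis of this variance that identifies two regimes: a high-variance regime near the prior, where optimization is difficult, and a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Building on this insight, the authors propose a unified framework, Stable Velocity. For training, it introduces StableVM, an unbiased variance-reduction objective, and Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision. For inference, it exploits the closed-form simplification of the dynamics in the low-variance regime to obtain Stable Velocity Sampling (StableVS), a finetuning-free method that makes sampling more than 2× faster without degrading sample quality.
Link: https://arxiv.org/abs/2602.05435
Authors: Donglin Yang,Yongxing Zhang,Xin Yu,Liang Hou,Xin Tao,Pengfei Wan,Xiaojuan Qi,Renjie Liao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: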
Abstract:While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet 256×256 and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than 2× faster sampling within the low-variance regime without degrading sample quality. Our code is available at this https URL.
[CV-68] LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects
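[Quick read]: This paper addresses fringe-projection-based 3D reconstruction of objects with high reflectivity and low surface roughness, where specular reflection and indirect illumination severely distort or erase the projected fringe patterns. The key to the solution is Latent Diffusion-based Structured Light for Reflective Objects (LD-SLRO). An encoder first extracts latent features from phase-shifted fringe images captured on highly reflective surfaces to characterize surface reflectance. These features are then fed as conditions to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information. Additional components, including a specular reflection encoder, a time-variant channel affine layer, and attention modules, further improve restoration quality, and the method supports flexible configuration of input and output fringe sets. Experiments show clear gains over existing methods, reducing the average root-mean-squared error of 3D reconstruction from 1.8176 mm to 0.9619 mm.
Link: https://arxiv.org/abs/2602.05434
Authors: Sanghoon Jeon,Gihyun Jung,Suhyeon Ka,Jae-Sang Hyun
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 7 figures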
Abstract:Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
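For context on why distorted fringes hurt reconstruction, the sketch below shows textbook N-step phase shifting for a single pixel: with intensities I_k = A + B·cos(φ + 2πk/N), the wrapped phase is φ = atan2(−Σ I_k sin δ_k, Σ I_k cos δ_k). This is general fringe-projection background and not the LD-SLRO model. The clipping used to imitate specular saturation is a crude assumption chosen only to show that corrupted fringes bias the recovered phase; function names and parameter values are hypothetical.

```python
import math


def simulate_pixel(phase, n_steps=4, offset=0.5, amplitude=0.4):
    """Ideal fringe intensities at one pixel for equally spaced phase shifts."""
    return [offset + amplitude * math.cos(phase + 2 * math.pi * k / n_steps)
            for k in range(n_steps)]


def wrapped_phase(intensities):
    """Recover the wrapped phase from N >= 3 equally shifted intensities."""
    n = len(intensities)
    shifts = [2 * math.pi * k / n for k in range(n)]
    s = sum(i * math.sin(d) for i, d in zip(intensities, shifts))
    c = sum(i * math.cos(d) for i, d in zip(intensities, shifts))
    return math.atan2(-s, c)


true_phase = 1.0
clean = simulate_pixel(true_phase)
clean_error = abs(wrapped_phase(clean) - true_phase)

# Crude stand-in for specular saturation: the camera clips bright values at 0.8.
saturated = [min(i, 0.8) for i in clean]
saturated_error = abs(wrapped_phase(saturated) - true_phase)
```

On clean fringes the phase is recovered to numerical precision, while the clipped fringes give a visibly biased phase, which is the kind of error that fringe restoration aims to remove before phase retrieval.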
[CV-69] M2-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining ICLR2026
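[Quick read]: This paper addresses three challenges in building high-performing graphical user interface (GUI) agents: high data construction cost, poor data quality, and low data richness. The core of the solution is M²-Miner, a low-cost, automated framework based on Monte Carlo Tree Search (MCTS) for mining mobile GUI interaction trajectories. Its key contributions are: 1) a collaborative multi-agent architecture (InferAgent, OrchestraAgent, and JudgeAgent) responsible for guidance, acceleration, and quality evaluation, respectively; 2) an intent recycling strategy that improves mining efficiency and enriches intent diversity; and 3) a progressive model-in-the-loop training strategy that raises the success rate of data mining. Experiments show that GUI agents fine-tuned on the mined data reach state-of-the-art performance on several widely used mobile GUI benchmarks.
Link: https://arxiv.org/abs/2602.05429
Authors: Rui Lv,Juncheng Mo,Tianyi Chu,Chen Rao,Hongyi Jing,Jiajie Teng,Jiafu Chen,Shiqi Zhang,Liangzi Ding,Shuo Fang,Huaizhong Lin,Ziqiang Dang,Chenguang Ma,Lei Zhao
Affiliations: Ant Group; Zhejiang University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2026. Supplementary material is included at the end of the main paper (16 pages, 15 figures, 2 tables)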
Abstract:Graphical User Interface (GUI) agents are pivotal to advancing intelligent human-computer interaction paradigms. Constructing powerful GUI agents necessitates the large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M²-Miner, the first low-cost and automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). For better data mining efficiency and quality, we present a collaborative multi-agent framework, comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance the efficiency of mining and enrich intent diversity, we design an intent recycling strategy to extract extra valuable interaction trajectories. Additionally, a progressive model-in-the-loop training strategy is introduced to improve the success rate of data mining. Extensive experiments have demonstrated that the GUI agent fine-tuned using our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate the community research.
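The abstract builds on Monte Carlo Tree Search but does not state which selection rule M²-Miner uses. As background only, the sketch below shows the standard UCT rule often used in MCTS, which balances the average value of an action against how rarely it has been tried. The action names, statistics, and exploration constant are hypothetical, and nothing here is taken from the paper's implementation.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT: mean value plus an exploration bonus; unvisited nodes go first."""
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits):
    """Pick the child action with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))


# Hypothetical GUI actions available from one screen state.
children = [
    {"action": "tap_search", "value": 3.0, "visits": 5},
    {"action": "scroll_down", "value": 1.0, "visits": 1},
    {"action": "open_menu", "value": 0.0, "visits": 0},
]
first_pick = select_child(children, parent_visits=6)

# Once every action has been tried, the rarely visited one still gets a large bonus.
visited_only = children[:2]
second_pick = select_child(visited_only, parent_visits=6)
```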
[CV-70] Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications
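[Quick read]: This paper addresses the limited anomaly detection performance of deep learning models when annotated data are lacking in cross-domain settings, especially for early disease diagnosis in medical images (such as brain MRI, liver CT, and retina OCT) and defect detection in industrial images (such as MVTec AD). The key to the solution is Multi-AD, a multi-scale convolutional neural network (CNN) built on a teacher-student (T-S) architecture. It uses squeeze-and-excitation (SE) channel attention to strengthen feature extraction, knowledge distillation (KD) to transfer discriminative features from the teacher to the student so that differences between normal and anomalous samples can be learned, and a discriminator network to further sharpen the decision boundary. At inference, multi-scale features are fused to localize anomalies of different sizes, giving strong generalization and accurate anomaly detection across domains.
Link: https://arxiv.org/abs/2602.05426
Authors: Wahyu Rahmaniar,Kenji Suzuki
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 8 figures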
Abstract:Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model’s capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.
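A common way to turn a teacher-student pair into an anomaly detector is to score how much the student's features disagree with the teacher's at each scale. The abstract does not give Multi-AD's exact scoring function, so the sketch below is a generic illustration under that assumption: it averages the cosine distance between teacher and student feature vectors over scales. Feature values and function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def anomaly_score(teacher_feats, student_feats):
    """Mean teacher-student disagreement across scales (higher = more anomalous)."""
    distances = [cosine_distance(t, s) for t, s in zip(teacher_feats, student_feats)]
    return sum(distances) / len(distances)


# Two scales of features at one image location (toy values).
teacher = [[1.0, 0.0], [0.5, 0.5, 0.0]]
student_normal = [[1.0, 0.0], [0.5, 0.5, 0.0]]       # student reproduces the teacher
student_anomalous = [[0.0, 1.0], [0.0, 0.0, 1.0]]    # student fails on unseen patterns

normal_score = anomaly_score(teacher, student_normal)
anomalous_score = anomaly_score(teacher, student_anomalous)
```

The intuition is that a student distilled only on normal data matches the teacher on normal inputs and diverges on anomalies, so the disagreement map highlights anomalous regions.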
[CV-71] NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks
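[Quick read]: This paper addresses the difficulty, in current 3D reconstruction systems, of obtaining accurate camera poses, reliable depth maps, high-quality novel view synthesis (NVS), and accurate 3D surfaces at the same time, especially from casually captured multi-view RGB images. The key to the solution is NeVStereo, a NeRF (Neural Radiance Field)-driven joint optimization architecture with four cooperating modules: NeRF-based NVS that produces stereo-friendly renderings to improve depth estimation; confidence-guided multi-view depth estimation; NeRF-coupled bundle adjustment for pose refinement; and an iterative refinement stage that updates both the depth maps and the radiance field. Together these improve geometric consistency and mitigate common NeRF issues such as surface stacking, artifacts, and pose-depth coupling.
Link: https://arxiv.org/abs/2602.05423
Authors: Pengcheng Chen,Yue Hu,Wenhao Li,Nicole M Gunderson,Andrew Feng,Zhenglong Sun,Peter Beerel,Eric J Seibel
Affiliations: University of Washington; University of Southern California; CUHK-Shenzhen; USC Institute for Creative Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: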
Abstract:In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly output novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing prestigious methods.
[CV-72] Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring ICLR2026
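[Quick read]: This paper addresses instance segmentation of complex, dense cell regions in digital pathology, where existing methods based on contour detection and distance mapping perform poorly on densely overlapping cells. The authors show that real-world cell adjacency graphs are mostly non-bipartite and contain many odd-length cycles (mainly triangles), so simple 2-coloring theory fails, while higher-chromaticity models introduce representational redundancy and optimization difficulties. The key to the solution is Disco, an adjacency-aware collaborative coloring framework based on the "divide and conquer" principle. First, an "Explicit Marking" strategy turns the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a conflict set. Second, an "Implicit Disambiguation" mechanism enforces feature dissimilarity between different instances in conflict regions, so that the model learns separable feature representations and resolves complex adjacency conflicts.
Link: https://arxiv.org/abs/2602.05420
Authors: Rui Sun,Yiwen Yang,Kaiyu Guo,Chen Jiang,Dongli Xu,Zhaonan Liu,Tan Pan,Limei Han,Xue Jiang,Wu Wei,Yuan Cheng
Affiliations: Shanghai Academy of Artificial Intelligence for Science; Fudan University; Shanghai Innovation Institute; School of Life Sciences and Biotechnology, Shanghai Jiao Tong University; Lingang Laboratory; Renji Hospital, School of Medicine, Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures; ICLR 2026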
Abstract:Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the “divide and conquer” principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, the “Explicit Marking” strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a “conflict set.” Second, the “Implicit Disambiguation” mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.
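The observation that triangles make cell adjacency graphs non-bipartite can be checked with a standard breadth-first 2-coloring. The sketch below is a generic graph routine and not part of Disco: it returns a valid 2-coloring when one exists and `None` as soon as two adjacent cells are forced to share a color, which happens exactly when the graph contains an odd-length cycle. The adjacency dictionaries are toy examples.

```python
from collections import deque


def two_color(adjacency):
    """Return {node: 0 or 1} if the graph is bipartite, otherwise None."""
    color = {}
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in color:
                    color[v] = 1 - color[u]   # neighbours get the opposite color
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # odd cycle: 2-coloring is impossible
    return color


# Three cells in a row: neighbours can alternate between two colors.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
chain_coloring = two_color(chain)

# Three mutually touching cells form a triangle, which defeats 2-coloring.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
triangle_coloring = two_color(triangle)
```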
[CV-73] VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection
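[Quick read]: This paper addresses the degradation of Out-of-Distribution (OOD) detection under long-tailed distributions, where the scarcity of tail-class samples blurs decision boundaries in the feature space. State-of-the-art methods usually rely on large external real datasets (such as 80 Million Tiny Images) for Outlier Exposure (OE) regularization, which is limited in practice by data acquisition cost and privacy sensitivity. The paper proposes a framework that needs no external data. Its key component is a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models feature statistics with the von Mises-Fisher (vMF) distribution on a hypersphere and samples virtual outliers directionally within a low-likelihood annulus. It also introduces a Dual-Granularity Semantic Loss (DGS) that uses contrastive learning to increase the separation between In-Distribution (ID) features and the synthesized boundary outliers, outperforming existing methods without external data.
Link: https://arxiv.org/abs/2602.05415
Authors: Ningkang Peng,Qianfeng Yu,Yuhao Zhang,Yafei Liu,Xiaoqian Peng,Peirong Ma,Yi Chen,Peiheng Li,Yanhui Gu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: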
Abstract:Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (SOTA) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms SOTA approaches that utilize external real images.
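[CV-74] TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions AAAI-26
[Quick Read]: This paper tackles occluded vehicle detection under extreme weather. Existing datasets are mostly limited to ordinary conditions such as light haze, rain, and snow, and so fail to capture the video quality degradation and traffic flow disruption seen in real, complex traffic scenes. The key to the solution is a comprehensive benchmark dataset, the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW). It comprises over 32 hours of real-world traffic video from densely populated urban areas, with more than 48,000 manually annotated frames and 3.2 million semi-automatically labeled frames, covers eight classes of traffic participants from large vehicles to micromobility devices, and offers complex scenes with varied viewpoints, road types, and scales. This gives research on the robustness of occluded object detection in adverse weather a high-quality basis for training and evaluation.
Link: https://arxiv.org/abs/2602.05414
Authors: Ngoc Doan-Minh Huynh,Duong Nguyen-Ngoc Tran,Long Hoang Pham,Tai Huu-Phuong Tran,Hyung-Joon Jeon,Huy-Hung Nguyen,Duong Khac Vu,Hyung-Min Jeon,Son Hong Phan,Quoc Pham-Nam Ho,Chi Dai Tran,Trinh Le Ba Khanh,Jae Wook Jeon
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)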
Abstract:Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames; bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring, pave the way for new research and applications. The TSBOW dataset is publicly available at: this https URL.
[CV-75] Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing
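[Quick Read]: This paper addresses the poor interpretability of pathomics features in digital pathology and the difficulty of understanding them across clinical contexts. In particular, existing conditional diffusion models assume feature independence when editing features, which pushes edits off the biological manifold and yields unrealistic images. The key to the solution is a Manifold-Aware Diffusion (MAD) framework: a variational auto-encoder (VAE) learns a disentangled latent space, and feature trajectories are regularized within that space, so that editing a target feature automatically adjusts correlated attributes to stay within the distribution of real cells. This enables controllable and biologically plausible editing of cell nuclei images.
Link: https://arxiv.org/abs/2602.05397
Authors: Yuechen Yang,Junlin Guo,Ruining Deng,Junchao Zhu,Zhengyi Lu,Chongyu Qu,Yanfan Zhu,Xingyi Guo,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: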
Abstract:Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., “second-order moment”) remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence, an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.
[CV-76] Dataset Distillation via Relative Distribution Matching and Cognitive Heritage
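[Quick Read]: This paper addresses the excessive computational and memory overhead of dataset distillation when pre-trained self-supervised models serve as backbones. Existing methods rely on linear gradient matching, which must load thousands of real images and apply multiple rounds of differentiable augmentation at every distillation step, making them inefficient. The key to the solution is the **Statistical Flow Matching** framework, which optimizes synthetic images by aligning their statistics with the constant statistical flows from target class centers to non-target class centers in the original data. It loads the original data statistics only once and performs a single augmentation pass on the synthetic data, matching or exceeding state-of-the-art performance with 10x lower GPU memory usage and 4x shorter runtime. In addition, a classifier inheritance strategy reuses the classifier trained on the original dataset and needs only a lightweight linear projector to deliver substantial performance gains.
Link: https://arxiv.org/abs/2602.05391
Authors: Qianxin Xia,Jiawei Du,Yuhan Zhang,Jielei Wang,Guoming Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: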
Abstract:Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
[CV-77] Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning
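[Quick Read]: This paper addresses the fact that magnetic resonance imaging (MRI) cannot be used directly for dose calculation in radiotherapy planning because it lacks electron density information. Current clinical workflows therefore depend on combined MRI and CT acquisition, which introduces registration uncertainty and procedural complexity. The key to the solution is a 3D architecture named Parallel Swin Transformer-Enhanced Med2Transformer, which combines convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation markedly improves the anatomical fidelity and geometric accuracy of the generated synthetic CT images, and the method reaches acceptable dosimetric performance (a mean target dose error of 1.69%).
Link: https://arxiv.org/abs/2602.05387
Authors: Zolnamar Dorjsembe,Hung-Yi Chen,Furen Xiao,Hsing-Kuo Pao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: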
Abstract:MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: this https URL.
[CV-78] Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting
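[Quick Read]: This paper addresses model fragmentation and limited processing capability in document image parsing. In particular, existing two-stage methods depend on axis-aligned bounding boxes for layout detection and handle distorted or photographed document images poorly. The key to the solution is the Dolphin-v2 model. In the first stage, it jointly performs document type classification (digital-born vs. photographed) and layout analysis, and for digital-born documents it further performs fine-grained element detection with reading order prediction. In the second stage, it applies a hybrid parsing strategy: photographed documents are parsed end to end as whole pages to cope with geometric distortion, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, which improves the efficiency and accuracy of content extraction. The approach markedly improves robustness on photographed documents (a 91% error reduction), adds new capabilities such as recognition of 21 fine-grained element categories and indentation-preserving code block recognition, and improves overall performance on OmniDocBench by 14.78 points.
Link: https://arxiv.org/abs/2602.05384
Authors: Hao Feng,Wei Shi,Ke Zhang,Xiang Fei,Lei Liao,Dingkang Yang,Yongkun Du,Xuecheng Wu,Jingqun Tang,Yang Liu,Hong Chen,Can Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: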
Abstract:Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
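The second-stage routing described in the abstract can be sketched in a few lines of Python. This is only an illustration of the control flow, not Dolphin-v2's code: `parse_document`, `parse_page`, `parse_element`, and the string labels for document types are hypothetical stand-ins, and a thread pool is just one possible way to run element-wise parsing in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_document(doc_type, page, anchors, parse_page, parse_element):
    """Route a page by document type (all callables are hypothetical).

    - photographed: parse the whole page in one call.
    - anything else (treated as digital-born): parse each layout anchor
      independently and in parallel, keeping the given reading order.
    """
    if doc_type == "photographed":
        return [parse_page(page)]
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in the order of the input anchors.
        return list(pool.map(lambda anchor: parse_element(page, anchor), anchors))


# Stub parsers, used only to show the two paths.
whole = parse_document("photographed", "img", ["title"],
                       lambda p: "PAGE:" + p, lambda p, a: a.upper())
parts = parse_document("digital-born", "img", ["title", "para"],
                       lambda p: "PAGE:" + p, lambda p, a: a.upper())
```

Because the per-element calls are independent, the digital-born path parallelizes naturally, while the photographed path trades that parallelism for a single page-level call.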
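[CV-79] VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
[Quick Read]: This paper addresses the unreliable performance of current Vision Language Models (VLMs) on nonverbal reasoning tasks, in particular their limitations in abstract reasoning and natural-image reasoning. The key to the solution is a new benchmark named VRIQ (Visual Reasoning IQ), which systematically evaluates the visual reasoning ability of VLMs and introduces fine-grained diagnostic probes that separate the contributions of the perception and reasoning modules. The analysis shows that about 56% of failures stem from perception alone, 43% from perception and reasoning together, and only 1% from reasoning alone. The bottleneck of current VLMs in visual reasoning therefore lies mainly in perception rather than in the reasoning mechanism itself, which offers an interpretable and actionable direction for improving visual reasoning in multimodal systems.
Link: https://arxiv.org/abs/2602.05382
Authors: Tina Khezresmaeilzadeh,Jike Zhong,Konstantinos Psounis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: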
Abstract:Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
[CV-80] SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
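[Quick Read]: This paper addresses the difficulty of aligning diffusion models with human preferences, especially when no auxiliary reward model is available or large-scale preference datasets are hard to obtain. The core challenge is achieving efficient and stable alignment from only a very small number of human-annotated preference pairs. The key to the solution is the SAIL (Self-Amplified Iterative Learning) framework, in which the diffusion model acts as its own teacher and improves itself iteratively: starting from a minimal seed set of human-annotated preference pairs, the model works in a closed loop, progressively generating diverse samples, self-annotating preferences based on its evolving understanding, and refining itself on the expanded, self-augmented dataset. To prevent catastrophic forgetting and keep learning stable, the paper introduces a ranked preference mixup strategy that balances exploration against the initial human priors. Experiments show that SAIL outperforms current state-of-the-art methods with only 6% of the preference data required by existing approaches, revealing the latent capacity of diffusion models for self-improvement.
Link: https://arxiv.org/abs/2602.05380
Authors: Xiaoxuan He,Siming Fu,Wanli Li,Zhiyuan Li,Dacheng Yin,Kang Rong,Fengyun Rao,Bo Zhang
Affiliations: Zhejiang University; WeChat Vision, Tencent Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: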
Abstract:Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
[CV-81] Erase at the Core: Representation Unlearning for Machine Unlearning
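[Quick Read]: This paper addresses the "superficial forgetting" found in existing approximate machine unlearning methods: the model appears to forget the forget set well at the output (logit) level, for example reaching near-zero accuracy, yet its internal feature representations still retain information highly similar to the original model, creating a risk of privacy leakage. The key to the solution is the Erase at the Core (EC) framework, which attaches auxiliary modules to the intermediate layers of the network and combines multi-layer contrastive unlearning with deeply supervised learning. It applies both a contrastive loss and a cross-entropy loss at each supervision point with layer-wise weighting, enforcing forgetting across the whole network from lower to higher layers. This substantially reduces the similarity between intermediate representations and those of the original model while leaving performance on the retain set intact.
Link: https://arxiv.org/abs/2602.05375
Authors: Jaewon Lee,Yongwoo Kim,Donghyun Kim
Affiliations: Korea University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: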
Abstract:Many approximate machine unlearning methods demonstrate strong logit-level forgetting – such as near-zero accuracy on the forget set – yet continue to preserve substantial information within their internal feature representations. We refer to this discrepancy as superficial forgetting. Recent studies indicate that most existing unlearning approaches primarily alter the final classifier, leaving intermediate representations largely unchanged and highly similar to those of the original model. To address this limitation, we introduce the Erase at the Core (EC), a framework designed to enforce forgetting throughout the entire network hierarchy. EC integrates multi-layer contrastive unlearning on the forget set with retain set preservation through deeply supervised learning. Concretely, EC attaches auxiliary modules to intermediate layers and applies both contrastive unlearning and cross-entropy losses at each supervision point, with layer-wise weighted losses. Experimental results show that EC not only achieves effective logit-level forgetting, but also substantially reduces representational similarity to the original model across intermediate layers. Furthermore, EC is model-agnostic and can be incorporated as a plug-in module into existing unlearning methods, improving representation-level forgetting while maintaining performance on the retain set.
[CV-82] Imagine a City: CityGenAgent for Procedural 3D City Generation
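[Quick Read]: This paper addresses the shortcomings of automated 3D city generation in high-fidelity asset creation, controllability, and editability, particularly for applications such as autonomous driving, virtual reality, and embodied intelligence that demand realism and semantic consistency. The key to the solution is CityGenAgent, a natural language-driven framework for hierarchical procedural generation. Its core innovation is to decompose city generation into two interpretable components, the Block Program and the Building Program, and to secure structural correctness and semantic alignment through a two-stage learning strategy. First, Supervised Fine-Tuning (SFT) ensures that generated programs satisfy geometric constraints such as non-self-intersecting polygons and complete fields. Then Reinforcement Learning (RL) uses a Spatial Alignment Reward and a Visual Consistency Reward to strengthen the model's spatial reasoning and the alignment between the textual and visual modalities. The method supports natural language editing and manipulation and markedly improves the semantic accuracy, visual quality, and controllability of generated cities.
Link: https://arxiv.org/abs/2602.05362
Authors: Zishan Liu,Zecong Tang,RuoCheng Wu,Xinzhe Zheng,Jingyu Hu,Ka-Hei Hui,Haoran Xie,Bo Dai,Zhengzhe Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: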
Abstract:The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models’ generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
[CV-83] Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection
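[Quick Read]: This paper addresses a counter-intuitive phenomenon in current feature-based post-hoc methods for out-of-distribution (OOD) detection, the "Simplicity Paradox": existing state-of-the-art (SOTA) models are sensitive to OOD samples with subtle semantic differences, yet show severe Geometric Blindness toward samples that are structurally distinct but semantically simple, or toward high-frequency sensor noise. The authors attribute this to Semantic Hegemony in the deep feature space and use Neural Collapse theory to expose its mathematical essence: a spectral concentration bias induced by the high variance of the principal subspace numerically masks the structural distribution shift signals that should be significant in the residual subspace. The key to the solution is D-KNN, a training-free, plug-and-play geometric decoupling framework. Its core mechanism uses orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model's sensitivity to weak residual signals, effectively breaking Semantic Hegemony and setting new SOTA performance on the CIFAR and ImageNet benchmarks.
Link: https://arxiv.org/abs/2602.05360
Authors: Ningkang Peng,Xiaoqian Peng,Yuhao Zhang,Qianfeng Yu,Feng Xing,Peirong Ma,Xichen Yang,Yi Chen,Tingyu Lu,Yanhui Gu
Affiliations: Nanjing Normal University; Nanjing University of Chinese Medicine; Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: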
Abstract:While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model’s sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.
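The orthogonal decomposition mentioned in the abstract can be illustrated with a small Python sketch. It is not the D-KNN implementation and omits the dual-space calibration entirely: it only shows how a feature splits into a component inside a principal subspace and an orthogonal residual, and how a k-th nearest neighbor distance could then be measured in each space separately. The helper names `decompose` and `kth_nn_distance` are hypothetical, and the principal basis is assumed to be given and orthonormal.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(x, basis):
    """Split x into its projection onto span(basis) and the orthogonal residual.

    Assumption: basis is a list of orthonormal vectors spanning the
    principal subspace (for example, leading PCA directions).
    """
    principal = [0.0] * len(x)
    for u in basis:
        c = dot(x, u)
        principal = [p + c * ui for p, ui in zip(principal, u)]
    residual = [xi - pi for xi, pi in zip(x, principal)]
    return principal, residual


def kth_nn_distance(query, bank, k):
    """Distance from query to its k-th nearest neighbor in bank (k is 1-based)."""
    distances = sorted(math.dist(query, b) for b in bank)
    return distances[k - 1]


principal, residual = decompose([3.0, 4.0, 5.0],
                                [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Scoring the two parts separately is what allows a large residual-space distance to remain visible even when the principal-space distance is small, which is the situation the abstract describes as masked.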
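[CV-84] Multimodal Latent Reasoning via Hierarchical Visual Cues Injection
[Quick Read]: This paper addresses the reliance of multimodal large language models (MLLMs) on a "fast thinking" paradigm during reasoning: models typically use end-to-end generation or explicit, language-centric chains of thought (CoT), which makes reasoning inefficient, verbose, and prone to hallucination. The key to the solution is a multimodal latent reasoning framework based on HIerarchical Visual cuEs injection (HIVE). It recursively extends transformer blocks to build an internal iterative loop and injects hierarchical visual cues, from global scene context to fine-grained regional details, directly into the model's latent representations. This enables "slow thinking" that does not depend on superficial textual rationales and lets the model perform grounded, multi-step inference within the aligned latent space.
Link: https://arxiv.org/abs/2602.05359
Authors: Yiming Zhang,Qiangyu Yan,Borui Jiang,Kai Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: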
Abstract:The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a “fast thinking” paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, “slow thinking” without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model’s latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model’s understanding of complex scenes.
[CV-85] Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection
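[Quick Read]: This paper addresses two fundamental flaws of existing prototype-based representation learning methods for out-of-distribution (OOD) detection. The first is the Static Homogeneity Assumption, under which every class is given the same fixed number of prototypes, leading to prototype collision. The second is the Learning-Inference Disconnect, in which rich information about prototype quality is discarded at inference. To address them, the paper proposes the APEX (Adaptive Prototype for eXtensive OOD Detection) framework, with two key innovations: (1) an Adaptive Prototype Manifold (APM), which uses the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $ K_c^* $ for each class and thereby mitigates prototype collision at its root; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies the cohesion and separation of prototypes to bridge the gap between learning and inference and significantly improves model performance.
Link: https://arxiv.org/abs/2602.05349
Authors: Ningkang Peng,JiuTao Zhou,Yuhao Zhang,Xiaoqian Peng,Qianfeng Yu,Linjing Qian,Tingyu Lu,Yi Chen,Yanhui Gu
Affiliations: Nanjing Normal University; Nanjing University of Chinese Medicine; Tohoku University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: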
Abstract:Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. Specifically, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model’s capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity K_c^* for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.
[CV-86] Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
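[Quick Read]: This paper addresses a weakness of current concept erasure methods for text-to-image diffusion models: they mostly focus on removing unsafe concepts (such as harmful content) without guiding generation toward corresponding safe alternatives, so the erased outputs struggle to remain structurally and semantically consistent with the original images. The key to the solution is a new framework named PAIRed Erasing (PAIR), which recasts concept erasure as consistency-preserving semantic realignment based on unsafe-safe pairs. Its core innovations are: (1) Paired Semantic Realignment, which uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices from unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. The approach achieves fine-grained concept erasure while preserving overall semantic consistency, structural integrity, and generation quality.
Link: https://arxiv.org/abs/2602.05339
Authors: Yongwoo Kim,Sungmin Cha,Hyunsoo Kim,Jaewon Lee,Donghyun Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: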
Abstract:With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
[CV-87] MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors
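[Quick Read]: This paper addresses the performance bottleneck in panoramic scene understanding caused by the scarcity of high-resolution multi-task annotations, geometric distortion, and coordinate system discrepancies, as well as the underexplored relations between dense prediction tasks in spherical space. The key to the solution is MTPano, a robust multi-task panoramic foundation model trained with a label-free pipeline. First, it exploits strong dense priors from the perspective domain: panoramic images are projected into perspective patches, off-the-shelf foundation models generate pseudo-labels free of domain gap, and these are re-projected to provide patch-wise supervision. Second, to handle interference between task types, tasks are grouped into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) categories, and a Panoramic Dual BridgeNet disentangles the feature streams through geometry-aware modulation layers that inject absolute position and ray direction priors. Finally, to mitigate the distortion of equirectangular projection (ERP), it introduces ERP token mixers together with a dual-branch BridgeNet with gradient truncation, promoting cross-task information sharing while blocking conflicting gradients caused by incompatible task attributes.
Link: https://arxiv.org/abs/2602.05330
Authors: Jingdong Zhang,Xiaohang Zhan,Lingzhi Zhang,Yizhou Wang,Zhengming Yu,Jionghao Wang,Wenping Wang,Xin Li
Affiliations: Texas A&M University; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: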
Abstract:Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
[CV-88] Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning
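[Quick Read]: This paper addresses the limitations of existing visual geometry reconstruction methods on wide field-of-view cameras such as fisheye or panoramic cameras. These limitations arise because model architectures and training data are designed only for pinhole cameras or rectified perspective images, so real-world use requires additional careful calibration and undistortion. The key to the solution is Wid3R, a feed-forward neural network that supports wide field-of-view camera models. Its core innovations are a ray representation based on spherical harmonics and a novel camera model token embedded in the network, which together make reconstruction distortion-aware. Wid3R is also the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360° imagery. It shows strong zero-shot robustness and improves performance on the Stanford2D3D dataset by up to +77.33.
Link: https://arxiv.org/abs/2602.05321
Authors: Dongki Jung,Jaehoon Choi,Adil Qureshi,Somi Jeong,Dinesh Manocha,Suyong Yeon
Affiliations: NAVER LABS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: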
Abstract:We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
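To make the idea of a per-pixel ray concrete for 360 imagery, here is a minimal Python sketch that maps a pixel of an equirectangular image to a unit ray direction. It is not Wid3R's ray representation (which the abstract says uses spherical harmonics); the function name `erp_pixel_to_ray` and the axis convention (y up, z forward at the image center) are assumptions chosen for illustration.

```python
import math


def erp_pixel_to_ray(u, v, width, height):
    """Map continuous pixel coordinates of an equirectangular image to a unit ray.

    Assumed convention: u grows to the right, v grows downward,
    the image center looks along +z, and +y points up.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi  # longitude in [-pi, pi]
    lat = (0.5 - v / height) * math.pi       # latitude in [-pi/2, pi/2]
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


center_ray = erp_pixel_to_ray(512, 256, 1024, 512)
top_ray = erp_pixel_to_ray(100, 0, 1024, 512)
```

Every pixel in the top row maps to almost the same upward direction, which illustrates why equirectangular images are heavily stretched near the poles and why a ray-based description of the camera is more uniform than pixel coordinates.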
[CV-89] Fast-SAM3D: 3Dfy Anything in Images but Faster
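[Quick read]: This paper tackles the prohibitive inference latency that SAM3D incurs when performing scalable, open-world 3D reconstruction from complex scenes. Generic acceleration strategies are brittle in this setting because they ignore the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. The key to the solution is Fast-SAM3D, a training-free framework that dynamically aligns computation with generation complexity through three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching, which decouples structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving, which concentrates refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation, which adapts the decoding resolution. Experiments show that Fast-SAM3D delivers up to a 2.67× end-to-end speedup with nearly lossless reconstruction quality, establishing a new efficiency frontier for single-view 3D generation.
Link: https://arxiv.org/abs/2602.05293
Authors: Weilun Feng,Mingqiang Wu,Zhiliang Chen,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiaokun Liu,Guoxin Fan,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: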
Abstract:SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline’s inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at this https URL.
[CV-90] Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs
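[Quick read]: This paper addresses the high computational cost that Multimodal Large Language Models (MLLMs) incur in universal multimodal retrieval because they process a large number of visual input tokens. The key to the solution is Magic-MM-Embedding, whose core innovations are: (1) an efficient MLLM architecture that uses visual token compression to substantially reduce inference latency and memory footprint; and (2) a multi-stage progressive training strategy that first restores multimodal understanding through large-scale continued pretraining, then strengthens discriminative power through contrastive learning and hard negative mining, and finally applies task-aware fine-tuning guided by an MLLM-as-a-Judge for precise data curation. The result is state-of-the-art multimodal embedding quality at high efficiency.
Link: https://arxiv.org/abs/2602.05275
Authors: Qi Li,Yanzhe Zhao,Yongxin Zhou,Yameng Wang,Yandong Yang,Yuanjia Zhou,Jue Wang,Zuojian Wang,Jinxiang Liu
Affiliations: Honor Device Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: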
Abstract:Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
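The contrastive stage with hard negative mining can be pictured with a standard InfoNCE loss, sketched below on plain Python lists. This is a generic formulation and not necessarily the paper's training objective; the function name, the temperature value, and the toy embeddings in the usage note are assumptions.

```python
import math


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: -log softmax score of the positive.

    Embeddings are plain lists and are assumed to be L2-normalised already.
    The temperature of 0.05 is an illustrative default, not a reported value.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]
```

With a query `[1.0, 0.0]` and the same vector as the positive, a negative close to the query such as `[0.9, 0.436]` yields a larger loss than an orthogonal one such as `[0.0, 1.0]`, which is why mined hard negatives carry more training signal.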
[CV-91] Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning
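[Quick read]: This paper addresses the difficulty, in Few-Shot Class-Incremental Learning (FSCIL), of absorbing new knowledge and improving global discriminative power when data is extremely scarce. Traditional methods rely on a frozen pre-trained feature extractor to produce static class prototypes, which are subject to the backbone's inherent representation bias; recent prompt-based tuning methods adapt the backbone but remain limited by model capacity under extreme data scarcity. The key innovation is a shift in perspective: freeze the feature extractor and fine-tune only the prototypes, on the view that the core challenge of FSCIL is not feature acquisition but the optimization of decision regions within a static, high-quality feature space. The authors propose an efficient prototype fine-tuning framework with a dual-calibration method (class-specific and task-aware) that turns static centroids into dynamic, learnable components, markedly improving discrimination of incrementally added classes and achieving leading results on multiple benchmarks with very few learnable parameters.
Link: https://arxiv.org/abs/2602.05271
Authors: Shengqin Jiang,Xiaoran Feng,Yuankai Qi,Haokui Zhang,Renlong Hang,Qingshan Liu,Lina Yao,Quan Z. Sheng,Ming-Hsuan Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review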
Abstract:Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model’s capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
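To make the idea of tuning prototypes rather than features concrete, here is a minimal sketch in which a static centroid is shifted by a class-specific offset and a task-aware offset before nearest-prototype classification. All names are hypothetical, the additive form of the calibration is an assumption, and the offsets are set by hand here, whereas in the paper they would be learned.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def calibrated_prototype(centroid, class_offset, task_offset):
    """Shift a frozen centroid by a per-class and a per-task offset.

    The additive combination is an illustrative assumption.
    """
    return [c + dc + dt for c, dc, dt in zip(centroid, class_offset, task_offset)]


def classify(feature, prototypes):
    """Return the class whose prototype is most cosine-similar to the feature.

    The feature comes from a frozen extractor; only prototypes change.
    """
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))
```

For a feature `[0.0, 1.0]` with prototypes `old = [0.6, 0.8]` and `new = [1.0, 1.0]`, the static prototypes pick `old`; after shifting `new` by offsets `[-0.8, 0.0]` and `[0.0, 0.2]`, the same frozen feature is assigned to `new`, so the decision region moved without touching the backbone.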
[CV-92] ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
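[Quick read]: This paper addresses the difficulty lightweight models have in balancing accuracy and latency on high-resolution images; Transformer-based architectures in particular often suffer excessive latency because of their computational complexity. The core of the solution is ReGLA, a lightweight hybrid network with three key innovations: (1) the Efficient Large Receptive Field (ELRF) module, which improves convolutional efficiency while preserving a large receptive field; (2) the ReLU Gated Modulated Attention (RGMA) module, which enhances local feature representation while maintaining linear computational complexity; and (3) a multi-teacher distillation strategy that boosts downstream-task performance. Experiments show that ReGLA-M reaches 80.85% Top-1 accuracy on ImageNet-1K with a latency of only 4.98 ms at 512px input, and that it outperforms comparable models on tasks such as COCO and ADE20K, demonstrating strong high-resolution visual modeling.
Link: https://arxiv.org/abs/2602.05262
Authors: Junzhou Li,Manqi Zhao,Yilin Gao,Zhiheng Yu,Yin Li,Dongsheng Jiang,Li Xiao
Affiliations: University of Science and Technology of China; Huawei Technologies Co., Ltd.; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures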
Abstract:Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks that integrate efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224px, with only 4.98 ms latency at 512px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
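The linear-complexity claim rests on a general property of kernelized attention: with a ReLU feature map, the key-value summary can be computed once and reused for every query. The sketch below shows that generic ReLU linear attention on plain lists. It omits the gating and modulation of the RGMA module, whose exact form is not given in the abstract, and the function names are assumptions.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with a ReLU feature map.

    Q and K are N x d lists, V is an N x m list. Cost is O(N * d * m)
    instead of the O(N^2) of softmax attention, because the d x m summary
    `kv` is built once and shared by all queries.
    """
    d, m = len(K[0]), len(V[0])
    kv = [[0.0] * m for _ in range(d)]   # sum_j relu(k_j)^T v_j
    k_sum = [0.0] * d                    # sum_j relu(k_j)
    for k, v in zip(K, V):
        fk = [relu(x) for x in k]
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(m):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = [relu(x) for x in q]
        norm = sum(fq[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(m)])
    return out
```

The result equals the quadratic form in which each output is a weighted average of values with weights `relu(q_i) . relu(k_j)`, up to the small `eps` in the denominator.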
[CV-93] RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation
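[Quick read]: This paper addresses the inefficiency of score-based generative models for category-level 6D object pose estimation, which stems from their high sampling cost. The key to the solution is RFM-Pose, a framework that uses a flow-matching generative model to generate pose candidates along an optimal transport path from a simple prior distribution, casts the sampling process as a Markov decision process, and fine-tunes the sampling policy with proximal policy optimization. The flow field is further treated as a learnable policy and combined with a value network, enabling joint optimization of pose generation and hypothesis scoring and substantially reducing computational cost while maintaining strong performance.
Link: https://arxiv.org/abs/2602.05257
Authors: Diya He,Qingchen Liu,Cong Zhang,Jiahu Qin
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This work has been submitted to the IEEE for possible publication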
Abstract:Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have to some extent solved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
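The efficiency argument for flow matching can be seen in a toy setting: along a straight optimal-transport path the target velocity is constant, so a handful of Euler steps carries a prior sample to the data. The sketch below treats a pose as a plain vector, which is a simplifying assumption (real 6D poses live on SE(3)), and it leaves out the PPO fine-tuning stage. All function names are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from prior sample x0 to data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field on a straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would be regressed onto `target_velocity` at points produced by `interpolate`; at inference the learned field replaces the oracle `velocity_fn`, and in the reinforcement-learning view each Euler step is one action of the policy.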
[CV-94] Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images
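[Quick read]: This paper addresses the scarcity of high-quality annotations that limits automated detection of Electron Dense Deposits (EDD) in glomerular disease, and in particular the label noise introduced by crowdsourced annotation. The core solution is an active label cleaning method that uses active learning to select the most valuable noisy samples for expert re-annotation, thereby building a high-accuracy label cleaning model. Its key component is a Label Selection Module that exploits discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading, markedly improving the performance and cost-effectiveness of medical AI models under limited expert resources.
Link: https://arxiv.org/abs/2602.05250
Authors: Jieyun Tan,Shuo Liu,Guibin Zhang,Ziqi Li,Jian Geng,Lei Zhang,Lei Cao
Affiliations: Southern Medical University; University of Wisconsin-Madison; Nanfang Hospital, Southern Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures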
Abstract:Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP50 on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
[CV-95] PatchFlow: Leveraging a Flow-Based Model with Patch Features
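[Quick read]: This paper addresses the insufficient efficiency and accuracy of surface defect detection in die casting, where traditional methods struggle to deliver automated, high-precision quality control. The key to the solution is combining local neighbor-aware patch features with a normalizing flow model and introducing an adapter module that bridges the domain gap between a generic pretrained feature extractor and industrial product images, improving the efficiency and accuracy of anomaly detection. The method achieves image-level AUROC of 99.28% on MVTec AD and 96.48% on VisA, reducing error by 20% and 28.2% respectively relative to the previous best methods, and reaches 95.77% detection accuracy on a proprietary die casting dataset without any anomalous training samples, demonstrating its effectiveness and generalization in industrial settings.
Link: https://arxiv.org/abs/2602.05238
Authors: Boxiang Zhang,Baijian Yang,Xiaoming Wang,Corey Vian
Affiliations: Purdue University; Stellantis - Kokomo Casting Plant
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: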
Abstract:Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach has also enhanced performance on the VisA dataset, achieving an image-level AUROC of 96.48%. Compared to the state-of-the-art models, this represents a 28.2% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
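The relative error reductions are easier to read next to the baseline they imply. The small calculation below assumes that "error" means 100 minus image-level AUROC and that the reduction is relative to the previous best method; neither is stated explicitly in the abstract, and the baseline figures it produces are implied values, not reported ones.

```python
def implied_baseline_auroc(auroc_pct, error_reduction_pct):
    """Baseline AUROC implied by a relative error reduction.

    Assumes error = 100 - AUROC (in percent) and that the new error equals
    the baseline error times (1 - reduction).
    """
    error = 100.0 - auroc_pct
    baseline_error = error / (1.0 - error_reduction_pct / 100.0)
    return 100.0 - baseline_error


# MVTec AD: 99.28% AUROC with a 20% error reduction.
mvtec_baseline = implied_baseline_auroc(99.28, 20.0)
# VisA: 96.48% AUROC with a 28.2% error reduction.
visa_baseline = implied_baseline_auroc(96.48, 28.2)
```

Under these assumptions the implied baselines are roughly 99.10% on MVTec AD and 95.10% on VisA, which shows how a large relative error reduction can correspond to a small absolute AUROC gain near saturation.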
[CV-96] Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification
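[Quick read]: This paper addresses the performance drop of training-free methods based on the Segment Anything Model (SAM) in Cross-Domain Few-Shot Segmentation (CD-FSS). Existing methods use dense point matches between reference and target images as prompts for mask prediction, but in cross-domain settings such as medical or satellite imagery, domain shift disrupts the point-image interactions learned by SAM and dense point prompts perform poorly. The key to the solution is Conditional Point Sparsification (CPS), which uses the ground-truth masks of the reference images as reliable guidance to adaptively sparsify the matched dense points, making the prompts more effective for SAM in cross-domain settings and improving segmentation accuracy.
Link: https://arxiv.org/abs/2602.05218
Authors: Jiahao Nie,Yun Xing,Wenbin An,Qingsong Zhao,Jiawei Shao,Yap-Peng Tan,Alex C. Kot,Shijian Lu,Xuelong Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: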
Abstract:Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
[CV-97] Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
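[Quick read]: This paper studies Cross-Domain Few-Shot Segmentation, where performance is limited by the scarcity and low diversity of target-domain samples, and focuses on the inefficient use of target samples caused by the source-trained model's initially weak few-shot capability in the target domain and by large domain gaps. The key to the solution is Multi-view Progressive Adaptation (MPA), which works along two dimensions. From the data perspective, Hybrid Progressive Augmentation progressively generates more complex and diverse views through cumulative strong augmentations, creating increasingly challenging learning scenarios. From the strategy perspective, Dual-chain Multi-view Prediction fully exploits these progressively complex views through sequential and parallel learning paths under extensive supervision, and enforces prediction consistency across views to achieve robust and accurate target-domain adaptation.
Link: https://arxiv.org/abs/2602.05217
Authors: Jiahao Nie,Guanqiao Fu,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu
Affiliations: Nanyang Technological University; Xi’an Jiaotong University; VinUniversity
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: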
Abstract:Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model’s initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
[CV-98] E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching
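[Quick read]: This paper addresses the difficulty Video Large Language Models (Vid-LLMs) have in capturing the semantic continuity and integrity of events in Temporal Video Grounding (TVG). Existing methods typically match event boundaries by comparing frame features with two separate start and end tokens, relying heavily on exact timestamps and giving a poor holistic understanding of the event. The core innovations are: (i) a special event token that aggregates information from all frames of the query event to preserve semantic continuity; (ii) Savitzky-Golay smoothing that reduces noise in the frame similarity sequence and improves prediction accuracy; and (iii) multi-grained frame feature aggregation that makes matching more reliable and compensates for compression-induced information loss, yielding more accurate and coherent event localization.
Link: https://arxiv.org/abs/2602.05215
Authors: Jiahao Nie,Wenbin An,Gongjie Zhang,Yicheng Xu,Yap-Peng Tan,Alex C. Kot,Shijian Lu
Affiliations: Nanyang Technological University, Singapore; Xi’an Jiaotong University, China; Alibaba DAMO Academy, Singapore; Institute of Science Tokyo, Japan; VinUniversity, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: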
Abstract:Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event’s semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special event token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
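Savitzky-Golay smoothing fits a low-order polynomial inside a sliding window, which for a fixed window and order reduces to a fixed convolution kernel. The sketch below applies the well-known 5-point quadratic kernel to a sequence of token-to-frame similarity scores. The window size, the polynomial order, and the edge handling (repeating the first and last value) are assumptions; the abstract does not give the settings used in the paper.

```python
# Savitzky-Golay coefficients for window length 5, polynomial order 2.
SG_WINDOW5_ORDER2 = [-3.0 / 35.0, 12.0 / 35.0, 17.0 / 35.0, 12.0 / 35.0, -3.0 / 35.0]


def savgol_smooth(scores):
    """Smooth a 1-D list of similarity scores with the 5-point kernel.

    Edges are handled by repeating the first and last value, which is an
    illustrative choice rather than the only valid one.
    """
    half = len(SG_WINDOW5_ORDER2) // 2
    padded = [scores[0]] * half + list(scores) + [scores[-1]] * half
    return [
        sum(c * padded[i + j] for j, c in enumerate(SG_WINDOW5_ORDER2))
        for i in range(len(scores))
    ]
```

Unlike a plain moving average, this kernel reproduces linear and quadratic trends exactly in the interior, so a genuine rise in similarity at an event boundary is preserved while an isolated one-frame spike is roughly halved.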
[CV-99] Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures
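[Quick read]: This paper addresses the tradeoff between semantic faithfulness and perceptual realism in current generative compression methods at ultra-low bitrates. Methods based on explicit representations preserve content structure but lack fine textures, whereas implicit methods generate visually plausible details but are prone to semantic drift. The key to the solution is a training-free unified framework that conditions a diffusion model on explicit high-level semantics while using reverse-channel coding to implicitly convey fine-grained texture information, together with a plug-in encoder that flexibly controls the balance between distortion and perceptual quality, so that semantic and visual quality are optimized jointly.
Link: https://arxiv.org/abs/2602.05213
Authors: Chuqin Zhou,Xiaoyue Ling,Yunuo Chen,Jincheng Dai,Guo Lu,Wenjun Zhang
Affiliations: Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: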
Abstract:While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
[CV-100] Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
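[Quick read]: This paper addresses three core challenges in precipitation nowcasting. First, accurate prediction of extreme weather events (such as heavy rainfall or storms) is critical for risk management, yet existing models struggle to capture their intricate fine-scale spatial structure. Second, current diffusion models perform well but are too computationally expensive for real-time use, while deterministic models are efficient but biased toward ordinary rainfall. Third, existing benchmark datasets are skewed, either dominated by ordinary rainfall or covering only extreme events, which limits generalization in real-world settings. The key to the solution is exPreCast, an efficient deterministic framework that introduces local spatiotemporal attention to better model fine-scale structure, a texture-preserving cubic dual upsampling decoder to better recover image detail, and a temporal extractor that flexibly adjusts the forecasting horizon. The authors also build a balanced radar dataset from the Korea Meteorological Administration (KMA) covering both ordinary and extreme precipitation events, improving robustness and practicality across rainfall types.
Link: https://arxiv.org/abs/2602.05204
Authors: Changhoon Song,Teng Yuan Chang,Youngjoon Hong
Affiliations: Seoul National University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 20 figures, The Fourteenth International Conference on Learning Representations, see this https URL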
Abstract:Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed, either dominated by ordinary rainfall events or restricted to extreme rainfall episodes, thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
[CV-101] GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
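[Quick read]: This paper addresses the difficulty of aligning video generative models with human preferences. Existing methods rely on Vision-Language Models (VLMs) for reward modeling, but VLMs struggle to capture subtle temporal dynamics in video. The key to the solution is to recast video generative models, which inherently model temporal structure, as Energy-Based Models (EBMs) trained with contrastive learning so that they discriminate video quality precisely: high-quality videos receive low energy and degraded videos receive high energy. To keep the model from exploiting superficial differences between real and generated videos, the authors construct synthetic negatives through controlled latent-space perturbations (temporal slicing, feature swapping, and frame shuffling), forcing the model to learn meaningful spatiotemporal features rather than surface artifacts. The method reaches state-of-the-art performance on GenAI-Bench and MonteBench with only 30K human annotations, far fewer than existing VLM-based methods require.
Link: https://arxiv.org/abs/2602.05202
Authors: Shivanshu Shekhar,Uttaran Bhattacharya,Raghavendra Addanki,Mehrab Tanjim,Somdeb Sarkhel,Tong Zhang
Affiliations: University of Illinois Urbana-Champaign; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: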
Abstract:Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations, 6× to 65× fewer than existing VLM-based approaches.
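The energy-based framing can be sketched with two pieces: a perturbation that produces a negative from a clean clip, and a pairwise loss that pushes the clean clip's energy below the negative's. The code below shows frame shuffling on a list of frame indices and a softplus pairwise loss. Both are generic stand-ins: the paper perturbs latent representations, and its exact contrastive objective is not given in the abstract. Function names are hypothetical.

```python
import math
import random


def shuffle_frames(frames, seed=0):
    """Return a copy of the frame sequence in a random order.

    Stands in for the frame-shuffling perturbation; here frames are
    arbitrary list items rather than latent tensors.
    """
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def pairwise_energy_loss(e_pos, e_neg):
    """Softplus of (e_pos - e_neg), computed in a numerically stable way.

    The loss is small when the clean video has lower energy than the
    perturbed one, and grows linearly when the ordering is reversed.
    """
    z = e_pos - e_neg
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Because a shuffled clip keeps every frame's appearance and changes only the temporal order, an energy model can separate it from the original only by attending to motion, which is the behaviour the synthetic negatives are meant to encourage.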
[CV-102] PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction
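[Quick read]: This paper addresses neural-rendering-based novel view synthesis for dynamic human scenes, in particular the geometric distortion and temporal inconsistency caused by articulated motion and severe self-occlusion. The key to the solution is the PoseGaussian framework, which embeds human pose signals into both the geometry estimation and temporal consistency stages: as a structural prior, pose is fused with a color encoder to refine depth estimation; as a temporal cue, pose is processed by a dedicated pose encoder to improve consistency across frames. The pipeline is end-to-end differentiable, improving reconstruction quality and generalization while retaining the real-time rendering efficiency of standard Gaussian Splatting (100 FPS).
Link: https://arxiv.org/abs/2602.05190
Authors: Ju Shen,Chen Chen,Tam V. Nguyen,Vijayan K. Asari
Affiliations: University of Dayton; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: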
Abstract:We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
[CV-103] ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification
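[Quick read]: This paper addresses the vulnerability of deep neural networks to imperceptible adversarial attacks in visual recognition. Existing defenses such as adversarial training and purification have made progress, but diffusion-based purification often incurs high computational cost and information loss. The key to the solution is Shape Guided Purification (ShapePuri), a defense framework that improves robustness by aligning model representations with stable structural invariants. It has two core components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias through stochastic transformations. The method achieves 84.06% clean accuracy and 81.64% robust accuracy under the AutoAttack protocol, the first to surpass the 80% threshold, and preserves prediction stability at inference without auxiliary modules or additional computational cost.
Link: https://arxiv.org/abs/2602.05175
Authors: Zhe Li,Bernhard Kainz
Affiliations: FAU Erlangen-Nürnberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures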
Abstract:Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that are imperceptible to humans. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework that enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves 84.06% clean accuracy and 81.64% robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the 80% threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.
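A signed distance function turns a binary shape mask into a dense map in which every pixel stores its distance to the shape boundary, which is one way a shape signal can supervise every location rather than only edge pixels. The brute-force sketch below computes such a map for a small mask. The sign convention (negative inside), the pixel-to-nearest-opposite-pixel distance, and the function name are assumptions; the abstract does not describe how the SEM computes or consumes its SDF.

```python
import math


def signed_distance(mask):
    """Brute-force signed distance map for a small binary mask.

    Each pixel gets the Euclidean distance to the nearest pixel of the
    opposite label, negative inside the shape and positive outside.
    The mask must contain both foreground and background pixels.
    """
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]

    def nearest(r, c, points):
        return min(math.hypot(r - pr, c - pc) for pr, pc in points)

    sdf = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                sdf[r][c] = -nearest(r, c, outside)
            else:
                sdf[r][c] = nearest(r, c, inside)
    return sdf
```

The quadratic cost is acceptable only for toy inputs; a real pipeline would use a linear-time distance transform. Small adversarial perturbations change pixel intensities but leave such a shape map almost unchanged, which is the intuition behind using it as a stable structural target.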
[CV-104] LOBSTgER-enhance: an underwater image enhancement pipeline
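[Quick read]: This paper addresses the image degradations inherent to underwater photography, including reduced contrast, spatial blur, and wavelength-dependent color distortion, which dull the visual impact of marine life and force photographers into heavy post-processing. The key to the solution is an image-to-image pipeline based on diffusion-based generation that learns to reverse these degradations with the help of a synthetic corruption pipeline. Training and evaluation use a small, high-quality dataset of awareness photography by Keith Ellenbogen; the resulting model, with about 11M parameters and trained from scratch on about 2.5k images, reconstructs 512×768 images with high perceptual consistency and strong generalization.
Link: https://arxiv.org/abs/2602.05163
Authors: Andreas Mentzelopoulos,Keith Ellenbogen
Affiliations: Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 30 figures, work done as part of LOBSTgER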
Abstract:Underwater photography presents significant inherent challenges including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life and awareness photographers in particular are often challenged with heavy post-processing pipelines to correct for these distortions. We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a model of ~11M parameters after training from scratch on ~2.5k images.
[CV-105] SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition
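[Quick read]: This paper addresses deep neural networks inheriting social and demographic biases (tied to sensitive attributes such as race, age, and gender) from annotated training data, which leads to unfair predictions. Because data are imbalanced across attribute groups, existing methods tend to amplify the influence of sensitive attributes, worsening both unfairness and performance. The key to the solution is SHaSaM (Submodular Hard Sample Mining), a combinatorial approach that casts fairness-driven representation learning as a submodular hard-sample mining problem. In the first stage, SHaSaM-MINE uses a submodular subset selection strategy to mine hard positives and negatives, mitigating data imbalance. In the second stage, SHaSaM-LEARN introduces combinatorial loss functions based on Submodular Conditional Mutual Information that maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This restricts the model from learning features tied to sensitive attributes and improves fairness without sacrificing performance.
Link: https://arxiv.org/abs/2602.05162
Authors: Anay Majee,Rishabh Iyer
Affiliations: The University of Texas at Dallas
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 21 pages, 7 tables, 10 figures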
Abstract:Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes such as race, age, and gender. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives, effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, in fewer epochs than existing methods.
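For readers unfamiliar with submodular subset selection, the following is a minimal sketch of the general technique, not the SHaSaM-MINE objective itself (the abstract does not spell that out). It assumes a facility-location score, a standard monotone submodular function, and the names `greedy_facility_location` and `SIMILARITY` are hypothetical.

```python
def greedy_facility_location(similarity, budget):
    """Greedily pick `budget` indices that maximize a facility-location score.

    similarity[i][j] is a non-negative similarity between samples i and j.
    The score of a set S is sum_i max_{j in S} similarity[i][j], which is
    monotone submodular, so greedy selection has the classic (1 - 1/e) bound.
    This is an illustrative stand-in, not the objective used in the paper.
    """
    n = len(similarity)
    coverage = [0.0] * n  # best similarity to the selected set, per sample
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_index = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves each sample's coverage.
            gain = sum(max(0.0, similarity[i][j] - coverage[i]) for i in range(n))
            if gain > best_gain:  # strict '>' keeps the lowest index on ties
                best_gain, best_index = gain, j
        selected.append(best_index)
        coverage = [max(coverage[i], similarity[i][best_index]) for i in range(n)]
    return selected


# Hypothetical 4-sample similarity matrix (values chosen to be exact in binary).
SIMILARITY = [
    [1.0, 0.75, 0.0, 0.0],
    [0.75, 1.0, 0.25, 0.0],
    [0.0, 0.25, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
print(greedy_facility_location(SIMILARITY, 2))  # prints [1, 2]
```

The second pick shows the diminishing-returns behavior: sample 0 adds little because sample 1 already covers it, so the greedy step moves to the other cluster.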
[CV-106] AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves ICASSP2026
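[Quick read]: This paper addresses the sharp performance drop of existing vision-based hand tracking models when the hand wears a sensing glove. The core challenge is the large appearance gap between bare hands and gloves, which limits generalization in both zero-shot and fine-tuning settings. The key to the solution is AirGlove, which learns transferable glove representations from existing glove data and generalizes to unseen glove designs with limited new-glove data, substantially improving hand pose estimation accuracy.
Link: https://arxiv.org/abs/2602.05159
Authors: Wenhui Cui,Ziyi Kou,Chuan Qin,Ergys Ristani,Li Guan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICASSP 2026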
Abstract:Sensing gloves have become important tools for teleoperation and robotic policy learning as they are able to provide rich signals like speed, acceleration and tactile feedback. A common approach to track gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to the large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
[CV-107] ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation
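[Quick read]: This paper addresses online egocentric gaze estimation: predicting where the camera wearer is looking from only the current and past frames, a task important for augmented reality (AR) and assistive technologies. Without explicit head or eye signals, the model must infer visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. The key to the solution is to exploit the strong temporal continuity of gaze. ARGaze formulates gaze estimation as sequential prediction: at each timestep a transformer decoder conditions on the current visual features and a fixed-length Gaze Context Window of recent gaze target estimates, giving causally constrained autoregressive inference. The design enforces temporal causality, supports bounded-resource streaming inference, and achieves state-of-the-art performance on multiple egocentric benchmarks, with ablations indicating that autoregressive modeling over a bounded gaze history is critical for robust prediction.
Link: https://arxiv.org/abs/2602.05132
Authors: Jia Li,Wenjie Zhao,Shijian Deng,Bolin Lai,Yuheng Wu,Ruijia Chen,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: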
Abstract:Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
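The streaming loop described in the abstract (condition on the current frame plus a fixed-length window of the model's own recent gaze estimates) can be sketched as follows. This is only an illustration of the control flow: `toy_gaze_step` is a hypothetical placeholder that blends a visual cue with the mean of recent gaze, standing in for the transformer decoder, and all names are assumptions.

```python
from collections import deque


def toy_gaze_step(visual_cue, history, blend=0.5):
    """Placeholder for the decoder: blend the visual cue with the mean of recent gaze."""
    if not history:
        return visual_cue
    mean_x = sum(x for x, _ in history) / len(history)
    mean_y = sum(y for _, y in history) / len(history)
    return (blend * visual_cue[0] + (1.0 - blend) * mean_x,
            blend * visual_cue[1] + (1.0 - blend) * mean_y)


def stream_gaze(visual_cues, window_length=3, step=toy_gaze_step):
    """Causal streaming loop: each estimate sees only the current cue and past estimates."""
    window = deque(maxlen=window_length)  # fixed-length context, oldest entry dropped first
    estimates = []
    for cue in visual_cues:
        estimate = step(cue, list(window))
        window.append(estimate)  # feed the model's own output back in (autoregression)
        estimates.append(estimate)
    return estimates
```

Because the `deque` has a `maxlen`, memory stays constant however long the video runs, which is the bounded-resource property the abstract highlights.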
[CV-108] CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology
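[Quick read]: This paper addresses the limited morphologic interpretability of attention-based multiple instance learning (MIL) when predicting human papillomavirus (HPV) status from whole-slide histopathology. The key to the solution is the CLEAR-HPV framework, which restructures the MIL latent space using attention so that pathology-relevant concepts (such as keratinizing, basaloid, and stromal morphology) are discovered automatically without concept labels. It generates spatial concept maps and replaces the original high-dimensional embedding (e.g., 1536 dimensions) with a concept-fraction vector of only 10 dimensions, preserving predictive performance while improving interpretability. The framework is backbone-agnostic and generalizes across datasets (TCGA-HNSCC, TCGA-CESC, CPTAC-HNSCC).
Link: https://arxiv.org/abs/2602.05126
Authors: Weiyi Qin,Yingci Liu-Swetz,Shiwei Tan,Hao Wang
Affiliations: Rutgers University; Rutgers Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: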
Abstract:Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV’s concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
[CV-109] Rule-Based Spatial Mixture-of-Experts U-Net for Explainable Edge Detection
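[Quick read]: This paper addresses the opacity of deep learning models such as U-Net in edge detection: they perform well, but their black-box decisions fall short of the explainability needed in safety-critical settings. The key to the solution is the Rule-Based Spatial Mixture-of-Experts U-Net (sMoE U-Net), with two core innovations: (1) Spatially-Adaptive Mixture-of-Experts (sMoE) blocks in the decoder skip connections, which switch dynamically between "Context" (smooth) and "Boundary" (sharp) experts based on local feature statistics; and (2) a Takagi-Sugeno-Kang (TSK) fuzzy head that replaces the standard classification layer and fuses deep semantic features with heuristic edge signals through explicit IF-THEN rules. Together these give pixel-level explainability, including "Rule Firing Maps" and "Strategy Maps" that visualize why an edge was detected.
Link: https://arxiv.org/abs/2602.05100
Authors: Bharadwaj Dogga,Kaaustaaub Shankar,Gibin Raju,Wilhelm Louw,Kelly Cohen
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Symbolic Computation (cs.SC)
Comments: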
Abstract:Deep learning models like U-Net and its variants have established state-of-the-art performance in edge detection tasks and are used by Generative AI services worldwide for their image generation models. However, their decision-making processes remain opaque, operating as “black boxes” that obscure the rationale behind specific boundary predictions. This lack of transparency is a critical barrier in safety-critical applications where verification is mandatory. To bridge the gap between high-performance deep learning and interpretable logic, we propose the Rule-Based Spatial Mixture-of-Experts U-Net (sMoE U-Net). Our architecture introduces two key innovations: (1) Spatially-Adaptive Mixture-of-Experts (sMoE) blocks integrated into the decoder skip connections, which dynamically gate between “Context” (smooth) and “Boundary” (sharp) experts based on local feature statistics; and (2) a Takagi-Sugeno-Kang (TSK) Fuzzy Head that replaces the standard classification layer. This fuzzy head fuses deep semantic features with heuristic edge signals using explicit IF-THEN rules. We evaluate our method on the BSDS500 benchmark, achieving an Optimal Dataset Scale (ODS) F-score of 0.7628, effectively matching purely deep baselines like HED (0.7688) while outperforming the standard U-Net (0.7437). Crucially, our model provides pixel-level explainability through “Rule Firing Maps” and “Strategy Maps,” allowing users to visualize whether an edge was detected due to strong gradients, high semantic confidence, or specific logical rule combinations.
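As a rough illustration of how a TSK fuzzy head turns IF-THEN rules into a per-pixel score and a set of rule firing strengths, here is a minimal first-order TSK inference sketch. The rule base, membership centers, widths, and consequent weights below are all hypothetical values picked for the example; the paper's actual rules and inputs are not given in the abstract.

```python
import math


def gaussian(x, center, width):
    """Gaussian membership degree of x in a fuzzy set."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


# Hypothetical rules: (name, gradient center, confidence center, (bias, w_grad, w_conf)).
RULES = [
    ("strong_gradient", 0.9, 0.5, (0.1, 0.9, 0.0)),
    ("high_semantic_confidence", 0.5, 0.9, (0.1, 0.0, 0.9)),
    ("flat_region", 0.1, 0.1, (0.0, 0.0, 0.0)),
]


def tsk_edge_score(gradient, confidence, rules=RULES, width=0.3):
    """First-order TSK inference: firing-strength-weighted average of linear consequents."""
    firing = {}
    weighted_sum = 0.0
    for name, grad_center, conf_center, (bias, w_grad, w_conf) in rules:
        # IF gradient is near grad_center AND confidence is near conf_center (product t-norm)
        strength = gaussian(gradient, grad_center, width) * gaussian(confidence, conf_center, width)
        # THEN edge = bias + w_grad * gradient + w_conf * confidence
        weighted_sum += strength * (bias + w_grad * gradient + w_conf * confidence)
        firing[name] = strength
    total = sum(firing.values())
    score = weighted_sum / total if total > 0.0 else 0.0
    return score, firing
```

Collecting the `firing` dictionary for every pixel is one plausible way to build something like a rule firing map: the dominant rule at each pixel says which condition drove the edge decision.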
[CV-110] Visual concept ranking uncovers medical shortcuts used by large multimodal models
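[Quick read]: This paper addresses the insufficient reliability of machine learning models in safety-critical domains such as healthcare, in particular the performance gaps and lack of interpretability that large multimodal models (LMMs) may show on medical tasks. The core challenge is to identify the key visual concepts a model's decisions depend on and to expose potentially unfair behavior across demographic groups. The key to the solution is Visual Concept Ranking (VCR), a method that systematically identifies a model's dependence on specific visual features and validates the resulting hypotheses with manual interventions, helping to uncover and understand behavioral flaws and potential biases in tasks such as classifying skin lesions from clinical images.
Link: https://arxiv.org/abs/2602.05096
Authors: Joseph D. Janizek,Sonnet Xu,Junayd Lateef,Roxana Daneshjou
Affiliations: Stanford University; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: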
Abstract:Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
[CV-111] Gabor Fields: Orientation-Selective Level-of-Detail for Volume Rendering
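[Quick read]: This paper addresses the difficulty of building hierarchical representations for Gaussian-based volume rendering, in particular how to manage level of detail (LOD) efficiently and without aliasing while avoiding the extra memory and discontinuous level transitions caused by prefiltering or per-level re-fitting. The key to the solution is Gabor Fields, an orientation-selective mixture of Gabor kernels that enables continuous frequency filtering at no additional compute cost by selectively pruning primitives. Pruning reduces frequency content and so improves rendering performance; in addition, stochastically sampling different frequencies and orientations at each ray recursion masks substantial portions of the volume, accelerating ray traversal in single- and multiple-scattering settings.
Link: https://arxiv.org/abs/2602.05081
Authors: Jorge Condor,Nicolai Hermann,Mehmet Ata Yurtsever,Piotr Didyk
Affiliations: Università della Svizzera Italiana; USI Lugano
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, incl Appendix and References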
Abstract:Gaussian-based representations have enabled efficient physically-based volume rendering at a fraction of the memory cost of regular, discrete, voxel-based distributions. However, several remaining issues hamper their widespread use. One of the advantages of classic voxel grids is the ease of constructing hierarchical representations by either storing volumetric mipmaps or selectively pruning branches of an already hierarchical voxel grid. Such strategies reduce rendering time and eliminate aliasing when lower levels of detail are required. Constructing similar strategies for Gaussian-based volumes is not trivial. Straightforward solutions, such as prefiltering or computing mipmap-style representations, lead to increased memory requirements or expensive re-fitting of each level separately. Additionally, such solutions do not guarantee a smooth transition between different hierarchy levels. To address these limitations, we propose Gabor Fields, an orientation-selective mixture of Gabor kernels that enables continuous frequency filtering at no cost. The frequency content of the asset is reduced by selectively pruning primitives, directly benefiting rendering performance. Beyond filtering, we demonstrate that stochastically sampling from different frequencies and orientations at each ray recursion enables masking substantial portions of the volume, accelerating ray traversal time in single- and multiple-scattering settings. Furthermore, inspired by procedural volumes, we present an application for efficient design and rendering of procedural clouds as Gabor-noise-modulated Gaussians.
[CV-112] Food Portion Estimation: From Pixels to Calories
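[Quick read]: This paper addresses a key difficulty in image-based dietary assessment: accurately estimating the three-dimensional size of food from two-dimensional (2D) image input. The key to the solution is deep learning that uses monocular images, or images combined with auxiliary inputs (such as depth maps, multi-view images, or model-based template matching), to predict food portions accurately, improving the accuracy and practicality of dietary assessment.
Link: https://arxiv.org/abs/2602.05078
Authors: Gautham Vinod,Fengqing Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: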
Abstract:Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual’s health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment struggles to estimate the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by using either monocular images or combinations of the image and the auxiliary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
[CV-113] VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
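[Quick read]: This paper addresses vision-action misalignment in Vision-Language-Action (VLA) models, which arises when pretrained Vision-Language Models (VLMs) are extended to the action space: action predictions depend only weakly on the current visual state, making action outputs unreliable. The key to the solution is to strengthen visual conditioning explicitly. The method first aligns action prediction with visual input through preference optimization on a track-following task, then transfers this alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without architectural changes or additional data collection, it improves visual conditioning and task performance in both discrete and continuous settings.
Link: https://arxiv.org/abs/2602.05049
Authors: Yiye Chen,Yanan Jian,Xiaoyi Dong,Shuxin Cao,Jing Wu,Patricio Vela,Benjamin E. Lundell,Dongdong Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: In submission. Project website: this https URL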
Abstract:Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: this https URL.
[CV-114] UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking
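[Quick read]: This paper addresses the performance bottleneck in multi-object tracking (MOT) caused by the difficulty of jointly optimizing detection accuracy, identity preservation, and spatiotemporal consistency. Existing methods typically redesign the tracking architecture for a specific objective and lack generality and ease of integration. The key to the solution is UniTrack, a plug-and-play graph-theoretic loss function that uses differentiable graph representation learning to unify detection accuracy, identity preservation, and spatiotemporal consistency in a single end-to-end trainable loss. It improves tracking performance without modifying existing MOT architectures, with consistent gains reported across several mainstream models and benchmarks.
Link: https://arxiv.org/abs/2602.05037
Authors: Bishoy Galoaa,Xiangyu Bai,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: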
Abstract:We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53% reduction in identity switches and 12% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7% MOTA on SportsMOT.
[CV-115] Differentiable Inverse Graphics for Zero-shot Scene Reconstruction and Robot Grasping
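[Quick read]: This paper addresses how robots can estimate and interact with previously unseen objects in novel real-world environments, targeting the limitation that existing methods rely on large amounts of training data and test-time samples to build black-box scene representations. The key to the solution is a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering. From a single RGBD image and bounding boxes, it solves a series of constrained optimization problems to estimate physically consistent scene parameters (such as meshes, lighting conditions, material properties, and 6D poses), enabling zero-shot scene reconstruction and grasping without additional 3D data or test-time sampling and improving the data efficiency, interpretability, and generalization of robot autonomy.
Link: https://arxiv.org/abs/2602.05029
Authors: Octavio Arriaga,Proneet Sharma,Jichen Guo,Marc Otto,Siddhant Kadwe,Rebecca Adam
Affiliations: DFKI GmbH
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L) for review. This version includes the statement required by IEEE for preprints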
Abstract:Operating effectively in novel real-world environments requires robotic systems to estimate and interact with previously unseen objects. Current state-of-the-art models address this challenge by using large amounts of training data and test-time samples to build black-box scene representations. In this work, we introduce a differentiable neuro-graphics model that combines neural foundation models with physics-based differentiable rendering to perform zero-shot scene reconstruction and robot grasping without relying on any additional 3D data or test-time samples. Our model solves a series of constrained optimization problems to estimate physically consistent scene parameters, such as meshes, lighting conditions, material properties, and 6D poses of previously unseen objects from a single RGBD image and bounding boxes. We evaluated our approach on standard model-free few-shot benchmarks and demonstrated that it outperforms existing algorithms for model-free few-shot pose estimation. Furthermore, we validated the accuracy of our scene reconstructions by applying our algorithm to a zero-shot grasping task. By enabling zero-shot, physically-consistent scene reconstruction and grasping without reliance on extensive datasets or test-time sampling, our approach offers a pathway towards more data efficient, interpretable and generalizable robot autonomy in novel environments.
[CV-116] Untwisting RoPE: Frequency Control for Shared Attention in DiTs
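[Quick read]: This paper addresses reference copying in transformer models that use Rotary Positional Embeddings (RoPE) in multimodal and attention-sharing settings. When a target image is generated while attending to tokens from a reference image, the model tends to copy the reference content instead of extracting only its style. The cause is that the high-frequency components of RoPE dominate the attention computation, so queries attend mainly to spatially aligned reference tokens. The key to the solution is to selectively modulate RoPE frequency bands so that attention depends on semantic similarity rather than strict positional alignment. This restores stable and meaningful cross-image interaction in shared-attention architectures and gives control over the balance between style transfer and content copying, producing style-aligned generations that do not duplicate reference content.
Link: https://arxiv.org/abs/2602.05013
Authors: Aryan Mikaeili,Or Patashnik,Andrea Tagliasacchi,Daniel Cohen-Or,Ali Mahdavi-Amiri
Affiliations: Simon Fraser University; Tel Aviv University; University of Toronto
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: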
Abstract:Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities. We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior. Building on these insights, we introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.
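To make the frequency decomposition concrete, the sketch below implements standard 1-D RoPE (pairwise rotations with angular frequencies `base ** (-2i / dim)`) and adds an optional per-band scale. The `band_scale` argument is a hypothetical knob introduced only for illustration; the abstract does not state how the paper modulates bands, and image DiTs typically use a 2-D variant rather than this 1-D form.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Angular frequency of each 2-D channel pair; index 0 rotates fastest."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vector, position, freqs, band_scale=None):
    """Rotate each channel pair by position * frequency (optionally scaled per band)."""
    rotated = []
    for i, freq in enumerate(freqs):
        scale = 1.0 if band_scale is None else band_scale[i]
        angle = position * freq * scale
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(angle) - y * math.sin(angle))
        rotated.append(x * math.sin(angle) + y * math.cos(angle))
    return rotated


def attention_logit(query, key, q_pos, k_pos, freqs, band_scale=None):
    """Dot product of the RoPE-rotated query and key."""
    q = apply_rope(query, q_pos, freqs, band_scale)
    k = apply_rope(key, k_pos, freqs, band_scale)
    return sum(a * b for a, b in zip(q, k))
```

With the default scale the logit depends only on the offset between the two positions; setting a band's scale to zero removes that band's positional sensitivity entirely, so the logit for that band reduces to plain content similarity.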
[CV-117] SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy
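[Quick read]: This paper addresses how to decouple identity information from visual representations, and so strengthen privacy protection, as facial recognition becomes widely used in online banking, identity verification, and other networked services. The core solution is the SIDeR framework, which decomposes a face image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component, then uses semantic-guided recomposition in the latent space of a diffusion model to generate visually anonymous adversarial faces that keep machine-level identity consistency. Its key innovations are momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor, which synthesize diverse, highly natural adversarial samples; authorized users can restore the original image with a password.
Link: https://arxiv.org/abs/2602.04994
Authors: Zhuosen Bao,Xia Du,Zheng Lin,Jizhe Zhou,Zihan Fang,Jiening Wu,Yuxin Zhang,Zhe Chen,Chi-man Pun,Wei Ni,Jun Luo
Affiliations: Xiamen University of Technology; University of Hong Kong; Sichuan University; City University of Hong Kong; Southwest University; Fudan University; University of Macau; CSIRO; University of New South Wales; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures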
Abstract:With the deep integration of facial recognition into online banking, identity verification, and other networked services, achieving effective decoupling of identity information from visual representations during image storage and transmission has become a critical challenge for privacy protection. To address this issue, we propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. By leveraging semantic-guided recomposition in the latent space of a diffusion model, it generates visually anonymous adversarial faces while maintaining machine-level identity consistency. The framework incorporates momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize multiple visually diverse, highly natural adversarial samples. Furthermore, for authorized access, the protected image can be restored to its original form when the correct password is provided. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that SIDeR achieves a 99% attack success rate in black-box scenarios and outperforms baseline methods by 41.28% in PSNR-based restoration quality.
[CV-118] Temporal Pair Consistency for Variance-Reduced Flow Matching
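[Quick read]: This paper addresses the high estimator variance and inefficient sampling of continuous-time generative models (such as diffusion models, flow matching, and rectified flow) whose training objectives treat timesteps independently. Existing remedies use explicit smoothness penalties, trajectory regularization, or modified probability paths and solvers, which are often complex and require changes to the model or training pipeline. The paper proposes Temporal Pair Consistency (TPC), a lightweight variance-reduction mechanism at the estimator level that couples velocity predictions at paired timesteps along the same probability path, without modifying the model architecture, probability path, or solver. Theoretical analysis shows that TPC induces a quadratic, trajectory-coupled regularization that provably reduces gradient variance while preserving the original flow-matching objective. Applied to flow matching, it improves sample quality and efficiency on CIFAR-10 and ImageNet at multiple resolutions, reaching lower FID at equal or lower compute, and integrates with current SOTA-style pipelines (noise-augmented training, score-based denoising, and rectified flow).
Link: https://arxiv.org/abs/2602.04908
Authors: Chika Maduabuchi,Jindong Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: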
Abstract:Continuous-time generative models, such as diffusion models, flow matching, and rectified flow, learn time-dependent vector fields but are typically trained with objectives that treat timesteps independently, leading to high estimator variance and inefficient sampling. Prior approaches mitigate this via explicit smoothness penalties, trajectory regularization, or modified probability paths and solvers. We introduce Temporal Pair Consistency (TPC), a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path, operating entirely at the estimator level without modifying the model architecture, probability path, or solver. We provide a theoretical analysis showing that TPC induces a quadratic, trajectory-coupled regularization that provably reduces gradient variance while preserving the underlying flow-matching objective. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods, and extends seamlessly to modern SOTA-style pipelines with noise-augmented training, score-based denoising, and rectified flow.
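The abstract does not give the exact TPC objective, so the following is only one plausible reading of "coupling velocity predictions at paired timesteps along the same probability path". It assumes a straight-line path between a noise sample `x0` and a data sample `x1`, and adds a quadratic penalty on the difference between the two predictions; the function names and the `weight` parameter are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight path from x0 (noise) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def squared_error(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def paired_flow_matching_loss(model, x0, x1, t_a, t_b, weight=1.0):
    """Flow-matching loss at two timesteps plus a quadratic pair-consistency term."""
    target = [b - a for a, b in zip(x0, x1)]  # constant velocity of the linear path
    pred_a = model(interpolate(x0, x1, t_a), t_a)
    pred_b = model(interpolate(x0, x1, t_b), t_b)
    matching = 0.5 * (squared_error(pred_a, target) + squared_error(pred_b, target))
    consistency = squared_error(pred_a, pred_b)  # hypothetical coupling term
    return matching + weight * consistency
```

Here `model` is any callable taking `(x_t, t)` and returning a velocity vector. On a linear path the target velocity is the same at both timesteps, so a model that matches the target exactly incurs no consistency penalty either.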
[CV-119] Disc-Centric Contrastive Learning for Lumbar Spine Severity Grading
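[Quick read]: This paper addresses automated severity grading of lumbar spinal stenosis from sagittal T2-weighted MRI, focusing on improving the model's discrimination of key anatomical regions and mitigating class imbalance. The key to the solution is a disc-centric framework that combines contrastive pretraining with disc-level fine-tuning: features are extracted from a localized region of interest for each intervertebral disc, an auxiliary regression task handles disc localization, and a weighted focal loss optimizes classification. This improves severity grading accuracy and in particular reduces severe-to-normal misclassification.
Link: https://arxiv.org/abs/2602.05738
Authors: Sajjan Acharya,Pralisha Kansakar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: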
Abstract:This work examines a disc-centric approach for automated severity grading of lumbar spinal stenosis from sagittal T2-weighted MRI. The method combines contrastive pretraining with disc-level fine-tuning, using a single anatomically localized region of interest per intervertebral disc. Contrastive learning is employed to help the model focus on meaningful disc features and reduce sensitivity to irrelevant differences in image appearance. The framework includes an auxiliary regression task for disc localization and applies weighted focal loss to address class imbalance. Experiments demonstrate a 78.1% balanced accuracy and a reduced severe-to-normal misclassification rate of 2.13% compared with supervised training from scratch. Detecting discs with moderate severity can still be challenging, but focusing on disc-level features provides a practical way to assess lumbar spinal stenosis.
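For reference, a weighted focal loss for a single sample can be written as -w_c * (1 - p_c)^gamma * log(p_c), where p_c is the predicted probability of the true class c and w_c its class weight. The sketch below is the generic formulation, assuming per-class weights and the commonly used default gamma = 2; the weights and gamma used in the paper are not stated in the abstract.

```python
import math


def weighted_focal_loss(probabilities, target, class_weights, gamma=2.0, eps=1e-12):
    """Focal loss for one sample: -w_c * (1 - p_c)^gamma * log(p_c)."""
    # Clamp to avoid log(0) when the true class gets zero probability.
    p = min(max(probabilities[target], eps), 1.0)
    return -class_weights[target] * (1.0 - p) ** gamma * math.log(p)
```

The `(1 - p) ** gamma` factor shrinks the loss of confidently correct samples, and the class weight lets rare grades contribute more; with `gamma=0` and unit weights the expression reduces to ordinary cross-entropy.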
[CV-120] Towards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis ECAI2025
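[Quick read]: This paper addresses cross-modality image segmentation in liver tumour ablation, where lesions clearly visible on pre-operative MRI are hard to identify on intra-operative CT for lack of contrast. The core challenge is using pathology information from one modality (MRI) to support segmentation in another (CT), especially when the target lesion is invisible in the target modality. The key to the solution is a hybrid registration-segmentation framework that combines MSCGUNet with a UNet-based segmentation module: images from the two modalities are first spatially registered, and the registration result is used to generate pseudo-labels for training CT segmentation. The method works well on healthy liver tissue (Dice score 0.72) but degrades sharply on clinical data containing tumours (Dice score 0.16), exposing a fundamental limitation: current registration methods cannot compensate for missing lesion features in the target modality. The paper stresses that future work should focus on the "domain gap" and "feature absence" problems.
Link: https://arxiv.org/abs/2602.05453
Authors: Budhaditya Mukhopadhyay,Chirag Mandal,Pavan Tummala,Naghmeh Mahmoodian,Andreas Nürnberger,Soumick Chatterjee
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph)
Comments: Accepted for AIBio at ECAI 2025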
Abstract:Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the “domain gap” and “feature absence” problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code and weights are available at: this https URL
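The Dice scores quoted above (0.72 versus 0.16) measure the overlap between a predicted mask and a reference mask. A minimal sketch of the metric on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_score(prediction, reference):
    """Dice coefficient between two flat binary masks: 2 * |A & B| / (|A| + |B|)."""
    if len(prediction) != len(reference):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, r in zip(prediction, reference) if p and r)
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total
```

For example, a prediction covering three pixels that shares one pixel with a one-pixel reference scores 2 * 1 / (3 + 1) = 0.5.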
[CV-121] Context-Aware Asymmetric Ensembling for Interpretable Retinopathy of Prematurity Screening via Active Query and Vascular Attention
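[Quick read]: This paper addresses two challenges in automated screening for Retinopathy of Prematurity (ROP): poor generalization of existing deep learning models due to scarce and imbalanced data, and the difficulty of identifying multiple clinical features, such as structural staging and microvascular abnormalities, at once. The key to the solution is the Context-Aware Asymmetric Ensemble Model (CAA Ensemble), built from two specialized streams. The Multi-Scale Active Query Network (MS-AQNet) uses clinical context as dynamic query vectors to spatially control visual feature extraction and localize the fibrovascular ridge. VascuMIL, a vascular topology encoder based on gated Multiple Instance Learning (MIL), uses Vascular Topology Maps (VMAP) to model vascular tortuosity. A synergistic meta-learner fuses these orthogonal signals to resolve conflicts across diagnostic objectives, and counterfactual attention heatmaps and vascular threat maps provide "Glass Box" transparency, showing that clinical metadata drives the model's visual search. The model reaches SOTA performance on a small, imbalanced dataset.
Link: https://arxiv.org/abs/2602.05208
Authors: Md. Mehedi Hassan,Taufiq Hasan
Affiliations: Bangladesh University of Engineering and Technology; Johns Hopkins University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures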
Abstract:Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Second, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features ‘Glass Box’ transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model’s visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.
[CV-122] QuantumGS: Quantum Encoding Framework for Gaussian Splatting
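[Quick Read]: This paper addresses the limitations of standard 3D Gaussian Splatting (3DGS) in handling high-frequency view-dependent effects such as sharp reflections and transparency, which typically stem from its reliance on a Spherical Harmonics representation. To overcome this, the authors propose QuantumGS, a new hybrid framework whose key innovation is integrating Variational Quantum Circuits (VQC) into the Gaussian Splatting pipeline: a distinctive encoding strategy maps the viewing direction directly onto the Bloch sphere, using the natural geometry of a qubit to represent 3D directional data, and the conventional color-modulating network is replaced by quantum circuits generated by a hypernetwork or conditioning mechanism, which markedly improves the model's expressivity and generalization.
Link: https://arxiv.org/abs/2602.05047
Authors: Grzegorz Wilczyński, Rafał Tobiasz, Paweł Gora, Marcin Mazur, Przemysław Spurek
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: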
Abstract:Recent advances in neural rendering, particularly 3D Gaussian Splatting (3DGS), have enabled real-time rendering of complex scenes. However, standard 3DGS relies on spherical harmonics, which often struggle to accurately capture high-frequency view-dependent effects such as sharp reflections and transparency. While hybrid approaches like Viewing Direction Gaussian Splatting (VDGS) mitigate this limitation using classical Multi-Layer Perceptrons (MLPs), they remain limited by the expressivity of classical networks in low-parameter regimes. In this paper, we introduce QuantumGS, a novel hybrid framework that integrates Variational Quantum Circuits (VQC) into the Gaussian Splatting pipeline. We propose a unique encoding strategy that maps the viewing direction directly onto the Bloch sphere, leveraging the natural geometry of qubits to represent 3D directional data. By replacing classical color-modulating networks with quantum circuits generated via a hypernetwork or conditioning mechanism, we achieve higher expressivity and better generalization. Source code is available in the supplementary material. Code is available at this https URL
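The idea of placing a viewing direction on the Bloch sphere can be illustrated with the textbook single-qubit parametrization. This is only a sketch of that standard mapping; the function names are hypothetical, and the abstract does not say that QuantumGS uses exactly this encoding.

```python
import cmath
import math


def direction_to_qubit(x, y, z):
    """Map a 3D direction to amplitudes (alpha, beta) of a single-qubit state.

    Uses |psi> = cos(theta/2)|0> + exp(i*phi) sin(theta/2)|1>, where theta and
    phi are the polar and azimuthal angles of the normalized direction.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    x, y, z = x / norm, y / norm, z / norm
    theta = math.acos(max(-1.0, min(1.0, z)))
    phi = math.atan2(y, x)
    alpha = complex(math.cos(theta / 2.0), 0.0)
    beta = cmath.exp(1j * phi) * math.sin(theta / 2.0)
    return alpha, beta


def qubit_to_direction(alpha, beta):
    """Recover the Bloch vector (x, y, z) from the amplitudes."""
    c = alpha.conjugate() * beta
    return (2.0 * c.real, 2.0 * c.imag, abs(alpha) ** 2 - abs(beta) ** 2)


alpha, beta = direction_to_qubit(1.0, 2.0, 2.0)
print(qubit_to_direction(alpha, beta))
```

The round trip returns the normalized input direction, which is why a qubit is a natural container for directional data: every pure state corresponds to exactly one point on the unit sphere.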
[CV-123] A General-Purpose Diversified 2D Seismic Image Dataset from NAMSS
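[Quick Read]: This paper addresses the shortcomings of current seismic datasets in diversity, geographic distribution, and structural consistency, in order to support more robust training and evaluation of machine learning models. The core problem is that existing seismic datasets (such as Parihaka and F3 Block) cover only a limited part of the seismic imaging feature space, which makes it hard to study how well modern methods such as Generative AI or transfer learning generalize under complex geological conditions. The key to the solution is building and publicly releasing the Unicamp-NAMSS dataset, a large, diverse, and geographically dispersed collection of 2588 cleaned and standardized 2D seismic sections from 122 survey areas, together with a region-disjoint split that prevents information leakage between the training, validation, and test sets, so that a model's adaptation to unseen geological and acquisition conditions can be assessed reliably. In addition, embedding-space analyses with convolutional neural networks and Transformer models confirm that the dataset varies substantially within and across regions while remaining structurally consistent, further supporting its use as a pretraining benchmark for downstream seismic-interpretation tasks such as self-supervised representation learning, super-resolution reconstruction, and attribute prediction.
Link: https://arxiv.org/abs/2602.04890
Authors: Lucas de Magalhães Araujo, Otávio Oliveira Napoli, Sandra Avila, Edson Borin
Affiliations: Instituto de Computação (Institute of Computing); Universidade Estadual de Campinas (State University of Campinas)
Categories: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: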
Abstract:We introduce the Unicamp-NAMSS dataset, a large, diverse, and geographically distributed collection of migrated 2D seismic sections designed to support modern machine learning research in geophysics. We constructed the dataset from the National Archive of Marine Seismic Surveys (NAMSS), which contains decades of publicly available marine seismic data acquired across multiple regions, acquisition conditions, and geological settings. After a comprehensive collection and filtering process, we obtained 2588 cleaned and standardized seismic sections from 122 survey areas, covering a wide range of vertical and horizontal sampling characteristics. To ensure reliable experimentation, we balanced the dataset so that no survey dominates the distribution, and partitioned it into non-overlapping macro-regions for training, validation, and testing. This region-disjoint split allows robust evaluation of generalization to unseen geological and acquisition conditions. We validated the dataset through quantitative and embedding-space analyses using both convolutional and transformer-based models. These analyses showed that Unicamp-NAMSS exhibits substantial variability within and across regions, while maintaining coherent structure across acquisition macro-region and survey types. Comparisons with widely used interpretation datasets (Parihaka and F3 Block) further demonstrated that Unicamp-NAMSS covers a broader portion of the seismic appearance space, making it a strong candidate for machine learning model pretraining. The dataset, therefore, provides a valuable resource for machine learning tasks, including self-supervised representation learning, transfer learning, benchmarking supervised tasks such as super-resolution or attribute prediction, and studying domain adaptation in seismic interpretation. 
Artificial Intelligence
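[AI-0] DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching
[Quick Read]: This paper addresses the inefficiency that arises when multi-agent systems use a fixed communication pattern during iterative problem solving. Existing methods usually rely on globally fixed communication paths, which adapt poorly to the changing information-exchange needs of different solving stages. The key to the solution is the DyTopo framework, which, guided by a manager, dynamically rebuilds a sparse directed communication graph in every round of reasoning. Conditioned on the current round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds them, applies semantic matching to select the useful communication edges, and passes private messages only along those edges, enabling on-demand, efficient cross-agent collaboration.
Link: https://arxiv.org/abs/2602.06039
Authors: Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: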
Abstract:Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager’s round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
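The query/key matching step described here can be pictured with a toy routing function. Everything below is an assumption made for illustration: the agent names, the hand-written three-dimensional "embeddings", the cosine-similarity score, and the fixed threshold. The abstract only states that descriptors are embedded and semantically matched.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)


def build_round_graph(queries, keys, threshold=0.8):
    """Return directed edges (sender, receiver) for one round.

    An edge is added when the sender's key (what it offers) is similar
    enough to the receiver's query (what it needs).
    """
    edges = []
    for receiver, query in queries.items():
        for sender, key in keys.items():
            if sender != receiver and cosine(query, key) >= threshold:
                edges.append((sender, receiver))
    return sorted(edges)


# Hypothetical embedded descriptors for three agents.
queries = {"coder": [1.0, 0.0, 0.0], "tester": [0.0, 1.0, 0.0], "planner": [0.0, 0.0, 1.0]}
keys = {"coder": [0.0, 0.9, 0.1], "tester": [0.0, 0.0, 1.0], "planner": [1.0, 0.1, 0.0]}

edges = build_round_graph(queries, keys)
print(edges)  # [('coder', 'tester'), ('planner', 'coder'), ('tester', 'planner')]
```

Rebuilding `edges` each round from fresh descriptors is what makes the topology dynamic; only three of the six possible directed edges are active in this toy round.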
[AI-1] Learning Event-Based Shooter Models from Virtual Reality Experiments
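[Quick Read]: This paper addresses the limited experimental scale of Virtual Reality (VR) when evaluating school-security interventions: every new intervention strategy requires recruiting a new participant cohort, which makes large-scale or iterative evaluation difficult and is especially restrictive for learning effective intervention strategies, which needs many training episodes. The key to the solution is a data-driven Discrete-Event Simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR experiments, enabling scalable evaluation and learning of intervention strategies without human subjects. Once validated as reproducing key empirical patterns, the DES can serve as a high-to-mid fidelity surrogate platform for developing and evaluating autonomous school-security strategies.
Link: https://arxiv.org/abs/2602.06023
Authors: Christopher A. McClurg, Alan R. Wagner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Preprint under review for conference publication. 9 pages, 4 figures, 4 tables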
Abstract:Virtual reality (VR) has emerged as a powerful tool for evaluating school security measures in high-risk scenarios such as school shootings, offering experimental control and high behavioral fidelity. However, assessing new interventions in VR requires recruiting new participant cohorts for each condition, making large-scale or iterative evaluation difficult. These limitations are especially restrictive when attempting to learn effective intervention strategies, which typically require many training episodes. To address this challenge, we develop a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies. We use the simulator to examine the impact of a robot-based shooter intervention strategy. Once shown to reproduce key empirical patterns, the DES enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects. Overall, this work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.
[AI-2] Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering
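[Quick Read]: This paper addresses the miscalibration that is widespread in Large Language Models (LLMs) after instruction tuning and preference alignment, that is, the mismatch between the probabilities a model outputs and its actual correctness. Existing methods usually depend on expensive retraining or optimize proxy objectives only indirectly related to correctness, so they struggle to improve calibration efficiently. The key to the solution is CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that uses weight-decay multi-layer perceptron (MLP) probes to extract distributed correctness signals from the model's internal activations and adjusts responses during inference accordingly. Without retraining, the method substantially improves multiple-choice question answering (MCQA) accuracy (+10% on average) and expected calibration error (ECE) (-50% on average), and the gains transfer to several unseen benchmark test sets, supporting the hypothesis that information inside the model can be extracted effectively with regularized probes.
Link: https://arxiv.org/abs/2602.06022
Authors: Miranda Muqing Miao, Young-Min Cho, Lyle Ungar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: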
Abstract:Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
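Since the results are reported in terms of expected calibration error, a minimal sketch of the usual equal-width-bin ECE estimator may help. The bin count, the half-open binning convention, and the function name are assumptions; the abstract does not specify the paper's exact ECE settings.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is placed in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Four answers given with 90% confidence, of which three are correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In this toy case the model claims 90% confidence but is right 75% of the time, so the ECE is about 0.15; a perfectly calibrated model would score 0.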
[AI-3] Optimism Stabilizes Thompson Sampling for Adaptive Inference
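[Quick Read]: This paper addresses the unstable inferential properties of Thompson sampling (TS) under adaptive data collection: classical asymptotic theory for sample means can fail in a multi-armed bandit because each arm's number of pulls is random and coupled with the action-selection rule. The study finds that optimism is the key mechanism for restoring stability, where stability requires each arm's pull count to concentrate around a deterministic scale, which in turn guarantees valid asymptotic inference. The paper presents two solutions: first, variance-inflated TS, proved stable for any $K \geq 2$, including the difficult case of multiple optimal arms; second, an optimistic modification that only adds a bonus to the posterior mean, which achieves the same stability. The conclusion is that suitably designed optimism lets TS support asymptotically valid inference at only a mild additional regret cost.
Link: https://arxiv.org/abs/2602.06014
Authors: Shunxing Yan, Han Zhong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: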
Abstract:Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify optimism as a key mechanism for restoring stability, a sufficient condition for valid asymptotic inference requiring each arm’s pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS \citep{halder2025stable} is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by \citet{halder2025stable} through extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
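The two optimistic variants can be pictured with a toy Gaussian bandit loop. This is a sketch under stated assumptions: unit-variance rewards, a flat prior so that each arm's posterior is N(sample mean, 1/n), a constant `inflation` factor, and a `bonus / sqrt(n)` mean shift. The `inflation` and `bonus` parameters are hypothetical knobs, and the abstract does not give the exact schedules analyzed in the paper.

```python
import math
import random


def thompson_gaussian(true_means, horizon, inflation=1.0, bonus=0.0, seed=0):
    """Run Gaussian Thompson sampling and return the pull count of each arm.

    inflation > 1 widens the sampling variance (variance-inflated variant);
    bonus > 0 shifts the sampling mean upward (mean-bonus variant).
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once to initialise the statistics
        else:
            samples = []
            for a in range(k):
                mean = sums[a] / counts[a]
                std = math.sqrt(inflation / counts[a])
                samples.append(rng.gauss(mean + bonus / math.sqrt(counts[a]), std))
            arm = max(range(k), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts


# Two optimal arms: the regime the abstract calls challenging.
print(thompson_gaussian([0.0, 0.5, 0.5], horizon=500, inflation=2.0))
```

Repeating the run over many seeds and looking at how the split between the two optimal arms varies is one informal way to see what "pull counts concentrating around a deterministic scale" is about.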
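[AI-4] AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
[Quick Read]: This paper addresses the lack of a systematic benchmark for evaluating multi-agent language models in natural-language-driven economic interaction. Existing benchmarks cannot effectively measure the ability of multiple agents to reach a deal through linguistic negotiation, especially when there are private constraints and product-dependent valuations. The key to the solution is AgenticPay, a benchmark and simulation framework that supports multi-round linguistic negotiation. Its core contributions are modeling markets in which buyers and sellers hold private constraints and product-dependent valuations; providing more than 110 tasks ranging from bilateral bargaining to many-to-many markets; and enabling quantitative evaluation through structured action extraction and metrics for feasibility, efficiency, and welfare. The framework reveals substantial performance gaps among current leading proprietary and open-weight large language models in long-horizon strategic reasoning, and provides an extensible foundation for studying language-based agentic commerce.
Link: https://arxiv.org/abs/2602.06008
Authors: Xianyang Liu, Shangding Gu, Dawn Song
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: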
Abstract:Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: this https URL.
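To make the feasibility and welfare vocabulary concrete, here is a sketch of the textbook quantities for a single bilateral deal. The function and field names are hypothetical, and these are standard bargaining definitions rather than the benchmark's actual metric implementations, which the abstract does not spell out.

```python
def evaluate_deal(price, buyer_value, seller_cost):
    """Score one agreed price against the two private constraints.

    A deal is feasible when the price is at least the seller's cost and
    at most the buyer's valuation; welfare is the total surplus created.
    """
    feasible = seller_cost <= price <= buyer_value
    return {
        "feasible": feasible,
        "buyer_surplus": buyer_value - price,
        "seller_surplus": price - seller_cost,
        "welfare": (buyer_value - seller_cost) if feasible else 0.0,
    }


# The buyer values the product at 100, the seller's cost is 60, they agree on 80.
print(evaluate_deal(price=80.0, buyer_value=100.0, seller_cost=60.0))
```

Note that welfare does not depend on the agreed price, only on whether a feasible deal was reached; the price determines how the surplus is divided between the two sides.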
[AI-5] Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
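[Quick Read]: This paper addresses the difficulty of adapting Generative Models to user preferences or constraints efficiently and robustly after training, known as the Reward Alignment problem. Existing methods typically achieve alignment through additional optimization steps after training, which is computationally expensive and not robust. The key to the solution is redesigning the generative model so that it is adaptable by construction: the authors propose Diamond Maps, a Stochastic Flow Map model that compresses many simulation steps into a single-step sampler while retaining the stochasticity needed for optimal reward alignment. This design makes value-function estimation at inference time more efficient and consistent, which allows strategies such as search, Sequential Monte Carlo, and Guidance to scale, and markedly improves adaptability to arbitrary reward functions.
Link: https://arxiv.org/abs/2602.05993
Authors: Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: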
Abstract:Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose “Diamond Maps”, stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
[AI-6] Geographically-aware Transformer-based Traffic Forecasting for Urban Motorway Digital Twins
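[Quick Read]: This paper addresses the limited accuracy of motorway traffic forecasting caused by spatio-temporal complexity and dynamic non-linear behavior, and in particular how to combine real-time observations with predicted traffic states to improve decision support when Digital Twin technology is applied to traffic management. The key to the solution is a Geographically-aware Transformer-based Traffic Forecasting (GATTF) model, which uses the Mutual Information (MI) between sensors to model their spatial relationships explicitly, markedly improving forecasting accuracy without increasing model complexity.
Link: https://arxiv.org/abs/2602.05983
Authors: Krešimir Kušić, Vinny Cahill, Ivana Dusparic
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE IV2026 37th IEEE Intelligent Vehicles Symposium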
Abstract:The operational effectiveness of digital-twin technology in motorway traffic management depends on the availability of a continuous flow of high-resolution real-time traffic data. To function as a proactive decision-making support layer within traffic management, a digital twin must also incorporate predicted traffic conditions in addition to real-time observations. Due to the spatio-temporal complexity and the time-variant, non-linear nature of traffic dynamics, predicting motorway traffic remains a difficult problem. Sequence-based deep-learning models offer clear advantages over classical machine learning and statistical models in capturing long-range, temporal dependencies in time-series traffic data, yet limitations in forecasting accuracy and model complexity point to the need for further improvements. To improve motorway traffic forecasting, this paper introduces a Geographically-aware Transformer-based Traffic Forecasting (GATTF) model, which exploits the geographical relationships between distributed sensors using their mutual information (MI). The model has been evaluated using real-time data from the Geneva motorway network in Switzerland and results confirm that incorporating geographical awareness through MI enhances the accuracy of GATTF forecasting compared to a standard Transformer, without increasing model complexity.
[AI-7] Clifford Kolmogorov-Arnold Networks
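[Quick Read]: This paper addresses the challenges that high-dimensional complexity and changing input domains pose for function approximation in arbitrary Clifford algebra spaces. The core problems are the exponential growth in computation for high-dimensional Clifford algebras and the poor model stability when the input data domain is not fixed. The key to the solution is the Clifford Kolmogorov-Arnold Network (ClKAN), which introduces Randomized Quasi Monte Carlo grid generation to mitigate the exponential scaling of high-dimensional algebras and designs new batch normalization strategies for variable input domains, yielding efficient and flexible function approximation.
Link: https://arxiv.org/abs/2602.05977
Authors: Matthias Wolff, Francesco Alesiani, Christof Duhme, Xiaoyi Jiang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication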
Abstract:We introduce Clifford Kolmogorov-Arnold Network (ClKAN), a flexible and efficient architecture for function approximation in arbitrary Clifford algebra spaces. We propose the use of Randomized Quasi Monte Carlo grid generation as a solution to the exponential scaling associated with higher dimensional algebras. Our ClKAN also introduces new batch normalization strategies to deal with variable domain input. ClKAN finds application in scientific discovery and engineering, and is validated in synthetic and physics inspired tasks.
[AI-8] Inverse Depth Scaling From Most Layers Being Similar
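[Quick Read]: This paper addresses the mechanism behind the different effects of depth and width on performance in Large Language Models (LLMs), and in particular quantifies how depth affects the loss. By analyzing LLMs and simplified residual networks, it finds that loss is inversely proportional to depth, which indicates that functionally similar layers reduce error through ensemble averaging rather than through compositional learning or the discretization of smooth dynamics. This regime is inefficient but robust, and may stem from the architectural bias of residual networks and from target functions that are incompatible with smooth dynamics. The study suggests that improving LLM efficiency will require architectural innovations that encourage compositional use of depth.
Link: https://arxiv.org/abs/2602.05970
Authors: Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Comments: 23 pages, 24 figures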
Abstract:Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
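A heuristic way to see why ensemble averaging would produce a 1/depth trend is the variance of a sample mean. The sketch below is not the paper's derivation; the symbols $D$ (depth), $y$ (target), $\varepsilon_\ell$ (per-layer error) and $\sigma^2$ are illustrative assumptions, and the errors are assumed independent across layers.

```latex
\hat{y}_D = \frac{1}{D}\sum_{\ell=1}^{D} \bigl(y + \varepsilon_\ell\bigr),
\qquad
\mathbb{E}[\varepsilon_\ell] = 0,\quad
\operatorname{Var}(\varepsilon_\ell) = \sigma^2,\quad
\varepsilon_\ell \text{ independent}
\;\Longrightarrow\;
\mathbb{E}\bigl[(\hat{y}_D - y)^2\bigr]
  = \frac{1}{D^2}\sum_{\ell=1}^{D}\operatorname{Var}(\varepsilon_\ell)
  = \frac{\sigma^2}{D}.
```

Under these assumptions the squared error falls as $1/D$, which is slow compared with what compositional use of depth could in principle achieve, matching the abstract's description of the regime as inefficient yet robust.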
[AI-9] Compound Deception in Elite Peer Review: A Failure Mode Taxonomy of 100 Fabricated Citations at NeurIPS 2025
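[Quick Read]: This paper addresses fabricated citations produced by Generative AI in academic writing, where a model invents non-existent sources (so-called hallucinations) that the current peer review process struggles to catch. The study analyzes 100 hallucinated citations in 53 papers from NeurIPS 2025 and shows that they exhibit compound failure modes (100% of cases involve two or more types), dominated by Total Fabrication, often combined with Semantic Hallucination and Identifier Hijacking, so that they pass as credible references and evade manual checking. The key to the solution is mandatory automated citation verification that checks the authenticity of references at submission time, to prevent fabricated citations from becoming normalized in the scientific literature.
Link: https://arxiv.org/abs/2602.05930
Authors: Samar Ansari
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: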
Abstract:Large language models (LLMs) are increasingly used in academic writing workflows, yet they frequently hallucinate by generating citations to sources that do not exist. This study analyzes 100 AI-generated hallucinated citations that appeared in papers accepted by the 2025 Conference on Neural Information Processing Systems (NeurIPS), one of the world’s most prestigious AI conferences. Despite review by 3-5 expert researchers per paper, these fabricated citations evaded detection, appearing in 53 published papers (approx. 1% of all accepted papers). We develop a five-category taxonomy that classifies hallucinations by their failure mode: Total Fabrication (66%), Partial Attribute Corruption (27%), Identifier Hijacking (4%), Placeholder Hallucination (2%), and Semantic Hallucination (1%). Our analysis reveals a critical finding: every hallucination (100%) exhibited compound failure modes. The distribution of secondary characteristics was dominated by Semantic Hallucination (63%) and Identifier Hijacking (29%), which often appeared alongside Total Fabrication to create a veneer of plausibility and false verifiability. These compound structures exploit multiple verification heuristics simultaneously, explaining why peer review fails to detect them. The distribution exhibits a bimodal pattern: 92% of contaminated papers contain 1-2 hallucinations (minimal AI use) while 8% contain 4-13 hallucinations (heavy reliance). These findings demonstrate that current peer review processes do not include effective citation verification and that the problem extends beyond NeurIPS to other major conferences, government reports, and professional consulting. We propose mandatory automated citation verification at submission as an implementable solution to prevent fabricated citations from becoming normalized in scientific literature.
[AI-10] Quantum Reinforcement Learning with Transformers for the Capacitated Vehicle Routing Problem
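[Quick Read]: This paper addresses the Capacitated Vehicle Routing Problem (CVRP): planning routes for multiple delivery vehicles under vehicle load limits so as to minimize total travel distance and improve the compactness and coherence of the route structure. The key to the solution is a Reinforcement Learning (RL) framework that compares classical, fully quantum, and hybrid quantum-classical variants of an A2C (Advantage Actor-Critic) agent, each incorporating a Transformer architecture whose self-attention and cross-attention mechanisms model the complex dependencies among vehicles, customers, and the depot. Experiments show that the hybrid quantum-classical architecture outperforms the purely classical model on routing distance, route compactness, and route overlap, with greater robustness and better-structured routes, highlighting the potential of hybrid quantum-enhanced reinforcement learning for complex combinatorial optimization problems.
Link: https://arxiv.org/abs/2602.05920
Authors: Eva Andrés
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 22 pages, 12 figures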
Abstract:This paper addresses the Capacitated Vehicle Routing Problem (CVRP) by comparing classical and quantum Reinforcement Learning (RL) approaches. An Advantage Actor-Critic (A2C) agent is implemented in classical, full quantum, and hybrid variants, integrating transformer architectures to capture the relationships between vehicles, clients, and the depot through self- and cross-attention mechanisms. The experiments focus on multi-vehicle scenarios with capacity constraints, considering 20 clients and 4 vehicles, and are conducted over ten independent runs. Performance is assessed using routing distance, route compactness, and route overlap. The results show that all three approaches are capable of learning effective routing policies. However, quantum-enhanced models outperform the classical baseline and produce more robust route organization, with the hybrid architecture achieving the best overall performance across distance, compactness, and route overlap. In addition to quantitative improvements, qualitative visualizations reveal that quantum-based models generate more structured and coherent routing solutions. These findings highlight the potential of hybrid quantum-classical reinforcement learning models for addressing complex combinatorial optimization problems such as the CVRP.
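As a reminder of what a CVRP solution has to satisfy, the sketch below checks capacity feasibility and computes the total routing distance for a tiny instance. The data layout (dictionaries keyed by node id, depot as node 0), the function names, and the instance itself are made up for illustration and are not taken from the paper's experiments.

```python
import math


def route_length(route, coords, depot=0):
    """Length of depot -> customers in order -> depot."""
    path = [depot] + list(route) + [depot]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))


def evaluate_solution(routes, coords, demand, capacity, depot=0):
    """Return (feasible, total_distance) for a list of vehicle routes."""
    visited = sorted(c for route in routes for c in route)
    customers = sorted(set(coords) - {depot})
    within_capacity = all(sum(demand[c] for c in route) <= capacity for route in routes)
    # Feasible when every customer is served exactly once and no vehicle is overloaded.
    feasible = visited == customers and within_capacity
    total = sum(route_length(route, coords, depot) for route in routes)
    return feasible, total


coords = {0: (0, 0), 1: (3, 0), 2: (3, 4), 3: (0, 4)}
demand = {1: 2, 2: 2, 3: 3}

print(evaluate_solution([[1, 2], [3]], coords, demand, capacity=4))  # (True, 20.0)
```

Serving all three customers with a single vehicle would give a shorter tour but exceeds the capacity of 4, which is exactly the tension an RL policy for this problem has to learn to manage.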
[AI-11] Verification of the Implicit World Model in a Generative Model via Adversarial Sequences ICLR2026
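[Quick Read]: This paper addresses whether generative sequence models accurately capture the true structure of a language or rule system (the "world model") during training, and in particular how to verify the soundness of what they generate, that is, whether they always produce valid sequences. Theoretical results indicate that one can expect soundness at best (generating valid sequences) and cannot guarantee coverage of all valid sequences. To tackle this, the authors propose a verification method based on adversarial sequence generation, whose key idea is an adversary that generates valid sequences designed to induce the model to predict an invalid next move, thereby exposing flaws in the model's judgment of validity. The method can be used not only to falsify soundness but also for fine-grained analysis of failure modes, and board state probes are introduced to further assess training and attack strategies; the study ultimately finds that in most models the board state plays no causal role in next-move prediction.
Link: https://arxiv.org/abs/2602.05903
Authors: András Balogh, Márk Jelasity
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICLR 2026. Code, datasets, and models are available at this https URL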
Abstract:Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether – or to what extent – sample-based training is able to capture the true structure of these languages, often referred to as the “world model”. Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule-based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine-grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high-quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.
[AI-12] Regularized Calibration with Successive Rounding for Post-Training Quantization
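[Quick Read]: This paper addresses the high memory footprint and latency that storing and accessing billions of parameters imposes on deployed Large Language Models (LLMs), using Post-Training Quantization (PTQ) for efficient inference. The core challenge is designing a quantization objective and rounding strategy that preserve model performance without retraining. The key to the solution is a regularization mechanism based on interpolating between symmetric and asymmetric calibration, which keeps the structure of the conventional quadratic quantization objective while adding robustness to activation mismatch. The authors further derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, and a bounded-search extension that offers a controllable trade-off between quantization quality and compute cost. Experiments show that the method consistently outperforms existing PTQ baselines across multiple LLM families, bit-widths, and benchmarks.
Link: https://arxiv.org/abs/2602.05902
Authors: Seohyeon Cha, Huancheng Chen, Dongjun Kim, Haoran Zhang, Kevin Chan, Gustavo de Veciana, Haris Vikalo
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: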
Abstract:Large language models (LLMs) deliver robust performance across diverse applications, yet their deployment often faces challenges due to the memory and latency costs of storing and accessing billions of parameters. Post-training quantization (PTQ) enables efficient inference by mapping pretrained weights to low-bit formats without retraining, but its effectiveness depends critically on both the quantization objective and the rounding procedure used to obtain low-bit weight representations. In this work, we show that interpolating between symmetric and asymmetric calibration acts as a form of regularization that preserves the standard quadratic structure used in PTQ while providing robustness to activation mismatch. Building on this perspective, we derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, as well as a bounded-search extension that allows for an explicit trade-off between quantization quality and the compute cost. Experiments across multiple LLM families, quantization bit-widths, and benchmarks demonstrate that the proposed bounded search based on a regularized asymmetric calibration objective consistently improves perplexity and accuracy over PTQ baselines, while incurring only modest and controllable additional computational cost.
[AI-13] Parity Sensitivity and Transformers
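[Quick Read]: This paper addresses a basic question about the computational limits of the Transformer architecture: can a one-layer Transformer compute the PARITY function (output 1 when the number of 1s in the input sequence is odd, and 0 otherwise)? Previously known PARITY constructions all need at least two layers and rely on impractical assumptions, such as length-dependent positional encoding, hardmax, LayerNorm without the regularization parameter, or incompatibility with causal masking. The paper gives a new Transformer construction for PARITY whose key feature is that it uses softmax, a length-independent and polynomially bounded positional encoding, and no LayerNorm, and works both with and without causal masking, which makes it considerably more practical. The authors also give the first lower bound for Transformers solving PARITY, showing that one layer with a single head cannot do it.
Link: https://arxiv.org/abs/2602.05896
Authors: Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages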
Abstract:The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY – or more generally – which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY – by showing that it cannot be done with only one layer and one head.
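For reference, PARITY and the notion of sensitivity mentioned in the title can be written down in a few lines. The sketch below uses the standard definitions (sensitivity at an input is the number of single-bit flips that change the output); the helper names are illustrative and nothing here reproduces the paper's transformer construction.

```python
def parity(bits):
    """Return 1 if the number of 1s is odd, else 0."""
    return sum(bits) % 2


def logical_or(bits):
    """Return 1 if any bit is 1, else 0 (used only for comparison)."""
    return 1 if any(bits) else 0


def sensitivity_at(f, bits):
    """Count the positions whose flip changes the value of f on this input."""
    base = f(bits)
    changed = 0
    for i in range(len(bits)):
        flipped = list(bits)
        flipped[i] ^= 1
        if f(flipped) != base:
            changed += 1
    return changed


x = [1, 0, 1, 1]
print(parity(x), sensitivity_at(parity, x), sensitivity_at(logical_or, x))  # 1 4 0
```

Flipping any single bit flips PARITY, so its sensitivity equals the input length at every input, whereas OR on the same input is insensitive to every single flip. That maximal sensitivity is what makes PARITY a demanding test case for attention-based models.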
[AI-14] Metric Hedonic Games on the Line AAMAS2026
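[Quick read]: This paper addresses the representational intractability of classical hedonic games, where a utility must be specified explicitly for every possible coalition and the number of coalitions grows exponentially. The key idea is a new succinct model: each agent has a fixed type-value, and its cost in a coalition depends on the differences between its own value and those of the other members, measured by a distance threshold, the maximum difference, or the average difference. This captures natural situations such as athletes forming training groups by ability or voters partitioning along a political spectrum. On this basis the paper studies the existence, properties, and quality (price of anarchy and price of stability) of stable coalition structures, as well as the effect of capping the number of coalitions, and finds rich and partly counter-intuitive behavior even for metric distances on a line.
Link: https://arxiv.org/abs/2602.05888
Authors: Merlin de la Haye,Pascal Lenzner,Farehe Soheil,Marcus Wunderlich
Affiliations: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: accepted at AAMAS 2026, full version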
Abstract:Hedonic games are fundamental models for investigating the formation of coalitions among a set of strategic agents, where every agent has a certain utility for every possible coalition of agents it can be part of. To avoid the intractability of defining exponentially many utilities for all possible coalitions, many variants with succinct representations of the agents’ utility functions have been devised and analyzed, e.g., modified fractional hedonic games by Monaco et al. [JAAMAS 2020]. We extend this by studying a novel succinct variant that is related to modified fractional hedonic games. In our model, each agent has a fixed type-value and an agent’s cost for some given coalition is based on the differences between its value and those of the other members of its coalition. This allows modeling natural situations like athletes forming training groups with similar performance levels or voters that partition themselves along a political spectrum. In particular, we investigate natural variants where an agent’s cost is defined by distance thresholds, or by the maximum or average value difference to the other agents in its coalition. For these settings, we study the existence of stable coalition structures, their properties, and their quality in terms of the price of anarchy and the price of stability. Further, we investigate the impact of limiting the maximum number of coalitions. Despite the simple setting with metric distances on a line, we uncover a rich landscape of models, partially with counter-intuitive behavior. Also, our focus on both swap stability and jump stability allows us to study the influence of fixing the number and the size of the coalitions. Overall, we find that stable coalition structures always exist but that their properties and quality can vary widely.
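The maximum-difference and average-difference cost variants described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the average is taken over the other members only, and a singleton coalition is given cost 0, none of which is confirmed by the abstract.

```python
def max_cost(agent, coalition, value):
    """Largest distance on the line from `agent` to any other coalition member."""
    others = [j for j in coalition if j != agent]
    if not others:
        return 0.0  # assumption: an agent alone incurs no cost
    return max(abs(value[agent] - value[j]) for j in others)


def avg_cost(agent, coalition, value):
    """Mean distance on the line from `agent` to the other coalition members."""
    others = [j for j in coalition if j != agent]
    if not others:
        return 0.0  # assumption: an agent alone incurs no cost
    return sum(abs(value[agent] - value[j]) for j in others) / len(others)


# Three agents placed on the line by their type-values.
value = {"a": 0.0, "b": 1.0, "c": 4.0}
coalition = {"a", "b", "c"}
cost_a_max = max_cost("a", coalition, value)  # distance from a to c
cost_a_avg = avg_cost("a", coalition, value)  # mean of |0-1| and |0-4|
```

Under either cost, agent "c" is far from the rest of its coalition, which is the kind of tension that the stability notions in the paper are meant to capture.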
[AI-15] A Guide to Large Language Models in Modeling and Simulation: From Core Techniques to Critical Challenges
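[Quick read]: This paper addresses the practical pitfalls that arise as large language models (LLMs) are increasingly used in Modeling and Simulation (MS) workflows, including non-determinism, misuse of knowledge augmentation (retrieval-augmented generation, RAG, and low-rank adaptation, LoRA), poor decomposition of MS data, and ill-chosen hyper-parameter settings. The key is a methodology that combines principled design choices, diagnostic strategies, and empirical evaluation, helping modelers make informed decisions about when, how, and whether to rely on LLMs, and avoid degraded performance or lost information.
Link: https://arxiv.org/abs/2602.05883
Authors: Philippe J. Giabbanelli
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Book chapter. Accepted in Artificial Intelligence in Modeling and Simulation, Philippe J. Giabbanelli and Istvan David (eds). Series on Simulation Foundations, Methods and Applications. Springer, Cham. Series ISSN: 2195-2817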
Abstract:Large language models (LLMs) have rapidly become familiar tools to researchers and practitioners. Concepts such as prompting, temperature, or few-shot examples are now widely recognized, and LLMs are increasingly used in Modeling and Simulation (MS) workflows. However, practices that appear straightforward may introduce subtle issues, unnecessary complexity, or may even lead to inferior results. Adding more data can backfire (e.g., deteriorating performance through model collapse or inadvertently wiping out existing guardrails); spending time on fine-tuning a model can be unnecessary without a prior assessment of what it already knows; setting the temperature to 0 is not sufficient to make LLMs deterministic; and providing a large volume of MS data as input can be excessive (LLMs cannot attend to everything), while naive simplifications can lose information. We aim to provide comprehensive and practical guidance on how to use LLMs, with an emphasis on MS applications. We discuss common sources of confusion, including non-determinism, knowledge augmentation (including RAG and LoRA), decomposition of MS data, and hyper-parameter settings. We emphasize principled design choices, diagnostic strategies, and empirical evaluation, with the goal of helping modelers make informed decisions about when, how, and whether to rely on LLMs.
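The remark about temperature can be illustrated with a small sketch. Temperature only rescales logits before the softmax, so a low value concentrates probability on the top token. The second half shows one commonly cited contributor to residual non-determinism, floating-point addition being order-dependent; this is an assumption added here for illustration, since the abstract does not list causes. The function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]
warm = softmax_with_temperature(logits, 1.0)   # spread over all tokens
cold = softmax_with_temperature(logits, 0.05)  # almost all mass on token 0

# Floating-point addition is not associative, so summing the same numbers
# in a different order (as parallel hardware may do) can change the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
order_matters = left != right
```

When two logits are nearly tied, a difference of this size can be enough to change which token is selected, even under greedy decoding.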
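[AI-16] Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy
[Quick read]: This paper addresses the security challenges created by integrating LLM-based conversational agents into vehicles. When such assistants interact with external services via protocols such as Google’s Agent-to-Agent (A2A), manipulations can propagate through natural-language payloads, with consequences ranging from driver distraction to unauthorized vehicle control. Existing AI security frameworks lack the “separation of concerns” that is standard in safety-critical systems engineering: they mix what is being protected (assets) with how it is attacked (attack paths). The key is AgentHeLLM, a threat modeling framework that formally separates asset identification from attack path analysis. It combines a human-centric asset taxonomy (derived from harm-oriented “victim modeling” and inspired by the Universal Declaration of Human Rights) with a graph-based model that distinguishes poison paths (malicious data propagation) from trigger paths (activation actions). An open-source tool, AgentHeLLM Attack Path Generator, automates multi-stage threat discovery using a bi-level search strategy.
Link: https://arxiv.org/abs/2602.05877
Authors: Lukas Stappen,Ahmet Erkan Turan,Johann Hagerer,Georg Groh
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: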
Abstract:The integration of Large Language Model (LLM)-based conversational agents into vehicles creates novel security challenges at the intersection of agentic AI, automotive safety, and inter-agent communication. As these intelligent assistants coordinate with external services via protocols such as Google’s Agent-to-Agent (A2A), they establish attack surfaces where manipulations can propagate through natural language payloads, potentially causing severe consequences ranging from driver distraction to unauthorized vehicle control. Existing AI security frameworks, while foundational, lack the rigorous “separation of concerns” standard in safety-critical systems engineering by co-mingling the concepts of what is being protected (assets) with how it is attacked (attack paths). This paper addresses this methodological gap by proposing a threat modeling framework called AgentHeLLM (Agent Hazard Exploration for LLM Assistants) that formally separates asset identification from attack path analysis. We introduce a human-centric asset taxonomy derived from harm-oriented “victim modeling” and inspired by the Universal Declaration of Human Rights, and a formal graph-based model that distinguishes poison paths (malicious data propagation) from trigger paths (activation actions). We demonstrate the framework’s practical applicability through an open-source attack path suggestion tool AgentHeLLM Attack Path Generator that automates multi-stage threat discovery using a bi-level search strategy.
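[AI-17] Beyond Manual Planning: Seating Allocation for Large Organizations
[Quick read]: This paper addresses the Hierarchical Seating Allocation Problem (HSAP): optimally assigning hierarchically structured organizational teams to seats on a floor plan so that teams with close hierarchical relationships sit near one another, replacing the infrequent and suboptimal replanning that results from doing this manually. The key is an end-to-end framework that first computes distances between seats scalably using a probabilistic road map (PRM) and rapidly-exploring random trees (RRT), and then combines heuristic search and dynamic programming to solve the HSAP using integer programming.
Link: https://arxiv.org/abs/2602.05875
Authors: Anton Ipsen,Michael Cashmore,Kirsty Fielding,Nicolas Marchesotti,Parisa Zehtabi,Daniele Magazzeni,Manuela Veloso
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: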
Abstract:We introduce the Hierarchical Seating Allocation Problem (HSAP) which addresses the optimal assignment of hierarchically structured organizational teams to physical seating arrangements on a floor plan. This problem is driven by the necessity for large organizations with large hierarchies to ensure that teams with close hierarchical relationships are seated in proximity to one another, such as ensuring a research group occupies a contiguous area. Currently, this problem is managed manually, leading to infrequent and suboptimal replanning efforts. To alleviate this manual process, we propose an end-to-end framework to solve the HSAP. It combines a scalable approach for calculating the distance between any pair of seats, based on a probabilistic road map (PRM) and rapidly-exploring random trees (RRT), with heuristic search and dynamic programming to solve the HSAP using integer programming. We demonstrate our approach on instances of different sizes by evaluating the PRM framework and subsequent allocations both quantitatively and qualitatively.
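A roadmap such as a PRM is a graph of sampled waypoints, so one plausible way to obtain seat-to-seat walking distances is a shortest-path query over that graph. The sketch below uses Dijkstra's algorithm on a tiny hand-written roadmap; the abstract does not say how distances are extracted, so the algorithm choice, node names, and edge weights are all assumptions for illustration.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: shortest path length from source to each reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


# Hypothetical roadmap: seats and a hallway waypoint, with walking distances.
roadmap = {
    "seat_1": [("hall", 1.0), ("seat_3", 5.0)],
    "hall": [("seat_1", 1.0), ("seat_2", 2.0)],
    "seat_2": [("hall", 2.0), ("seat_3", 1.5)],
    "seat_3": [("seat_1", 5.0), ("seat_2", 1.5)],
}
from_seat_1 = shortest_distances(roadmap, "seat_1")
```

Running the query once per seat gives a full pairwise distance matrix that a downstream allocation step could consume.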
[AI-18] BABE: Biology Arena BEnchmark
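[Quick read]: This paper addresses the fact that existing AI benchmarks in biology rarely test experimental reasoning, in particular a model’s ability to integrate experimental results with contextual knowledge to reach meaningful conclusions. The key is BABE (Biology Arena BEnchmark), a benchmark built from peer-reviewed research papers and real-world biological studies so that its tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. It evaluates causal reasoning and cross-scale inference, giving a more authentic measure of whether AI systems can reason like practicing scientists.
Link: https://arxiv.org/abs/2602.05857
Authors: Junting Zhou,Jin Chen,Linfeng Hao,Denghui Cao,Zheyu Wang,Qiguang Chen,Chaoyou Fu,Jiaze Chen,Yuchen Wu,Ge Zhang,Mingxuan Wang,Wenhao Huang,Tong Yang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: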
Abstract:The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.
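[AI-19] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
[Quick read]: This paper addresses the weak performance of existing omnivideo models on audio-visual understanding tasks. The key is OmniVideo-R1, a reinforced framework that strengthens mixed-modality reasoning through two strategies: (1) query-intensive grounding based on self-supervised learning paradigms, which improves how precisely the model localizes multimodal cues; and (2) modality-attentive fusion built on contrastive learning paradigms, which weights information from different modalities dynamically. Experiments on multiple benchmarks show that it consistently outperforms strong baselines and generalizes robustly.
Link: https://arxiv.org/abs/2602.05847
Authors: Zhangquan Chen,Jiale Tao,Ruihuang Li,Yihao Hu,Ruitao Chen,Zhantao Yang,Xinlei Yu,Haodong Jing,Manyuan Zhang,Shuai Shao,Biao Wang,Qinglin Lu,Ruqi Huang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 19 pages, 12 figures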
Abstract:While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to “think with omnimodal cues” by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
[AI-20] FHAIM: Fully Homomorphic AIM For Private Synthetic Data Generation
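[Quick read]: This paper addresses the difficulty of sharing high-quality data in privacy-sensitive domains such as healthcare, education, and finance, where privacy and regulation keep data in silos and leave AI underutilized. Existing synthetic data generation (SDG) services usually require data holders to trust a third-party provider with access to the raw data, which risks privacy leakage. The paper proposes FHAIM, the first framework based on fully homomorphic encryption (FHE) for training a marginal-based synthetic data generator on encrypted data. The key is adapting the widely used AIM algorithm to the FHE setting with novel FHE protocols, so that the data stays encrypted throughout training and synthetic data is released only with differential privacy guarantees, preserving AIM’s performance at feasible runtimes.
Link: https://arxiv.org/abs/2602.05838
Authors: Mayank Kumar,Qian Lou,Paulo Barreto,Martine De Cock,Sikha Pentyala
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: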
Abstract:Data is the lifeblood of AI, yet much of the most valuable data remains locked in silos due to privacy and regulations. As a result, AI remains heavily underutilized in many of the most important domains, including healthcare, education, and finance. Synthetic data generation (SDG), i.e. the generation of artificial data with a synthesizer trained on real data, offers an appealing solution to make data available while mitigating privacy concerns. However, existing SDG-as-a-service workflows require data holders to trust providers with access to private data. We propose FHAIM, the first fully homomorphic encryption (FHE) framework for training a marginal-based synthetic data generator on encrypted tabular data. FHAIM adapts the widely used AIM algorithm to the FHE setting using novel FHE protocols, ensuring that the private data remains encrypted throughout and is released only with differential privacy guarantees. Our empirical analysis shows that FHAIM preserves the performance of AIM while maintaining feasible runtimes.
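To make "marginal-based" concrete, the sketch below measures a one-way marginal (a histogram over one column) and perturbs it with Gaussian noise, the usual shape of a differentially private measurement. It runs in plaintext and involves no encryption, so it is not FHAIM; the function name, the toy records, and the noise scale are assumptions for illustration, and calibrating `sigma` to a privacy budget is out of scope.

```python
import random
from collections import Counter


def noisy_marginal(records, column, domain, sigma, rng):
    """Histogram of `column` over `domain`, with Gaussian noise added to each count."""
    counts = Counter(r[column] for r in records)
    return {v: counts.get(v, 0) + rng.gauss(0.0, sigma) for v in domain}


# Hypothetical tabular records with a single categorical column.
records = [{"dept": "A"}, {"dept": "A"}, {"dept": "B"}]
domain = ["A", "B", "C"]

rng = random.Random(0)  # seeded so the example is reproducible
exact = noisy_marginal(records, "dept", domain, sigma=0.0, rng=rng)
noisy = noisy_marginal(records, "dept", domain, sigma=1.0, rng=rng)
```

A marginal-based synthesizer fits a model to a collection of such noisy measurements and then samples synthetic rows from it, so the raw records are only ever touched through the noisy counts.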
[AI-21] Learning Compact Boolean Networks
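[Quick read]: This paper addresses the high inference cost of floating-point neural networks in resource-constrained settings by exploring Boolean networks as an alternative; the difficulty is learning Boolean networks that are both compact and accurate. The key lies in three contributions: a strategy for learning efficient connections with no additional parameters and negligible computational overhead; a new convolutional Boolean architecture that exploits locality to reduce the number of Boolean operations; and an adaptive discretization strategy that reduces the accuracy drop when converting a continuous-valued network into a Boolean one. On standard vision benchmarks, the method’s accuracy-versus-computation Pareto front significantly outperforms the prior state of the art.
Link: https://arxiv.org/abs/2602.05830
Authors: Shengpu Wang,Yuhao Mao,Yani Zhang,Martin Vechev
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: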
Abstract:Floating-point neural networks dominate modern machine learning but incur substantial inference cost, motivating interest in Boolean networks for resource-constrained settings. However, learning compact and accurate Boolean networks is challenging due to their combinatorial nature. In this work, we address this challenge from three different angles: learned connections, compact convolutions and adaptive discretization. First, we propose a novel strategy to learn efficient connections with no additional parameters and negligible computational overhead. Second, we introduce a novel convolutional Boolean architecture that exploits locality with fewer Boolean operations than existing methods. Third, we propose an adaptive discretization strategy to reduce the accuracy drop when converting a continuous-valued network into a Boolean one. Extensive results on standard vision benchmarks demonstrate that the Pareto front of accuracy vs. computation of our method significantly outperforms prior state-of-the-art, achieving better accuracy with up to 37x fewer Boolean operations.
[AI-22] TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning
[Quick read]: This paper addresses two limitations of current large language models (LLMs) on temporal knowledge graph question answering (TKGQA): they are prone to reasoning hallucinations under complex temporal constraints, and static prompting limits their autonomy and generalization because it is not optimized through dynamic interaction with temporal knowledge graph (TKG) environments. The key is TKG-Thinker, an agent with autonomous planning and adaptive retrieval that is trained in two stages: supervised fine-tuning (SFT) on chain-of-thought data to instill core planning capabilities, followed by reinforcement learning (RL) with multi-dimensional rewards to refine reasoning policies under intricate temporal constraints.
Link: https://arxiv.org/abs/2602.05818
Authors: Zihao Jiang,Miao Peng,Zhenyan Shan,Wenjie Xu,Ben Liu,Gong Chen,Ziqi Gao,Min Peng
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:Temporal knowledge graph question answering (TKGQA) aims to answer time-sensitive questions by leveraging temporal knowledge bases. While Large Language Models (LLMs) demonstrate significant potential in TKGQA, current prompting strategies constrain their efficacy in two primary ways. First, they are prone to reasoning hallucinations under complex temporal constraints. Second, static prompting limits model autonomy and generalization, as it lacks optimization through dynamic interaction with temporal knowledge graph (TKG) environments. To address these limitations, we propose TKG-Thinker, a novel agent equipped with autonomous planning and adaptive retrieval capabilities for reasoning over TKGs. Specifically, TKG-Thinker performs in-depth temporal reasoning through dynamic multi-turn interactions with TKGs via a dual-training strategy. We first apply Supervised Fine-Tuning (SFT) with chain-of-thought data to instill core planning capabilities, followed by a Reinforcement Learning (RL) stage that leverages multi-dimensional rewards to refine reasoning policies under intricate temporal constraints. Experimental results on benchmark datasets with three open-source LLMs show that TKG-Thinker achieves state-of-the-art performance and exhibits strong generalization across complex TKGQA settings.
[AI-23] STProtein: predicting spatial protein expression from multi-omics data AAAI2026
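[Quick read]: This paper addresses data imbalance in spatial multi-omics integration: spatial transcriptomics data is relatively abundant, while spatial proteomics data is scarce because of technical limitations and high costs. The authors propose STProtein, a framework that uses graph neural networks (GNNs) with a multi-task learning strategy to predict unknown spatial protein expression from more accessible spatial multi-omics data such as spatial transcriptomics. They expect this to ease the scarcity of spatial proteomics data and to advance the integration of spatial multi-omics.
Link: https://arxiv.org/abs/2602.05811
Authors: Zhaorui Jiang,Yingfang Yuan,Lei Hu,Wei Pang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: STProtein: predicting spatial protein expression from multi-omics data is accepted SPARTA_AAAI2026 Oral GitHub: this https URL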
Abstract:The integration of spatial multi-omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge we propose STProtein, a novel framework leveraging graph neural networks with a multi-task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi-omics data, such as spatial transcriptomics. We believe that STProtein can effectively address the scarcity of spatial proteomics, accelerating the integration of spatial multi-omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological “Dark Matter”.
[AI-24] NEX: Neuron Explore-Exploit Scoring for Label-Free Chain-of-Thought Selection and Model Ranking
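[Quick read]: This paper addresses wasted compute and degraded performance caused by over-exploration (“overthinking”) in LLM reasoning, and in particular the lack of a reliable label-free signal for selecting among candidate reasoning traces. Models increasingly sample multiple chain-of-thought traces or search over merged checkpoints, but this exploration can be redundant and is often unsupervised on the target distribution. The key is NEX, a white-box, label-free scoring framework that views reasoning as alternating E-phases (exploration) and X-phases (exploitation). It detects E-phases as spikes in newly activated MLP neurons per token, infers the E-X transitions with a sticky two-state hidden Markov model (HMM), and credits neurons introduced in an E-phase according to whether they are reused in the following X span. The result is a set of interpretable neuron weights and a single Good-Mass Fraction score for ranking candidate responses and merged variants without labels; the score predicts downstream accuracy and identifies better model variants.
Link: https://arxiv.org/abs/2602.05805
Authors: Kang Chen,Zhuoka Feng,Sihan Zhao,Kai Xiong,Junjie Nian,Yaoning Wang,Changyi Xiao,Yixin Cao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages, 9 figures, 5 tables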
Abstract:Large language models increasingly spend inference compute sampling multiple chain-of-thought traces or searching over merged checkpoints. This shifts the bottleneck from generation to selection, often without supervision on the target distribution. We show entropy-based exploration proxies follow an inverted-U with accuracy, suggesting extra exploration can become redundant and induce overthinking. We propose NEX, a white-box label-free unsupervised scoring framework that views reasoning as alternating E-phase (exploration) and X-phase (exploitation). NEX detects E-phase as spikes in newly activated MLP neurons per token from sparse activation caches, then uses a sticky two-state HMM to infer E-X phases and credits E-introduced neurons by whether they are reused in the following X span. These signals yield interpretable neuron weights and a single Good-Mass Fraction score to rank candidate responses and merged variants without task answers. Across reasoning benchmarks and Qwen3 merge families, NEX computed on a small unlabeled activation set predicts downstream accuracy and identifies better variants; we further validate the E-X signal with human annotations and provide causal evidence via “Effective-vs-Redundant” neuron transfer.
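The first signal NEX relies on, the number of newly activated neurons per token, can be sketched as below. The sketch assumes "newly activated" means "not active at any earlier token of the same trace" and represents each token by a set of active neuron ids; both are illustrative assumptions. The HMM segmentation and the reuse-based crediting step are not shown.

```python
def newly_activated_counts(active_per_token):
    """For each token, count the neurons that were not active at any earlier token."""
    seen = set()
    counts = []
    for active in active_per_token:
        new = set(active) - seen
        counts.append(len(new))
        seen |= new
    return counts


# Hypothetical sparse activation cache: one set of active neuron ids per token.
trace = [{1, 2}, {2, 3}, {3}, {4, 5, 6}]
counts = newly_activated_counts(trace)
```

In this toy trace the last token brings in three unseen neurons at once, the kind of spike that would be read as the start of an exploration phase.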
[AI-25] Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes
[Quick read]: This paper addresses the poor code completion (CC) performance of large language models (LLMs) on private code repositories in industrial settings, where models struggle to generate code consistent with proprietary code they have never seen. The key is an automated data-processing mechanism based on semantic scopes that turns the repository’s code into training pairs, combined with two customization strategies, retrieval-augmented generation (RAG) and supervised fine-tuning (FT), so that the model learns repository-specific patterns. Experiments show that moderately sized customized models can significantly outperform much larger uncustomized models on real enterprise cases.
Link: https://arxiv.org/abs/2602.05780
Authors: Ulrich Finkler,Irene Manotas,Wei Zhang,Geert Janssen,Octavian Popescu,Shyam Ramji
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:
Abstract:Code completion (CC) is a task frequently used by developers when working in collaboration with LLM-based programming assistants. Despite the increased performance of LLMs on public benchmarks, out of the box LLMs still have a hard time generating code that aligns with a private code repository not previously seen by the model’s training data. Customizing code LLMs to a private repository provides a way to improve the model performance. In this paper we present our approach for automated LLM customization based on semantic scopes in the code. We evaluate LLMs on real industry cases with two private enterprise code repositories with two customization strategies: Retrieval-Augmented Generation (RAG) and supervised Fine-Tuning (FT). Our mechanism for ingesting the repository’s data and formulating the training data pairs with semantic scopes helps models to learn the underlying patterns specific to the repository, providing more precise code to developers and helping to boost their productivity. The code completions of moderately sized customized models can be significantly better than those of uncustomized models of much larger capacity. We also include an analysis of customization on two public benchmarks and present opportunities for future work.
[AI-26] Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
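[Quick read]: This paper addresses a training-decoding discrepancy in speculative decoding for large language models (LLMs) and multimodal large language models (MLLMs): existing methods optimize a single greedy trajectory, whereas decoding verifies and ranks multiple sampled draft paths. The key is Variational Speculative Decoding (VSD), which formulates draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an evidence lower bound (ELBO) that promotes high-quality proposals while minimizing divergence from the target distribution. It further adds a path-level utility and optimizes with an Expectation-Maximization (EM) procedure: the E-step draws MCMC samples from an oracle-filtered posterior, and the M-step maximizes a weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theory and experiments indicate that VSD increases expected acceptance length and speedup, with up to 9.6% speedup over EAGLE-3 and 7.9% over ViSpec.
Link: https://arxiv.org/abs/2602.05774
Authors: Xiandong Zou,Jianshu Li,Jing Huang,Pan Zhou
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: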
Abstract:Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
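As background for "acceptance", the sketch below shows the standard speculative sampling test, in which a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p. This is the generic verification rule, not VSD's training objective. The toy distributions are made up, and the expected-length formula assumes every position is accepted independently with the same probability, which is a simplification.

```python
def accept(token, p, q, u):
    """Standard test: accept the draft token if u < min(1, p(token) / q(token))."""
    return u < min(1.0, p[token] / q[token])


def acceptance_probability(p, q):
    """Chance that a token drawn from q is accepted: sum over tokens of min(p, q)."""
    return sum(min(p[t], q[t]) for t in q)


def expected_accepted_length(a, k):
    """Expected number of accepted draft tokens out of k, if each is accepted
    independently with probability a and verification stops at the first rejection."""
    return sum(a ** i for i in range(1, k + 1))


# Toy target (p) and draft (q) distributions over a three-token vocabulary.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.7, "b": 0.2, "c": 0.1}
a = acceptance_probability(p, q)
length = expected_accepted_length(a, k=4)
```

The closer the draft distribution is to the target, the larger the per-token acceptance probability and the longer the accepted prefix, which is the quantity the paper's training objective is designed to increase.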
[AI-27] RL-VLA3: Reinforcement Learning VLA Accelerating via Full Asynchronism
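[Quick read]: This paper addresses the low resource utilization and limited throughput caused by synchronous execution when training Vision-Language-Action (VLA) models with reinforcement learning (RL). Existing frameworks such as RLinf can improve generalization, but because they run synchronously, environment interaction, policy generation (rollout), and model updates (actor) cannot overlap, which wastes compute. The key is a fully asynchronous policy training framework covering the whole pipeline, with a multi-level decoupled architecture: asynchronous parallelization of environment interaction and trajectory collection, streaming execution for policy generation, and decoupled scheduling for training updates. On the LIBERO benchmark it improves throughput by up to 59.25% over synchronous strategies, and by up to 126.67% with deeper optimization of separation strategies, and it scales well from 8 to 256 GPUs under most conditions.
Link: https://arxiv.org/abs/2602.05765
Authors: Zhong Guan,Haoran Sun,Yongjian Guo,Shuai Di,Xiaodong Bai,Jing Long,Tianyun Zhao,Mingxi Luo,Chen Zhou,Yucheng Guo,Qiming Yang,Wanting Xu,Wen Huang,Yunxuan Ma,Hongke Zhao,Likang Wu,Xiaotie Deng,Xi Xiao,Sheng Wen,Yicheng Gong,Junwu Xiong
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: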
Abstract:In recent years, Vision-Language-Action (VLA) models have emerged as a crucial pathway towards general embodied intelligence, yet their training efficiency has become a key bottleneck. Although existing reinforcement learning (RL)-based training frameworks like RLinf can enhance model generalization, they still rely on synchronous execution, leading to severe resource underutilization and throughput limitations during environment interaction, policy generation (rollout), and model update phases (actor). To overcome this challenge, this paper, for the first time, proposes and implements a fully-asynchronous policy training framework encompassing the entire pipeline from environment interaction, rollout generation, to actor policy updates. Systematically drawing inspiration from asynchronous optimization ideas in large model RL, our framework designs a multi-level decoupled architecture. This includes asynchronous parallelization of environment interaction and trajectory collection, streaming execution for policy generation, and decoupled scheduling for training updates. We validated the effectiveness of our method across diverse VLA models and environments. On the LIBERO benchmark, the framework achieves throughput improvements of up to 59.25% compared to existing synchronous strategies. When deeply optimizing separation strategies, throughput can be increased by as much as 126.67%. We verified the effectiveness of each asynchronous component via ablation studies. Scaling law validation across 8 to 256 GPUs demonstrates our method’s excellent scalability under most conditions.
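The general idea of decoupling rollout from training can be sketched as a producer and a consumer connected by a bounded queue, so that neither side waits for the other to finish a full phase. This is a toy single-process illustration using `asyncio`, not the paper's system; the worker names, the sentinel convention, and the batch size are assumptions.

```python
import asyncio


async def rollout_worker(queue, n_trajectories):
    """Producer: generates trajectories and streams them to the trainer."""
    for step in range(n_trajectories):
        await asyncio.sleep(0)  # stand-in for environment interaction
        await queue.put({"id": step})
    await queue.put(None)  # sentinel: no more trajectories


async def actor(queue, batch_size):
    """Consumer: forms batches as trajectories arrive, without waiting for all of them."""
    batches, batch = [], []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item["id"])
        if len(batch) == batch_size:
            batches.append(batch)  # stand-in for one policy update
            batch = []
    if batch:
        batches.append(batch)  # flush the final partial batch
    return batches


async def main():
    queue = asyncio.Queue(maxsize=4)  # bounded, so rollouts cannot run far ahead
    producer = asyncio.create_task(rollout_worker(queue, 5))
    batches = await actor(queue, batch_size=2)
    await producer
    return batches


batches = asyncio.run(main())
```

The bounded queue is the design choice worth noting: it lets the two stages overlap while limiting how stale the trajectories consumed by an update can become.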
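[AI-28] RocqSmith: Can Automatic Optimization Forge Better Proof Agents?
[Quick read]: This paper studies whether automatic AI agent optimization methods can be applied to real-world agents in formal verification, using automated theorem proving in Rocq as a representative and challenging domain. It evaluates how different automatic optimizers perform when optimizing a Rocq proof-generation agent, and whether fine-grained tuning such as prompt design, contextual knowledge, and control strategies can be automated. The key finding is that several optimizers yield measurable improvements, with simple few-shot bootstrapping the most consistently effective, but none matches the performance of a carefully engineered state-of-the-art proof agent.
Link: https://arxiv.org/abs/2602.05762
Authors: Andrei Kozyrev,Nikita Khramov,Denis Lochmelis,Valerio Morelli,Gleb Solovev,Anton Podkopaev
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments: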
Abstract:This work studies the applicability of automatic AI agent optimization methods to real-world agents in formal verification settings, focusing on automated theorem proving in Rocq as a representative and challenging domain. We evaluate how different automatic agent optimizers perform when applied to the task of optimizing a Rocq proof-generation agent, and assess whether parts of the fine-grained tuning of agentic systems, such as prompt design, contextual knowledge, and control strategies, can be automated. Our results show that while several optimizers yield measurable improvements, simple few-shot bootstrapping is the most consistently effective; however, none of the studied methods matches the performance of a carefully engineered state-of-the-art proof agent.
[AI-29] TimelyFreeze: Adaptive Parameter Freezing Mechanism for Pipeline Parallelism
[Quick read]: This paper addresses the throughput limits that pipeline bubbles impose on large-scale model training, and in particular the tendency of existing parameter-freezing methods to over-freeze parameters and degrade accuracy unnecessarily. The key is to model the pipeline schedule as a directed acyclic graph and solve a linear program for the freeze ratios that minimize batch execution time under accuracy constraints. Experiments show up to 40% higher training throughput on LLaMA-8B with comparable accuracy.
Link: https://arxiv.org/abs/2602.05754
Authors: Seonghye Cho,Jaemin Han,Hyunjin Kim,Euisoo Jung,Jae-Gil Lee
Affiliations: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:
Abstract:Pipeline parallelism enables training models that exceed single-device memory, but practical throughput remains limited by pipeline bubbles. Although parameter freezing can improve training throughput by adaptively skipping backward computation, existing methods often over-freeze parameters, resulting in unnecessary accuracy degradation. To address this issue, we propose TimelyFreeze, which models the pipeline schedule as a directed acyclic graph and solves a linear program to compute optimal freeze ratios that minimize batch execution time under accuracy constraints. Experiments show that TimelyFreeze achieves up to 40% training throughput improvement on LLaMA-8B with comparable accuracy. Overall, it enables faster large-scale model training without compromising convergence and generalizes across diverse pipeline-parallel settings.
[AI-30] LeakBoost: Perceptual-Loss-Based Membership Inference Attack
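[Quick read]: This paper addresses membership inference attacks (MIAs), which try to determine whether a sample was part of a model’s training set. Existing attacks rely mainly on static indicators such as loss or confidence and do not exploit how a model behaves when actively probed. The key is LeakBoost, a perceptual-loss-based interrogation framework: for a candidate input it synthesizes an “interrogation image” by optimizing an activation-space objective, amplifying the representational differences between members and non-members, and then passes that image to an off-the-shelf membership detector without modifying the detector. The method substantially improves attack performance at low false-positive rates across multiple image classification datasets and neural network architectures, offering a modular and computationally efficient way to assess privacy risk in white-box settings.
Link: https://arxiv.org/abs/2602.05748
Authors: Amit Kravchik Taub,Fred M. Grabovski,Guy Amit,Yisroel Mirsky
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: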
Abstract:Membership inference attacks (MIAs) aim to determine whether a sample was part of a model’s training set, posing serious privacy risks for modern machine-learning systems. Existing MIAs primarily rely on static indicators, such as loss or confidence, and do not fully leverage the dynamic behavior of models when actively probed. We propose LeakBoost, a perceptual-loss-based interrogation framework that actively probes a model’s internal representations to expose hidden membership signals. Given a candidate input, LeakBoost synthesizes an interrogation image by optimizing a perceptual (activation-space) objective, amplifying representational differences between members and non-members. This image is then analyzed by an off-the-shelf membership detector, without modifying the detector itself. When combined with existing membership inference methods, LeakBoost achieves substantial improvements at low false-positive rates across multiple image classification datasets and diverse neural network architectures. In particular, it raises AUC from near-chance levels (0.53-0.62) to 0.81-0.88, and increases TPR at 1 percent FPR by over an order of magnitude compared to strong baseline attacks. A detailed sensitivity analysis reveals that deeper layers and short, low-learning-rate optimization produce the strongest leakage, and that improvements concentrate in gradient-based detectors. LeakBoost thus offers a modular and computationally efficient way to assess privacy risks in white-box settings, advancing the study of dynamic membership inference.
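The headline metric, true-positive rate at a fixed false-positive rate, can be computed from raw attack scores as sketched below. The scores are made up for illustration, the function name is hypothetical, and higher scores are assumed to mean "more likely a member".

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr):
    """Best true-positive rate over thresholds whose false-positive rate is <= max_fpr."""
    best = 0.0
    for t in sorted(set(member_scores) | set(nonmember_scores)):
        # A sample is flagged as a member when its score is at least the threshold.
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Hypothetical attack scores for known members and non-members.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.2, 0.1, 0.05]

strict = tpr_at_fpr(members, nonmembers, max_fpr=0.0)
relaxed = tpr_at_fpr(members, nonmembers, max_fpr=0.25)
```

Reporting TPR at a low FPR, rather than AUC alone, emphasizes how many members an attacker can identify while almost never accusing a non-member.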
[AI-31] Learning to Inject: Automated Prompt Injection via Reinforcement Learning
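【Quick Read】: This paper addresses the automation and optimization of prompt injection attacks against Large Language Model (LLM) agents. Existing methods depend on human red-teamers and hand-crafted prompts, which limits their scalability and adaptability. The key to the solution is AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and for utility preservation on benign tasks. The method is black-box, requiring no access to model internals, and supports both query-based optimization and transfer attacks on unseen models and tasks.
Link: https://arxiv.org/abs/2602.05746
Authors: Xin Chen,Jie Zhang,Florian Tramer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: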
Abstract:Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
[AI-32] CSRv2: Unlocking Ultra-Sparse Embeddings ICLR2026
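【Quick Read】: This paper addresses the severe quality degradation of ultra-sparse embeddings caused by large numbers of dead (inactive) neurons, which keeps their storage, memory, and inference-latency advantages from being realized in practice. The key to the solution is CSRv2, whose core components are progressive k-annealing to stabilize sparsity learning, supervised contrastive objectives to improve representation quality, and full backbone finetuning for end-to-end adaptability. The method reduces the share of dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, so that ultra-sparse embeddings remain competitive while yielding up to 300x improvements in compute and memory efficiency relative to dense embeddings.
Link: https://arxiv.org/abs/2602.05735
Authors: Lixuan Guo,Yifei Wang,Tiansheng Wen,Yifan Wang,Aosong Feng,Bo Chen,Stefanie Jegelka,Chenyu You
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Information Theory (cs.IT)
Comments: Accepted by ICLR2026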
Abstract:In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these, Contrastive Sparse Representation (CSR) is recently proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL, and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance, where CSRv2 achieves 7%/4% improvement over CSR when k=4 and further increases this gap to 14%/6% when k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.
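To make "k-sparse vectors" concrete, the following is a minimal Python sketch of what a k-sparse embedding looks like and why it is cheap to compare. It is only an illustration under stated assumptions: selecting entries by largest magnitude is an assumption rather than CSRv2's training procedure, and the helper names (`top_k_sparsify`, `to_sparse`, `sparse_dot`) are hypothetical.

```python
def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest.

    Assumption: magnitude-based selection is used purely for illustration.
    """
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def to_sparse(vec):
    """Store only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(vec) if v != 0.0}


def sparse_dot(a, b):
    """Dot product of two {index: value} maps; cost scales with active entries."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)


dense = [0.1, -2.0, 0.5, 1.5]          # hypothetical high-dimensional code
sparse = to_sparse(top_k_sparsify(dense, k=2))
```

With k=2, each vector is stored as two (index, value) pairs regardless of its nominal dimensionality, which is where the storage and compute savings described in the abstract come from.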
[AI-33] Evaluating the impact of word embeddings on similarity scoring in practical information retrieval
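【Quick Read】: This paper addresses inaccurate semantic similarity scoring between query statements and response statements in Information Retrieval (IR) systems, particularly where synonymy and polysemy make users' language complex. Conventional approaches usually compare the centroids of neural word embeddings, which fails to capture word-level semantic relations. The key to the solution is a fine-grained similarity measure motivated by Word Mover's Distance (WMD), which scores semantic distance by the minimum "travel cost" between the individual words of the query and the response rather than by a single averaged vector. Experiments show that combining WMD with pre-trained word embeddings such as GloVe ranks results markedly more accurately than Doc2Vec, LSA, and other retrieval models, indicating that general-purpose embeddings trained on large corpora support domain-agnostic semantic matching.
Link: https://arxiv.org/abs/2602.05734
Authors: Niall McCarroll,Kevin Curran,Eugene McNamee,Angela Clist,Andrew Brammer
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: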
Abstract:Search behaviour is characterised using synonymy and polysemy as users often want to search information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high dimensional spaces. This can be leveraged by Information Retrieval (IR) systems to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query statement similarity that moves away from the common similarity measure of centroids of neural word embeddings. Motivated by the Word Movers Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, result in domain agnostic language processing solutions that are portable to diverse business use-cases.
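The contrast between centroid similarity and word-level distance can be sketched in a few lines of Python. This is a hedged illustration, not the paper's pipeline: it uses the relaxed, one-directional variant of WMD (each query word simply travels to its nearest statement word, which gives a lower bound on full WMD with uniform word weights), and `TOY_EMB` is a hypothetical two-dimensional embedding table standing in for GloVe vectors.

```python
import math

# Hypothetical toy embeddings; real systems would load pre-trained vectors.
TOY_EMB = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
    "fast": [0.7, 0.7],
}


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(words, emb):
    """Average of the word vectors (the common baseline representation)."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[w][d] for w in words) / len(words) for d in range(dim)]


def centroid_distance(query, statement, emb):
    """Distance between the averaged vectors of the two texts."""
    return euclid(centroid(query, emb), centroid(statement, emb))


def relaxed_wmd(query, statement, emb):
    """Relaxed WMD: mean distance from each query word to its nearest
    statement word. A lower bound on full WMD under uniform weights."""
    return sum(min(euclid(emb[w], emb[x]) for x in statement) for w in query) / len(query)
```

For example, `relaxed_wmd(["car"], ["automobile"], TOY_EMB)` is much smaller than `relaxed_wmd(["car"], ["banana"], TOY_EMB)`. Full WMD additionally solves an optimal transport problem so that each statement word can only absorb as much "mass" as it carries, which this relaxation ignores.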
[AI-34] Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification ICASSP2026
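【Quick Read】: This paper addresses inconsistency between generated responses and retrieved documents in financial Retrieval-Augmented Generation (RAG) systems, in particular hallucinations that contradict the retrieved information. The key to the solution is a reinforcement learning framework with fine-grained knowledge verification (Reinforcement Learning with Fine-grained Knowledge Verification, RLFKV): each financial response is decomposed into atomic knowledge units, and the correctness of every unit is checked to compute a fine-grained faithful reward. This gives a more precise optimization signal and improves alignment with the retrieved documents. An informativeness reward is added to prevent reward hacking through overly concise replies, so that responses stay faithful without losing knowledge density.
Link: https://arxiv.org/abs/2602.05723
Authors: Taoye Yin,Haoyuan Hu,Yaxin Fan,Xinhao Chen,Xinya Wu,Kai Deng,Kezun Zhang,Feng Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: accepted by ICASSP 2026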
Abstract:In financial Retrieval-Augmented Generation (RAG) systems, models frequently rely on retrieved documents to generate accurate responses due to the time-sensitive nature of the financial domain. While retrieved documents help address knowledge gaps, model-generated responses still suffer from hallucinations that contradict the retrieved information. To mitigate this inconsistency, we propose a Reinforcement Learning framework enhanced with Fine-grained Knowledge Verification (RLFKV). Our method decomposes financial responses into atomic knowledge units and assesses the correctness of each unit to compute the fine-grained faithful reward. This reward offers more precise optimization signals, thereby improving alignment with the retrieved documents. Additionally, to prevent reward hacking (e.g., overly concise replies), we incorporate an informativeness reward that encourages the policy model to retain at least as many knowledge units as the base model. Experiments conducted on the public Financial Data Description (FDD) task and our newly proposed FDD-ANT dataset demonstrate consistent improvements, confirming the effectiveness of our approach.
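The interplay of the two rewards can be sketched as follows. This is a minimal sketch under loud assumptions: the abstract does not give the reward formulas, so the "fraction of verified units" form, the binary informativeness term, the additive combination, and the `weight` parameter are all hypothetical choices made for illustration. In a real system the per-unit verdicts would come from a verifier checking each atomic unit against the retrieved documents.

```python
def faithfulness_reward(unit_verdicts):
    """Fraction of atomic knowledge units judged consistent with the
    retrieved documents. Assumed form; the paper may define it differently."""
    if not unit_verdicts:
        return 0.0
    return sum(1 for ok in unit_verdicts if ok) / len(unit_verdicts)


def informativeness_reward(n_policy_units, n_base_units):
    """1.0 if the policy response keeps at least as many knowledge units as
    the base model's response, else 0.0 (assumed binary form)."""
    return 1.0 if n_policy_units >= n_base_units else 0.0


def total_reward(unit_verdicts, n_base_units, weight=1.0):
    """Assumed additive combination of the two terms."""
    return faithfulness_reward(unit_verdicts) + weight * informativeness_reward(
        len(unit_verdicts), n_base_units
    )


# A one-unit reply is perfectly faithful but loses the informativeness term.
terse = total_reward([True], n_base_units=3)
# A four-unit reply with one unsupported unit still scores higher overall.
detailed = total_reward([True, True, False, True], n_base_units=3)
```

The comparison shows why a faithfulness-only reward invites overly concise replies: saying almost nothing is the easiest way to avoid contradicting the documents, and the second term removes that shortcut.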
[AI-35] Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
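【Quick Read】: This paper addresses a systemic pathology in Reinforcement Learning with Verifiable Rewards (RLVR) termed Recursive Space Contraction (RSC): the combined effect of positive sharpening and negative squeezing drives the sampling probability of valid alternative solutions toward zero. KL-divergence regularization tries to mitigate this, but it imposes a global Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening that correctness requires. The paper proposes Anchored Policy Optimization (APO), which shifts the objective from global Shape Matching to Support Coverage: a Safe Manifold is defined from the reference model's high-confidence support, which permits aggressive sharpening for efficiency while a restorative force is invoked selectively during error correction to prevent collapse. The theoretical analysis shows that APO acts as a gradient-aligned mechanism that maximizes support coverage and enables Elastic Recovery. On mathematical benchmarks it breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity that standard policy gradient methods typically lose.
Link: https://arxiv.org/abs/2602.05717
Authors: Tianyi Wang,Long Li,Hongcan Guo,Yibiao Chen,Yixia Li,Yong Wang,Yun Chen,Guanhua Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures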
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model’s full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model’s high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
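Pass@1 and Pass@K are the two metrics behind the accuracy-diversity trade-off mentioned in the abstract. The sketch below shows the widely used unbiased estimator, which gives the probability that at least one of k samples drawn without replacement from n generated samples is correct when c of the n are correct. Whether the paper uses exactly this estimator is an assumption; the function name `pass_at_k` is chosen for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: how many were correct,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaged over a benchmark, a policy that has collapsed onto a single answer per problem gains nothing from a larger k: its Pass@K stays close to its Pass@1. A policy that keeps several valid branches alive sees Pass@K rise with k, which is the diversity the abstract says is restored.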
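[AI-36] Towards Green AI: Decoding the Energy of LLM Inference in Software Development
【Quick Read】: This paper addresses the sustainability problem posed by the high energy consumption of Large Language Model (LLM) inference in AI-assisted software development. Its central findings are that inference energy is dominated by the decoding phase and that a higher prefill cost amplifies the per-token energy cost of decoding, by 1.3% to 51.8% depending on the model. In addition, some models exhibit "babbling" behavior, appending excessive content to the output and wasting energy. The key to the solution is babbling suppression, which saves 44% to 89% of the energy for code generation without affecting generation accuracy.
Link: https://arxiv.org/abs/2602.05712
Authors: Lola Solovyeva,Fernando Castor
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: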
Abstract:Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between the (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated using the stored state. Method: We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on code-centric benchmarks HumanEval for code generation and LongBench for code understanding. Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across phases. Furthermore, we observed that increases in prefill cost amplify the energy cost per token during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three out of ten models demonstrate babbling behavior, adding excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting impact of prefill on decoding.
[AI-37] Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions
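【Quick Read】: This paper addresses the substantial parameter growth that standard Low-Rank Adaptation (LoRA) incurs when model capacity is increased. Standard LoRA follows an explicit-rank paradigm: capacity is extended by adding rows or columns (that is, basis vectors) to the low-rank matrices, and these basis vectors turn out to be highly redundant. The key to the solution is the Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis-vector storage with lightweight nonlinear generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and uses a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF needs far fewer parameters than an explicit basis vector, so GenLoRA reaches higher effective ranks and better fine-tuning performance under smaller parameter budgets.
Link: https://arxiv.org/abs/2602.05709
Authors: Yihao Ouyang,Shiwei Li,Haozhao Wang,Xiandi Luo,Zhuoqi Hu,Yuetong Song,Qiyu Qin,Yichen Li,Ruixuan Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: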
Abstract:Low-rank adaptation (LoRA) approximates the update of a pretrained weight matrix using the product of two low-rank matrices. However, standard LoRA follows an explicit-rank paradigm, where increasing model capacity requires adding more rows or columns (i.e., basis vectors) to the low-rank matrices, leading to substantial parameter growth. In this paper, we find that these basis vectors exhibit significant parameter redundancy and can be compactly represented by lightweight nonlinear functions. Therefore, we propose Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis vector storage with nonlinear basis vector generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and employs a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF requires far fewer parameters than an explicit basis vector, enabling higher parameter efficiency in GenLoRA. Extensive experiments across multiple datasets and architectures show that GenLoRA attains higher effective LoRA ranks under smaller parameter budgets, resulting in superior fine-tuning performance. The code is available at this https URL.
[AI-38] Determining Energy Efficiency Sweet Spots in Production LLM Inference
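【Quick Read】: This paper addresses inaccurate energy estimation for Large Language Model (LLM) inference. Existing approaches model energy as a simple linear function of input and output sequence lengths, which misses the non-linear efficiency behavior seen in practice. The key to the solution is an analytical model derived from the computational and memory-access complexity of the Transformer architecture, which characterizes the efficiency curve as a function of input and output lengths. Validated with TensorRT-LLM on NVIDIA H100 GPUs across several mainstream LLMs, it reaches a mean absolute percentage error (MAPE) of 1.79%. The model exposes energy-efficiency "Sweet Spots": efficiency peaks for short-to-moderate inputs combined with medium-length outputs, which gives a principled basis for truncation, summarization, and adaptive generation strategies in production systems.
Link: https://arxiv.org/abs/2602.05695
Authors: Hiari Pizzini Cavagna,Andrea Proia,Giacomo Madella,Giovanni B. Esposito,Francesco Antici,Daniele Cesarini,Zeynep Kiziltan,Andrea Bartolini
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: To appear at ICPE 2026 (International Conference on Performance Engineering)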
Abstract:Large Language Models (LLMs) inference is central in modern AI applications, making it critical to understand their energy footprint. Existing approaches typically estimate energy consumption through simple linear functions of input and output sequence lengths, yet our observations reveal clear Energy Efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture, capable of accurately characterizing the efficiency curve as a function of input and output lengths. To assess its accuracy, we evaluate energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency “Sweet Spots” can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
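The reported accuracy figure is a mean absolute percentage error (MAPE). For readers unfamiliar with the metric, here is a small Python sketch of how it is computed when comparing measured and predicted energy values. The function name `mape` and the sample numbers are hypothetical and are not taken from the paper.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent.

    measured: ground-truth values (must be non-zero), e.g. joules per request.
    predicted: model estimates for the same configurations.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(m - p) / abs(m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Hypothetical measurements for two (input length, output length) settings.
error = mape([100.0, 200.0], [110.0, 190.0])
```

Because each error is divided by the measured value, MAPE weights short, cheap requests and long, expensive ones equally, which suits an evaluation that sweeps sequence lengths from 64 to 4096 tokens.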
[AI-39] Mining Generalizable Activation Functions
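【Quick Read】: This paper addresses how to discover, automatically, new activation functions that perform better and encode a particular inductive bias. Traditional approaches depend on manually constructed search spaces, which restricts what can be explored. The key to the solution is an evolutionary search pipeline built on frontier large language models (such as AlphaEvolve), in which the LLM serves as the mutation operator. This allows searching over all possible Python functions within a given FLOP budget, removes the need for a hand-built search space, and naturally biases the search toward meaningful activation functions. The framework can also use performance on out-of-distribution data as the fitness function, steering the search toward activations that encode specific inductive biases.
Link: https://arxiv.org/abs/2602.05688
Authors: Alex Vitvitskyi,Michael Boratko,Matej Grcic,Razvan Pascanu,Deep Shah,Petar Veličković
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: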
Abstract:The choice of activation function is an active area of research, with different proposals aimed at improving optimization, while maintaining expressivity. Additionally, the activation function can significantly alter the implicit inductive bias of the architecture, controlling its non-linear behavior. In this paper, in line with previous work, we argue that evolutionary search provides a useful framework for finding new activation functions, while we also make two novel observations. The first is that modern pipelines, such as AlphaEvolve, which relies on frontier LLMs as a mutator operator, allows for a much wider and flexible search space; e.g., over all possible python functions within a certain FLOP budget, eliminating the need for manually constructed search spaces. In addition, these pipelines will be biased towards meaningful activation functions, given their ability to represent common knowledge, leading to a potentially more efficient search of the space. The second observation is that, through this framework, one can target not only performance improvements but also activation functions that encode particular inductive biases. This can be done by using performance on out-of-distribution data as a fitness function, reflecting the degree to which the architecture respects the inherent structure in the data in a manner independent of distribution shifts. We carry an empirical exploration of this proposal and show that relatively small scale synthetic datasets can be sufficient for AlphaEvolve to discover meaningful activations.
[AI-40] Exploring AI-Augmented Sensemaking of Patient-Generated Health Data: A Mixed-Method Study with Healthcare Professionals in Cardiac Risk Reduction
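【Quick Read】: This paper addresses the difficulty of integrating Patient-Generated Health Data (PGHD) into clinical practice, which stems from the scale and heterogeneity of the data and from the time pressure and limited data literacy of healthcare professionals (HCPs). The key to the solution is to use Large Language Models (LLMs) for automated PGHD summaries and natural-language data exploration: a prototype combining common charts, LLM-generated summaries, and a conversational interface supports HCPs' sensemaking of multimodal PGHD in the use case of cardiovascular disease risk reduction. The study finds that AI summaries gave quick overviews that anchored exploration, while conversational interaction supported flexible analysis and bridged data-literacy gaps; it also surfaced concerns about transparency, privacy, and overreliance.
Link: https://arxiv.org/abs/2602.05687
Authors: Pavithren V S Pakianathan,Rania Islambouli,Diogo Branco,Albrecht Schmidt,Tiago Guerreiro,Jan David Smeddinck
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: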
Abstract:Individuals are increasingly generating substantial personal health and lifestyle data, e.g. through wearables and smartphones. While such data could transform preventative care, its integration into clinical practice is hindered by its scale, heterogeneity and the time pressure and data literacy of healthcare professionals (HCPs). We explore how large language models (LLMs) can support sensemaking of patient-generated health data (PGHD) with automated summaries and natural language data exploration. Using cardiovascular disease (CVD) risk reduction as a use case, 16 HCPs reviewed multimodal PGHD in a mixed-methods study with a prototype that integrated common charts, LLM-generated summaries, and a conversational interface. Findings show that AI summaries provided quick overviews that anchored exploration, while conversational interaction supported flexible analysis and bridged data-literacy gaps. However, HCPs raised concerns about transparency, privacy, and overreliance. We contribute empirical insights and sociotechnical design implications for integrating AI-driven summarization and conversation into clinical workflows to support PGHD sensemaking.
[AI-41] HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection
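【Quick Read】: This paper addresses the performance bottleneck in audio deepfake detection that comes from ignoring high-order interactions (HOIs). Most existing methods rely on local temporal/spectral features or pairwise relations and struggle to capture the discriminative patterns that emerge when several feature components act together. The key to the solution is HyperPotter, a hypergraph-based framework that models these high-order interactions explicitly through clustering-based hyperedges and uses class-aware prototype initialization, improving generalization across diverse attacks and speakers.
Link: https://arxiv.org/abs/2602.05670
Authors: Qing Wen,Haohao Li,Zhongjie Ba,Peng Cheng,Miao He,Li Lu,Kui Ren
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 20 pages, 8 figures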
Abstract:Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
[AI-42] Stable but Wrong: When More Data Degrades Scientific Conclusions
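【Quick Read】: This paper examines the following problem in data-driven scientific inference: even as observational datasets keep growing and standard inference procedures converge smoothly, remain well calibrated, and pass conventional diagnostic checks, none of these signals guarantees that the conclusions are correct. In particular, when the reliability of observations degrades through a mechanism the inference process cannot observe, additional data amplifies the error instead of correcting it. The key point of the proposed remedy is that data volume and conventional statistical validation are not sufficient for epistemic validity; inference must be governed by explicit constraints on the integrity of the observational process, moving from pure dependence on data toward active control over the trustworthiness of the observation mechanism.
Link: https://arxiv.org/abs/2602.05668
Authors: Zhipeng Zhang,Kai Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: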
Abstract:Modern science increasingly relies on ever-growing observational datasets and automated inference pipelines, under the implicit belief that accumulating more data makes scientific conclusions more reliable. Here we show that this belief can fail in a fundamental and irreversible way. We identify a structural regime in which standard inference procedures converge smoothly, remain well calibrated, and pass conventional diagnostic checks, yet systematically converge to incorrect conclusions. This failure arises when the reliability of observations degrades in a manner that is intrinsically unobservable to the inference process itself. Using minimal synthetic experiments, we demonstrate that in this regime additional data do not correct error but instead amplify it, while residual-based and goodness-of-fit diagnostics remain misleadingly normal. These results reveal an intrinsic limit of data-driven science: stability, convergence, and confidence are not sufficient indicators of epistemic validity. We argue that inference cannot be treated as an unconditional consequence of data availability, but must instead be governed by explicit constraints on the integrity of the observational process.
[AI-43] Graph-based Agent Memory: Taxonomy Techniques and Applications
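【Quick Read】: This survey addresses how Large Language Model (LLM)-based agents can build and manage memory effectively for long-horizon complex tasks, so as to support knowledge accumulation, iterative reasoning, and self-evolution. Its core idea is to organize agent memory as a graph, exploiting the graph's intrinsic ability to model relational dependencies, organize hierarchical information, and support efficient retrieval. The survey systematically covers the full memory life cycle (extraction, storage, retrieval, and evolution) and summarizes the open-source libraries and benchmarks that support the development and evaluation of self-evolving agent memory.
Link: https://arxiv.org/abs/2602.05665
Authors: Chang Yang,Chuang Zhou,Yilin Xiao,Su Dong,Luyao Zhuang,Yujing Zhang,Zhu Wang,Zijin Hong,Zheng Yuan,Zhishang Xiang,Shengyuan Chen,Huachi Zhou,Qinggang Zhang,Ninghao Liu,Jinsong Su,Xinrun Wang,Yi Chang,Xiao Huang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: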
Abstract:Memory emerges as the core module in the Large Language Model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. Among diverse paradigms, graph stands out as a powerful structure for agent memory due to the intrinsic capabilities to model relational dependencies, organize hierarchical information, and support efficient retrieval. This survey presents a comprehensive review of agent memory from the graph-based perspective. First, we introduce a taxonomy of agent memory, including short-term vs. long-term memory, knowledge vs. experience memory, non-structural vs. structural memory, with an implementation view of graph-based memory. Second, according to the life cycle of agent memory, we systematically analyze the key techniques in graph-based agent memory, covering memory extraction for transforming the data into the contents, storage for organizing the data efficiently, retrieval for retrieving the relevant contents from memory to support reasoning, and evolution for updating the contents in the memory. Third, we summarize the open-sourced libraries and benchmarks that support the development and evaluation of self-evolving agent memory. We also explore diverse application scenarios. Finally, we identify critical challenges and future research directions. This survey aims to offer actionable insights to advance the development of more efficient and reliable graph-based agent memory systems. All the related resources, including research papers, open-source data, and projects, are collected for the community in this https URL.
[AI-44] Probabilistic Multi-Regional Solar Power Forecasting with Any-Quantile Recurrent Neural Networks
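【Quick Read】: This paper addresses the uncertainty that large-scale photovoltaic (PV) generation introduces into power system operation, for which traditional deterministic point forecasts are insufficient in renewable-dominated grids that need probabilistic, robust predictions. The core of the solution is a multi-regional probabilistic PV power forecasting framework based on the Any-Quantile Recurrent Neural Network (AQ-RNN). It combines an any-quantile forecasting paradigm with a dual-track recurrent architecture that jointly processes series-specific and cross-regional contextual information, supported by dilated recurrent cells, patch-based temporal modeling, and a dynamic ensemble mechanism. A single trained model can therefore estimate calibrated conditional quantiles at arbitrary probability levels, and it exploits spatial dependencies to improve robustness at the system level.
Link: https://arxiv.org/abs/2602.05660
Authors: Slawek Smyl,Paweł Pełka,Grzegorz Dudek
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: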
Abstract:The increasing penetration of photovoltaic (PV) generation introduces significant uncertainty into power system operation, necessitating forecasting approaches that extend beyond deterministic point predictions. This paper proposes an any-quantile probabilistic forecasting framework for multi-regional PV power generation based on the Any-Quantile Recurrent Neural Network (AQ-RNN). The model integrates an any-quantile forecasting paradigm with a dual-track recurrent architecture that jointly processes series-specific and cross-regional contextual information, supported by dilated recurrent cells, patch-based temporal modeling, and a dynamic ensemble mechanism. The proposed framework enables the estimation of calibrated conditional quantiles at arbitrary probability levels within a single trained model and effectively exploits spatial dependencies to enhance robustness at the system level. The approach is evaluated using 30 years of hourly PV generation data from 259 European regions and compared against established statistical and neural probabilistic baselines. The results demonstrate consistent improvements in forecast accuracy, calibration, and prediction interval quality, underscoring the suitability of the proposed method for uncertainty-aware energy management and operational decision-making in renewable-dominated power systems.
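Quantile forecasts of this kind are typically trained and scored with the pinball (quantile) loss. The Python sketch below shows that loss for a single prediction and its average over several probability levels. It illustrates the standard loss only, not the AQ-RNN training code; the function names are hypothetical, and the assumption that the quantile level is supplied alongside each prediction is made for illustration.

```python
def pinball_loss(y, q_hat, tau):
    """Pinball loss of predicting q_hat as the tau-quantile when y is observed.

    Under-prediction is penalized by tau, over-prediction by (1 - tau).
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    diff = y - q_hat
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball_loss(y, quantile_forecasts):
    """Average pinball loss over a {tau: q_hat} map of quantile forecasts."""
    return sum(pinball_loss(y, q, tau) for tau, q in quantile_forecasts.items()) / len(
        quantile_forecasts
    )


# Hypothetical forecast for one hour of PV output (arbitrary units).
loss = mean_pinball_loss(10.0, {0.1: 7.0, 0.5: 10.0, 0.9: 12.0})
```

The asymmetry is what makes the minimizer a quantile: for tau = 0.9 an under-prediction costs nine times as much as an over-prediction of the same size, so the optimal forecast sits where about 90% of outcomes fall below it.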
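[AI-45] Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation
【Quick Read】: This paper examines how far latent alignment properties of Large Language Models (LLMs) can be inferred from finite behavioral observations such as benchmarks, red-teaming suites, or automated pipelines. Common practice treats observed compliance as direct evidence of alignment, but that inference step is usually implicit and rarely analyzed rigorously. The paper formalizes alignment evaluation as an identifiability question under partial observability and introduces the Alignment Verifiability Problem and the notion of Normative Indistinguishability, which captures when distinct latent alignment hypotheses induce identical distributions over everything the evaluator can observe. The main result is a negative but sharply delimited identifiability theorem: under finite behavioral evaluation and evaluation-aware agents, observed behavioral compliance does not uniquely identify latent alignment, so even idealized behavioral evaluation cannot in general certify alignment as a latent property. Behavioral alignment tests should therefore be read as estimators of indistinguishability classes rather than verifiers of alignment: stricter tests may shrink the space of compatible hypotheses but cannot collapse it to a single point under the stated conditions. Alignment benchmarks thus provide upper bounds on observable compliance within an evaluation regime, not guarantees of underlying alignment.
Link: https://arxiv.org/abs/2602.05656
Authors: Igor Santos-Grueiro
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages. Theoretical analysis of behavioral alignment evaluation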
Abstract:Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In practice, alignment is inferred from performance under finite evaluation protocols - benchmarks, red-teaming suites, or automated pipelines - and observed compliance is often treated as evidence of underlying alignment. This inference step, from behavioral evidence to claims about latent alignment properties, is typically implicit and rarely analyzed as an inference problem in its own right. We study this problem formally. We frame alignment evaluation as an identifiability question under partial observability and allow agent behavior to depend on information correlated with the evaluation regime. Within this setting, we introduce the Alignment Verifiability Problem and the notion of Normative Indistinguishability, capturing when distinct latent alignment hypotheses induce identical distributions over all evaluator-accessible signals. Our main result is a negative but sharply delimited identifiability theorem. Under finite behavioral evaluation and evaluation-aware agents, observed behavioral compliance does not uniquely identify latent alignment. That is, even idealized behavioral evaluation cannot, in general, certify alignment as a latent property. We further show that behavioral alignment tests should be interpreted as estimators of indistinguishability classes rather than verifiers of alignment. Passing increasingly stringent tests may reduce the space of compatible hypotheses, but cannot collapse it to a singleton under the stated conditions. This reframes alignment benchmarks as providing upper bounds on observable compliance within a regime, rather than guarantees of underlying alignment.
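[AI-46] Reactive Knowledge Representation and Asynchronous Reasoning
【Quick Read】: This paper addresses the prohibitive cost of exact inference in complex probabilistic models, especially for autonomous agents in dynamic environments that need frequent, real-time belief updates. Existing methods are inefficient for ongoing reasoning because they re-evaluate the entire model on every change and do not exploit the fact that real-world information streams update at heterogeneous rates. The key to the solution is a reactive, asynchronous approach to probabilistic reasoning. The paper first introduces Resin (Reactive Signal Inference), a probabilistic programming language that merges probabilistic logic with reactive programming. To give Resin efficient and exact semantics, it then proposes Reactive Circuits (RCs): a meta-structure over Algebraic Circuits and asynchronous data streams, realized as time-dynamic directed acyclic graphs that adapt their own structure to the volatility of the input signals. By partitioning computation according to the estimated frequency of change and memoizing sub-problems independently, RCs re-evaluate only the parts of the model affected by new information. In high-fidelity drone swarm simulations this yields speedups of several orders of magnitude and supports low-latency, reactive real-time reasoning.
Link: https://arxiv.org/abs/2602.05625
Authors: Simon Kohaut,Benedict Flade,Julian Eggert,Kristian Kersting,Devendra Singh Dhami
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: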
Abstract:Exact inference in complex probabilistic models often incurs prohibitive computational costs. This challenge is particularly acute for autonomous agents in dynamic environments that require frequent, real-time belief updates. Existing methods are often inefficient for ongoing reasoning, as they re-evaluate the entire model upon any change, failing to exploit that real-world information streams have heterogeneous update rates. To address this, we approach the problem from a reactive, asynchronous, probabilistic reasoning perspective. We first introduce Resin (Reactive Signal Inference), a probabilistic programming language that merges probabilistic logic with reactive programming. Furthermore, to provide efficient and exact semantics for Resin, we propose Reactive Circuits (RCs). Formulated as a meta-structure over Algebraic Circuits and asynchronous data streams, RCs are time-dynamic Directed Acyclic Graphs that autonomously adapt themselves based on the volatility of input signals. In high-fidelity drone swarm simulations, our approach achieves several orders of magnitude of speedup over frequency-agnostic inference. We demonstrate that RCs’ structural adaptations successfully capture environmental dynamics, significantly reducing latency and facilitating reactive real-time reasoning. By partitioning computations based on the estimated Frequency of Change in the asynchronous inputs, large inference tasks can be decomposed into individually memoized sub-problems. This ensures that only the specific components of a model affected by new information are re-evaluated, drastically reducing redundant computation in streaming contexts.
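The idea of re-evaluating only the sub-problems whose inputs changed can be sketched in a few lines. This is not Resin or a Reactive Circuit; the `Memoized` class and the signal names are hypothetical, and the partition into a slow and a fast group is fixed by hand rather than derived from an estimated Frequency of Change.

```python
class Memoized:
    """A sub-computation that is re-evaluated only when one of its inputs changed."""

    def __init__(self, fn, inputs):
        self.fn = fn
        self.inputs = inputs   # names of the signals this node reads
        self.cache = None      # last computed value
        self.seen = None       # input values the cached result was computed from
        self.evaluations = 0   # how often fn actually ran

    def value(self, signals):
        current = tuple(signals[name] for name in self.inputs)
        if current != self.seen:
            self.cache = self.fn(*current)
            self.seen = current
            self.evaluations += 1
        return self.cache


# Hypothetical partition: two slowly changing signals, one fast one.
slow = Memoized(lambda a, b: a * b, ["p_map", "p_weather"])
fast = Memoized(lambda c: 1.0 - c, ["p_obstacle"])


def query(signals):
    # Product of the two memoized factors.
    return slow.value(signals) * fast.value(signals)


signals = {"p_map": 0.9, "p_weather": 0.8, "p_obstacle": 0.1}
results = []
for p in (0.1, 0.2, 0.3, 0.4):   # only the fast signal is updated
    signals["p_obstacle"] = p
    results.append(query(signals))
```

After the four updates, `slow.evaluations` is 1 and `fast.evaluations` is 4: the slow factor is computed once and then served from its cache.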
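[AI-47] Mode-Dependent Rectification for Stable PPO Training
[Quick read]: This paper addresses policy mismatch, distributional drift, and reward collapse in visual reinforcement learning. These arise when mode-dependent components (layers such as Batch Normalization or dropout that behave differently in training and evaluation) destabilize on-policy optimization such as Proximal Policy Optimization (PPO). The key is Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes and extends naturally to other mode-dependent layers.
Link: https://arxiv.org/abs/2602.05619
Authors: Mohamad Mohamad,Francesco Ponzio,Xavier Descombes
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: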
Abstract:Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
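The train/eval discrepancy that the abstract attributes to Batch Normalization can be seen in a toy calculation. The sketch below does not implement MDR; it only shows that the same inputs are normalized differently when batch statistics (training mode) or running statistics (evaluation mode) are used. The running statistics and the batch values are assumed numbers.

```python
from statistics import fmean, pvariance


def batch_norm(xs, mean, var, eps=1e-5):
    # Normalization step of Batch Normalization (no learned scale or shift).
    return [(x - mean) / (var + eps) ** 0.5 for x in xs]


running_mean, running_var = 0.0, 1.0   # assumed running statistics
batch = [2.0, 4.0, 6.0, 8.0]           # assumed activations for one batch

# Training mode: normalize with the statistics of the current batch.
train_out = batch_norm(batch, fmean(batch), pvariance(batch))
# Evaluation mode: normalize with the stored running statistics.
eval_out = batch_norm(batch, running_mean, running_var)

# Largest per-element difference between the two modes.
gap = max(abs(a - b) for a, b in zip(train_out, eval_out))
```

When the policy that collects rollouts runs in one mode and the policy that is optimized runs in the other, a gap like this feeds directly into the action probabilities that PPO compares.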
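[AI-48] Path-Guided Flow Matching for Dataset Distillation
[Quick read]: This paper addresses the slow sampling, trajectory instability, and weak downstream generalization of diffusion-based generative dataset distillation, especially under strong control or low IPC (images per class). The key is Path-Guided Flow Matching (PGFM), described as the first flow matching-based framework for generative distillation. PGFM performs class-conditional flow matching in the latent space of a frozen VAE to learn the transport from Gaussian noise to the data distribution, and adds a continuous path-to-prototype guidance algorithm for ODE-consistent trajectory control, so that synthesized samples land reliably on their assigned prototypes while preserving diversity and efficiency. On high-resolution benchmarks PGFM matches or surpasses prior diffusion-based methods with fewer sampling steps, e.g., 7.6× more efficient than diffusion-based counterparts with 78% mode coverage.
Link: https://arxiv.org/abs/2602.05616
Authors: Xuhui Li,Zhengquan Luo,Xiwei Liu,Yongqiang Yu,Zhiqiang Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: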
Abstract:Dataset distillation compresses large datasets into compact synthetic sets with comparable performance in training models. Despite recent progress on diffusion-based distillation, this type of method typically depends on heuristic guidance or prototype assignment, which comes with time-consuming sampling and trajectory instability and thus hurts downstream generalization especially under strong control or low IPC. We propose Path-Guided Flow Matching (PGFM), the first flow matching-based framework for generative distillation, which enables fast deterministic synthesis by solving an ODE in a few steps. PGFM conducts flow matching in the latent space of a frozen VAE to learn class-conditional transport from Gaussian noise to data distribution. Particularly, we develop a continuous path-to-prototype guidance algorithm for ODE-consistent path control, which allows trajectories to reliably land on assigned prototypes while preserving diversity and efficiency. Extensive experiments across high-resolution benchmarks demonstrate that PGFM matches or surpasses prior diffusion-based distillation approaches with fewer steps of sampling while delivering competitive performance with remarkably improved efficiency, e.g., 7.6× more efficient than the diffusion-based counterparts with 78% mode coverage.
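A toy version of few-step deterministic ODE sampling toward a prototype is sketched below. It is not PGFM: the velocity field is the closed-form velocity of a straight-line path to a single assumed prototype, whereas a real system would use a learned, class-conditional network in a VAE latent space. The function names and numbers are hypothetical.

```python
def velocity(x, t, prototype):
    # Velocity of the straight-line path x_t = (1 - t) * x0 + t * prototype,
    # written as a function of the current point x and time t < 1.
    return [(p - xi) / (1.0 - t) for xi, p in zip(x, prototype)]


def euler_sample(x0, prototype, steps=4):
    # Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps.
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, prototype)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2]       # stand-in for a Gaussian noise sample
prototype = [1.0, 2.0]    # assumed prototype in latent space
sample = euler_sample(noise, prototype, steps=4)
```

Because the path is a straight line, four Euler steps already land on the prototype up to floating-point error; curved learned fields generally need more care.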
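[AI-49] TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
[Quick read]: This paper addresses the poor performance of Vision-Language Models (VLMs) on continuous geometric reasoning. On Tangram-style spatial tasks the evaluated models fail systematically: average intersection over union (IoU) is only 0.41 on single-piece tasks and drops to 0.23 on two-piece composition, far below what children achieve. The key is a training-free, test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, mimicking the human cycle of trial, observation, and correction. A verifier-refiner agent runs recursive refinement loops that iteratively improve predictions from geometric-consistency feedback, raising IoU from 0.63 to 0.932 on medium-triangle cases without any parameter updates.
Link: https://arxiv.org/abs/2602.05570
Authors: Yikun Zong,Cheston Tan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures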
Abstract:Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link this https URL.
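The IoU metric and the shape of a propose-verify-refine loop can be sketched as follows. This is an illustration under strong simplifications, not the paper's agent: shapes are axis-aligned boxes, the "refiner" is a hand-written local search instead of a VLM, and the verifier scores candidates against a known target placement.

```python
def iou(a, b):
    # Boxes are (x_min, y_min, x_max, y_max).
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def refine(pred, target, rounds=20, step=0.25):
    # Keep a shifted candidate only if the verifier score (IoU) improves.
    best = iou(pred, target)
    for _ in range(rounds):
        for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cand = (pred[0] + dx, pred[1] + dy, pred[2] + dx, pred[3] + dy)
            score = iou(cand, target)
            if score > best:
                pred, best = cand, score
    return pred, best


target = (0.0, 0.0, 2.0, 2.0)
initial = (1.0, 1.0, 3.0, 3.0)
start = iou(initial, target)          # overlap 1, union 7
final, best = refine(initial, target)
```

In this toy case the initial IoU of 1/7 rises to 1.0 after four refinement rounds, with no change to any model parameters because there is no model.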
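[AI-50] Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
[Quick read]: This paper addresses the exploration-efficiency and difficulty-adaptation bottlenecks of Reinforcement Learning with Verifiable Rewards (RLVR), which it traces to an implicit advantage symmetry in Group Relative Advantage Estimation (GRAE). The key is Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. It asymmetrically suppresses the advantages of correct trajectories to encourage exploration of novel correct solutions, and it follows a curriculum-like schedule that prioritizes simpler samples first and gradually shifts to harder ones. Across seven benchmarks A-GRAE consistently improves GRPO and its variants on both large language models (LLMs) and multimodal large language models (MLLMs).
Link: https://arxiv.org/abs/2602.05548
Authors: Zhiqi Yu,Zhangquan Chen,Mengting Liu,Heye Zhang,Liangqiong Qu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: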
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions; (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration; (ii) learning efficiency is maximized by a curriculum-like transition: prioritizing simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
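The symmetry the abstract refers to can be checked numerically for standard group-relative advantages (reward minus group mean, divided by group standard deviation). The `suppress_positive` function and its `beta` factor are hypothetical and only illustrate what "asymmetrically suppressing the advantages of correct trajectories" could look like; they are not the A-GRAE rule from the paper.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    # Group-relative advantage: standardize rewards within one sampled group.
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def suppress_positive(advantages, beta=0.5):
    # Hypothetical asymmetric variant: shrink only the positive advantages.
    return [a * beta if a > 0 else a for a in advantages]


rewards = [1, 1, 0, 0, 0, 0, 0, 0]    # 2 of 8 sampled answers are correct
adv = group_advantages(rewards)

pos = sum(a for a in adv if a > 0)    # total weight on correct answers
neg = -sum(a for a in adv if a < 0)   # total weight on incorrect answers
asym = suppress_positive(adv)
```

Here `pos` and `neg` coincide, so the total push toward correct trajectories equals the total push away from incorrect ones; after the asymmetric rescaling the sum of advantages becomes negative.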
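[AI-51] Reasoning-guided Collaborative Filtering with Language Models for Explainable Recommendation
[Quick read]: This paper addresses two weaknesses of explainable recommendation based on Large Language Models (LLMs): collaborative filtering (CF) signals are overlooked, and recommendation and explanation are treated as separate tasks, which increases the memory footprint. The key is RGCF-XRec, a hybrid framework that injects reasoning-guided CF knowledge into a language model to produce explainable sequential recommendations in a single step. Its main components are: (1) reasoning-guided augmentation of CF knowledge through contextual prompting, to uncover latent preferences and interpretable reasoning paths; (2) an efficient scoring mechanism over four dimensions (coherence, completeness, relevance, and consistency) that filters noisy CF reasoning traces and keeps high-quality explanations; (3) a unified representation learning network that encodes collaborative and semantic signals into a structured prompt that conditions the LLM.
Link: https://arxiv.org/abs/2602.05544
Authors: Fahad Anwaar,Adil Mehmood Khan,Muhammad Khalid,Usman Zia,Kezhi Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: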
Abstract:Large Language Models (LLMs) exhibit potential for explainable recommendation systems but overlook collaborative signals, while prevailing methods treat recommendation and explanation as separate tasks, resulting in a memory footprint. We present RGCF-XRec, a hybrid framework that introduces reasoning-guided collaborative filtering (CF) knowledge into a language model to deliver explainable sequential recommendations in a single step. Theoretical grounding and empirical findings reveal that RGCF-XRec offers three key merits over leading CF-aware LLM-based methods: (1) reasoning-guided augmentation of CF knowledge through contextual prompting to discover latent preferences and interpretable reasoning paths; (2) an efficient scoring mechanism based on four dimensions: coherence, completeness, relevance, and consistency to mitigate noisy CF reasoning traces and retain high-quality explanations; (3) a unified representation learning network that encodes collaborative and semantic signals, enabling a structured prompt to condition the LLM for explainable sequential recommendation. RGCF-XRec demonstrates consistent improvements across Amazon datasets, Sports, Toys, and Beauty, comprising 642,503 user-item interactions. It improves HR@10 by 7.38% in Sports and 4.59% in Toys, along with ROUGE-L by 8.02% and 3.49%, respectively. It reduces the cold warm performance gap, achieving overall gains of 14.5% in cold-start and 11.9% in warm start scenarios, and enhances zero-shot HR@5 by 18.54% in Beauty and 23.16% in Toys, highlighting effective generalization and robustness. Moreover, RGCF-XRec achieves training efficiency with a lightweight LLaMA 3.2-3B backbone, ensuring scalability for real-world applications.
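For readers unfamiliar with the HR@K figures quoted in the abstract, the metric can be computed as below, assuming one held-out target item per user. The item identifiers and rankings are made up for illustration and do not come from the paper's datasets.

```python
def hit_rate_at_k(ranked_lists, targets, k=10):
    # Fraction of users whose held-out target appears in their top-k list.
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


# Hypothetical rankings for three users and their held-out items.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "f", "x"]

hr_at_2 = hit_rate_at_k(ranked, targets, k=2)   # only the first user is a hit
hr_at_3 = hit_rate_at_k(ranked, targets, k=3)   # first and second users are hits
```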
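[AI-52] Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach
[Quick read]: This paper studies conditional generation with diffusion models under hard constraints, where generated samples must satisfy a prescribed event with probability one. This requirement matters in safety-critical applications and rare-event simulation, and soft or reward-based guidance offers no such guarantee. The key is a principled conditional diffusion guidance framework built on Doob's h-transform, the martingale representation theorem, and the quadratic variation process. The guided dynamics add an explicit drift correction, involving the logarithmic gradient of a conditioning function, to a pretrained diffusion without modifying the pretrained score network. Using martingale and quadratic-variation identities, the authors propose two off-policy learning algorithms, based on a martingale loss and a martingale-covariation loss, that estimate h and its gradient from trajectories of the pretrained model alone. They also give non-asymptotic error bounds in total variation and Wasserstein distance that make explicit the effect of score-approximation and guidance-estimation errors.
Link: https://arxiv.org/abs/2602.05533
Authors: Zhengyi Guo,Wenpin Tang,Renyuan Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: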
Abstract:We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob’s h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples.
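For orientation, the textbook form of Doob's h-transform is written out below. The notation (drift $b$, diffusion coefficient $\sigma$, terminal event $A$, horizon $T$) is an assumption made for this sketch and need not match the paper's.

```latex
% Base (pretrained) dynamics, assumed notation:
%   dX_t = b(t, X_t)\,dt + \sigma(t)\,dW_t, \qquad t \in [0, T].
% Conditioning function for the hard constraint \{X_T \in A\}:
\[
  h(t, x) = \mathbb{P}\left( X_T \in A \mid X_t = x \right).
\]
% The process conditioned on \{X_T \in A\} solves the guided SDE
\[
  dX_t = \left[ b(t, X_t) + \sigma(t)\sigma(t)^{\top} \nabla_x \log h(t, X_t) \right] dt
         + \sigma(t)\,dW_t ,
\]
% and h(t, X_t) is a martingale under the base dynamics.
```

The extra drift term is the "logarithmic gradient of a conditioning function" mentioned in the abstract, and the martingale property of $h(t, X_t)$ is the kind of identity a martingale-based loss can exploit to estimate $h$ from unguided trajectories.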
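[AI-53] Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities
[Quick read]: This paper addresses the difficulty of detecting misalignment in large language models (LLMs), which may learn to conceal misbehavior during training. Black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. The key is Split Personality Training (SPT), which fine-tunes an "honest persona" into LoRA adapter parameters that stay inactive during normal operation. After the main model responds, the adapter is activated and a trigger string is inserted, so the honest persona can review the response with access to the main model's latent states and reveal knowledge hidden from external observers, such as the fictional biases the model was trained on. On the Anthropic Auditing Game Model Organism benchmark SPT reaches 96% overall accuracy, whereas Anthropic reports near 0%.
Link: https://arxiv.org/abs/2602.05532
Authors: Florian Dietz,William Wale,Oscar Gilg,Robert McCarthy,Felix Michalak,Gustavo Ewbank Rodrigues Danon,Miguelito de Guzman,Dietrich Klakow
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: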
Abstract:Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second "honest persona" into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model's latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
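[AI-54] Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
[Quick read]: This paper addresses a limitation of the capture-the-flag (CTF) benchmarks used to evaluate agentic LLMs on cybersecurity tasks: they are pointwise tests and say little about robustness and generalisation across alternative versions of the same source code. The key is the notion of CTF challenge families, in which a single CTF challenge is turned into a family of semantically equivalent variants through semantics-preserving program transformations, so that agent robustness to source-code changes can be evaluated in a controlled way while the exploit strategy stays fixed. The authors build a tool, Evolve-CTF, that generates such families from Python challenges with a range of transformations. Experiments show that models are remarkably robust to renaming and code-insertion transformations, but composed transformations and deeper obfuscation hurt performance because they demand more sophisticated tool use.
Link: https://arxiv.org/abs/2602.05523
Authors: Shahin Honarvar,Amber Gorzynski,James Lee-Jones,Harry Coppock,Marek Rei,Joseph Ryan,Alastair F. Donaldson
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: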
Abstract:Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks. However, existing pointwise benchmarks have limited ability to shed light on the robustness and generalisation abilities of agents across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used as the basis for generating a family of semantically-equivalent challenges via semantics-preserving program transformations. This enables controlled evaluation of agent robustness to source code transformations while keeping the underlying exploit strategy fixed. We introduce a new tool, Evolve-CTF, that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to intrusive renaming and code insertion-based transformations, but that composed transformations and deeper obfuscation affect performance by requiring more sophisticated use of tools. We also find that enabling explicit reasoning has little effect on solution success rates across challenge families. Our work contributes a valuable technique and tool for future LLM evaluations, and a large dataset characterising the capabilities of current state-of-the-art models in this domain.
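One of the simplest semantics-preserving transformations, identifier renaming, can be sketched with Python's standard `ast` module. This is not Evolve-CTF; the `RenameVariables` class, the toy challenge, and the name mapping are hypothetical, and the renaming is only safe here because no caller passes `flag_guess` as a keyword argument.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename selected identifiers without changing what the program computes."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

    def visit_arg(self, node):
        # Function parameters.
        if node.arg in self.mapping:
            node.arg = self.mapping[node.arg]
        return node


source = """
def check(flag_guess):
    secret = 'ctf{demo}'
    return flag_guess == secret
"""

tree = RenameVariables({"flag_guess": "v0", "secret": "v1"}).visit(ast.parse(source))
transformed = ast.unparse(tree)   # requires Python 3.9 or newer


def run(code, guess):
    # Execute a challenge variant and call its check() function.
    scope = {}
    exec(code, scope)
    return scope["check"](guess)
```

Both variants accept the same flag and reject the same wrong guesses, while the telling name `secret` no longer appears in the transformed source.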
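[AI-55] DECO: Decoupled Multimodal Diffusion Transformer for Bimanual Dexterous Manipulation with a Plugin Tactile Adapter
[Quick read]: This paper addresses how an embodied agent performing complex manipulation can fuse heterogeneous modalities (vision, proprioception, touch) efficiently, arguing that tightly coupled multimodal conditioning limits policy generalization. The key is the DECO framework, which decouples the multimodal conditioning inputs: image and action tokens interact through joint self attention, proprioceptive states and optional conditions are injected through adaptive layer normalization, and tactile signals enter through cross attention. A lightweight LoRA-based adapter fine-tunes the pretrained policy efficiently, which gives flexible multimodal integration and parameter-efficient transfer.
Link: https://arxiv.org/abs/2602.05513
Authors: Xukun Li,Yu Sun,Lei Zhang,Bosheng Huang,Yibo Peng,Yuan Meng,Haojun Jiang,Shaoxuan Xie,Guacai Yao,Alois Knoll,Zhenshan Bing,Xinlong Wang,Zhenguo Sun
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures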
Abstract:Overview of the Proposed DECO Framework. DECO is a DiT-based policy that decouples multimodal conditioning. Image and action tokens interact via joint self attention, while proprioceptive states and optional conditions are injected through adaptive layer normalization. Tactile signals are injected via cross attention, while a lightweight LoRA-based adapter is used to efficiently fine-tune the pretrained policy. DECO is also accompanied by DECO-50, a bimanual dexterous manipulation dataset with tactile sensing, consisting of 4 scenarios and 28 sub-tasks, covering more than 50 hours of data, approximately 5 million frames, and 8,000 successful trajectories.
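[AI-56] SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
[Quick read]: This paper addresses the high latency that autoregressive decoding imposes on Large Language Models (LLMs) in interactive multimedia applications. Speculative decoding reduces latency with a lightweight draft model, but deployment is limited by the cost and complexity of acquiring, tuning, and maintaining that draft model, and most approaches need auxiliary training or costly search. The key is SDFP, which builds the draft model without training: it prunes layers of the target LLM using the Fisher Information Trace (FIT), treating layer sensitivity as a proxy for output perturbation, and removes low-impact layers to obtain a compact draft that stays compatible with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained draft, and it delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution.
Link: https://arxiv.org/abs/2602.05499
Authors: Hanyu Wei,Zunhai Su,Peng Lu,Chao Li,Spandan Tiwari,Ashish Sirasao,Yuhan Dong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: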
Abstract:Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model’s output distribution, supporting low-latency multimedia applications.
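The draft-then-verify loop of speculative decoding, in its greedy form, can be sketched with toy deterministic "models". Nothing here is FIT pruning: `target_next` and `draft_next` are hypothetical stand-ins, and the draft is simply defined to disagree with the target in one situation. The point is that the output matches target-only greedy decoding while the target is consulted in fewer verification passes.

```python
def target_next(ctx):
    # Toy "full" model: the next token depends on the last two tokens.
    return (ctx[-1] * 3 + ctx[-2] + 1) % 7


def draft_next(ctx):
    # Toy "pruned" model: agrees with the target unless the last token is 5.
    return 0 if ctx[-1] == 5 else target_next(ctx)


def greedy(next_fn, prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(next_fn(out))
    return out


def speculative(prompt, n, k=4):
    out, verify_passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        # 1) The draft proposes k tokens autoregressively.
        proposal = list(out)
        for _ in range(k):
            proposal.append(draft_next(proposal))
        # 2) The target checks every proposed position
        #    (a single batched forward pass in a real system).
        verify_passes += 1
        for i in range(len(out), len(proposal)):
            expected = target_next(proposal[:i])
            if proposal[i] != expected:
                # Keep the accepted prefix and the target's own token.
                proposal = proposal[:i] + [expected]
                break
        else:
            # All k proposals accepted: the target adds one bonus token.
            proposal.append(target_next(proposal))
        out = proposal
    return out[:len(prompt) + n], verify_passes


prompt = [1, 2]
tokens, passes = speculative(prompt, 12)
reference = greedy(target_next, prompt, 12)
```

With this prompt, the 12 generated tokens are identical to the target's greedy output and are produced in 4 verification passes instead of 12 sequential target steps.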
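[AI-57] A Unified Framework for Rethinking Policy Divergence Measures in GRPO
[Quick read]: This paper addresses the trade-off between training stability and exploration in Reinforcement Learning with Verified Reward (RLVR) as used to improve the reasoning of Large Language Models (LLMs). Existing methods such as GRPO (Group Relative Policy Optimization) constrain policy divergence by clipping likelihood ratios, which keeps updates stable but can limit effective exploration. The paper proposes a unified clipping framework built on a general notion of policy divergence (covering likelihood ratios, KL divergences, and other measures) and uses it to analyze how different divergence measures affect exploration and performance. Its key contribution is to identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a policy divergence constraint, and to show theoretically that it is equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions. This promotes stronger exploration while retaining the simplicity of GRPO-style methods. Experiments on mathematical reasoning benchmarks show that adding the KL3 constraint to GRPO improves both training stability and final performance.
Link: https://arxiv.org/abs/2602.05494
Authors: Qingyuan Wu,Yuhui Wang,Simon Sinong Zhan,Yanning Dai,Shilong Deng,Sarra Habchi,Qi Zhu,Matthias Gallé,Chao Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: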
Abstract:Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
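A small numerical check may help with the estimator. The sketch assumes that "KL3" denotes the estimator commonly written as k3(r) = (r - 1) - log r, with r = p(x)/q(x) and x sampled from q; the abstract itself does not spell out the formula, and the two distributions below are made-up values.

```python
import math


def k3(ratio):
    # ratio = p(x) / q(x) for a sample x drawn from q.
    return (ratio - 1.0) - math.log(ratio)


q = [0.5, 0.3, 0.2]   # sampling policy (assumed values)
p = [0.4, 0.4, 0.2]   # comparison policy (assumed values)

# Exact KL(q || p) and the exact expectation of k3 under q.
exact_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
expected_k3 = sum(qi * k3(pi / qi) for qi, pi in zip(q, p))

# The estimator is not symmetric around ratio = 1.
below, above = k3(0.5), k3(1.5)
```

The two expectations agree, and every individual k3 value is non-negative. Because k3(0.5) exceeds k3(1.5), a bound of the form k3(r) ≤ δ permits larger upward than downward ratio deviations, which is at least consistent with the asymmetric clipping described in the abstract.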
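[AI-58] Sovereign-by-Design A Reference Architecture for AI and Blockchain Enabled Systems
[Quick read]: This paper addresses the difficulty of realizing digital sovereignty at the architectural level in modern software-intensive systems. Existing work treats governance, compliance, and security in isolation and offers little guidance on turning sovereignty into an actionable architectural property. The key is a Sovereign Reference Architecture that treats digital sovereignty as a first-class architectural quality attribute and integrates self-sovereign identity, blockchain-based trust and auditability, sovereign data governance, and Generative AI deployed under explicit architectural control. The architecture explicitly captures the dual role of Generative AI as a source of governance risk and as an enabler of compliance, accountability, and continuous assurance. It thereby bridges regulatory intent and system design and offers a principled basis for auditable, evolvable, and jurisdiction-aware AI-enabled systems.
Link: https://arxiv.org/abs/2602.05486
Authors: Matteo Esposito,Lodovica Marchesi,Roberto Tonelli,Valentina Lenarduzzi
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: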
Abstract:Digital sovereignty has emerged as a central concern for modern software-intensive systems, driven by the dominance of non-sovereign cloud infrastructures, the rapid adoption of Generative AI, and increasingly stringent regulatory requirements. While existing initiatives address governance, compliance, and security in isolation, they provide limited guidance on how sovereignty can be operationalized at the architectural level. In this paper, we argue that sovereignty must be treated as a first-class architectural property rather than a purely regulatory objective. We introduce a Sovereign Reference Architecture that integrates self-sovereign identity, blockchain-based trust and auditability, sovereign data governance, and Generative AI deployed under explicit architectural control. The architecture explicitly captures the dual role of Generative AI as both a source of governance risk and an enabler of compliance, accountability, and continuous assurance when properly constrained. By framing sovereignty as an architectural quality attribute, our work bridges regulatory intent and concrete system design, offering a coherent foundation for building auditable, evolvable, and jurisdiction-aware AI-enabled systems. The proposed reference architecture provides a principled starting point for future research and practice at the intersection of software architecture, Generative AI, and digital sovereignty.
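[AI-59] Phi-Former: A Pairwise Hierarchical Approach for Compound-Protein Interactions Prediction
[Quick read]: This paper addresses the limited biological plausibility of current deep learning models for compound-protein interaction (CPI) prediction. These models improve efficiency and accuracy but do not account for the role of molecular fragments (motifs or functional groups) as the primary units of biological recognition and binding. The key is Phi-former, a pairwise hierarchical interaction representation learning method that models interactions at three levels (atom-atom, motif-motif, and atom-motif), explicitly reflecting the hierarchy of molecular recognition in biological systems. Intra-level and inter-level learning pipelines let the different interaction levels benefit one another, which improves performance and yields interpretable results that may support rational drug design and precision medicine.
Link: https://arxiv.org/abs/2602.05479
Authors: Zhe Wang,Zijing Liu,Chencheng Xu,Yuan Yao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to BIBM 2025. 6 pages, 5 figures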
Abstract:Drug discovery remains time-consuming, labor-intensive, and expensive, often requiring years and substantial investment per drug candidate. Predicting compound-protein interactions (CPIs) is a critical component in this process, enabling the identification of molecular interactions between drug candidates and target proteins. Recent deep learning methods have successfully modeled CPIs at the atomic level, achieving improved efficiency and accuracy over traditional energy-based approaches. However, these models do not always align with chemical realities, as molecular fragments (motifs or functional groups) typically serve as the primary units of biological recognition and binding. In this paper, we propose Phi-former, a pairwise hierarchical interaction representation learning method that addresses this gap by incorporating the biological role of motifs in CPIs. Phi-former represents compounds and proteins hierarchically and employs a pairwise pre-training framework to model interactions systematically across atom-atom, motif-motif, and atom-motif levels, reflecting how biological systems recognize molecular partners. We design intra-level and inter-level learning pipelines that make different interaction levels mutually beneficial. Experimental results demonstrate that Phi-former achieves superior performance on CPI-related tasks. A case study shows that our method accurately identifies specific atoms or motifs activated in CPIs, providing interpretable model explanations. These insights may guide rational drug design and support precision medicine applications.
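[AI-60] LMMRec: LLM-driven Motivation-aware Multimodal Recommendation
[Quick read]: This paper addresses two problems that motivation-based recommender systems face when fusing multimodal information: achieving stable cross-modal alignment amid noise, and identifying features that reflect the same underlying motivation across modalities. The key is LMMRec (LLM-driven Motivation-aware Multimodal Recommendation), a model-agnostic framework that uses a large language model (LLM) for deep semantic priors and motivation understanding. It extracts fine-grained user and item motivations from text with chain-of-thought prompting, models textual and interaction-based motivations with a dual-encoder architecture for cross-modal alignment, and introduces a Motivation Coordination Strategy and an Interaction-Text Correspondence Method that use contrastive learning and momentum updates to mitigate noise and semantic drift. Experiments on three datasets show up to a 4.98% performance improvement.
Link: https://arxiv.org/abs/2602.05474
Authors: Yicheng Di,Zhanjie Zhang,Yun Wangc,Jinren Liue,Jiaqi Yanf,Jiyu Wei,Xiangyu Chend,Yuan Liu
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: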
Abstract:Motivation-based recommendation systems uncover user behavior drivers. Motivation modeling, crucial for decision-making and content preference, explains recommendation generation. Existing methods often treat motivation as latent variables from interaction data, neglecting heterogeneous information like review text. In multimodal motivation fusion, two challenges arise: 1) achieving stable cross-modal alignment amid noise, and 2) identifying features reflecting the same underlying motivation across modalities. To address these, we propose LLM-driven Motivation-aware Multimodal Recommendation (LMMRec), a model-agnostic framework leveraging large language models for deep semantic priors and motivation understanding. LMMRec uses chain-of-thought prompting to extract fine-grained user and item motivations from text. A dual-encoder architecture models textual and interaction-based motivations for cross-modal alignment, while Motivation Coordination Strategy and Interaction-Text Correspondence Method mitigate noise and semantic drift through contrastive learning and momentum updates. Experiments on three datasets show LMMRec achieves up to a 4.98% performance improvement.
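[AI-61] ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation
[Quick read]: This paper addresses the "reward bottleneck" that Large Language Models (LLMs) face on the way to expert-level reasoning: traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution, which prevents models from developing a self-contained understanding of reasoning. The key is ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a framework grounded in the principle of Cognitive Synergy that unifies problem posing, solving, and judging within a single policy model. By coupling adversarial learning with instructive verbal feedback, the model internalizes evaluative criteria directly from raw corpora and turns external critiques into an endogenous reasoning faculty, which improves accuracy, cross-domain generalization, and self-correction rates.
Link: https://arxiv.org/abs/2602.05472
Authors: Yiwen Duan,Jing Ye,Xinpei Zhao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: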
Abstract:The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent reward bottleneck: traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.
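[AI-62] Thermodynamic Limits of Physical Intelligence
[Quick read]: This paper addresses the high energy consumption that accompanies the capabilities of modern AI systems and proposes a quantitative framework linking intelligence to physical efficiency. It introduces two complementary bits-per-joule metrics: (1) Thermodynamic Epiplexity per Joule, the number of bits of structural information about a theoretical environment-instance variable newly encoded in an agent's internal state per unit of measured energy within a stated boundary; and (2) Empowerment per Joule, the embodied sensorimotor channel capacity (control information) per expected energetic cost over a fixed horizon. The two metrics capture recognition (model-building) and action influence, respectively. Drawing on stochastic thermodynamics, the authors derive a Landauer-scale closed-cycle benchmark for epiplexity acquisition as a corollary, and show that information gain and in-boundary dissipation are tightly linked only under explicit reset/reuse and boundary-closure assumptions. They also propose a unified efficiency framework with a minimal checklist (boundary/energy accounting, coarse-graining/noise, horizon/reset, and cost conventions) to reduce ambiguity and support consistent bits-per-joule comparisons.
Link: https://arxiv.org/abs/2602.05463
Authors: Koichi Takahashi,Yusuke Hayashi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: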
Abstract:Modern AI systems achieve remarkable capabilities at the cost of substantial energy consumption. To connect intelligence to physical efficiency, we propose two complementary bits-per-joule metrics under explicit accounting conventions: (1) Thermodynamic Epiplexity per Joule – bits of structural information about a theoretical environment-instance variable newly encoded in an agent’s internal state per unit measured energy within a stated boundary – and (2) Empowerment per Joule – the embodied sensorimotor channel capacity (control information) per expected energetic cost over a fixed horizon. These provide two axes of physical intelligence: recognition (model-building) versus control (action influence). Drawing on stochastic thermodynamics, we show how a Landauer-scale closed-cycle benchmark for epiplexity acquisition follows as a corollary of a standard thermodynamic-learning inequality under explicit subsystem assumptions, and we clarify how Landauer-scaled costs act as closed-cycle benchmarks under explicit reset/reuse and boundary-closure assumptions; conversely, we give a simple decoupling construction showing that without such assumptions – and without charging for externally prepared low-entropy resources (e.g., memory) crossing the boundary – information gain and in-boundary dissipation need not be tightly linked. For empirical settings where the latent structure variable is unavailable, we align the operational notion of epiplexity with compute-bounded MDL epiplexity and recommend reporting MDL-epiplexity / compression-gain surrogates as companions. Finally, we propose a unified efficiency framework that reports both metrics together with a minimal checklist of boundary/energy accounting, coarse-graining/noise, horizon/reset, and cost conventions to reduce ambiguity and support consistent bits-per-joule comparisons, and we sketch connections to energy-adjusted scaling analyses.
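For a sense of scale, the textbook Landauer limit can be turned into a bits-per-joule ceiling. The sketch below is only that textbook bound (erasing one bit dissipates at least k_B·T·ln 2 of heat), not the paper's epiplexity or empowerment metrics; the function names and the 300 K example are assumptions made for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_joules_per_bit(temperature_kelvin: float) -> float:
    """Textbook Landauer limit: minimum heat to erase one bit at temperature T."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return K_B * temperature_kelvin * math.log(2)


def landauer_bits_per_joule(temperature_kelvin: float) -> float:
    """Reciprocal of the Landauer limit: an idealized bits-per-joule ceiling."""
    return 1.0 / landauer_joules_per_bit(temperature_kelvin)


def fraction_of_landauer_ceiling(bits: float, joules: float,
                                 temperature_kelvin: float) -> float:
    """How close a measured bits/joule figure is to the idealized ceiling.

    `bits` and `joules` are hypothetical measurements supplied by the caller.
    """
    return (bits / joules) / landauer_bits_per_joule(temperature_kelvin)


if __name__ == "__main__":
    print(landauer_joules_per_bit(300.0))   # roughly 2.87e-21 J per bit
    print(landauer_bits_per_joule(300.0))   # roughly 3.48e20 bits per joule
```

As the abstract stresses, such a figure is only meaningful as a closed-cycle benchmark once the system boundary and the reset/reuse conventions are stated.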
[AI-63] Ontology-Driven Robotic Specification Synthesis
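[Quick Read]: This paper addresses the lack of an effective bridge between high-level objectives and formal, executable specifications in robotic system engineering, particularly for safety-critical and mission-critical applications. The key to the solution is an ontology-driven, hierarchical method, the Robotic System Task to Model Transformation Methodology (RSTM2), which uses stochastic timed Petri nets with resources to run Monte Carlo simulations at multiple levels, supporting architectural trades, resource allocation, and performance analysis under uncertainty at the mission, system, and subsystem levels. In addition, ontological concepts are used to build explainable-AI assistants that further facilitate fully autonomous specification synthesis, with particular relevance to complex multi-robot systems such as NASA CADRE.
Link: https://arxiv.org/abs/2602.05456
Authors: Maksym Figat, Ryan M. Mackey, Michel D. Ingham
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 8 pages, 9 figures, 3 tables, journal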
Abstract:This paper addresses robotic system engineering for safety- and mission-critical applications by bridging the gap between high-level objectives and formal, executable specifications. The proposed method, Robotic System Task to Model Transformation Methodology (RSTM2), is an ontology-driven, hierarchical approach using stochastic timed Petri nets with resources, enabling Monte Carlo simulations at mission, system, and subsystem levels. A hypothetical case study demonstrates how the RSTM2 method supports architectural trades, resource allocation, and performance analysis under uncertainty. Ontological concepts further enable explainable AI-based assistants, facilitating fully autonomous specification synthesis. The methodology offers particular benefits to complex multi-robot systems, such as the NASA CADRE mission, representing decentralized, resource-aware, and adaptive autonomous systems of the future.
[AI-64] Benchmarking Affordance Generalization with BusyBox
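[Quick Read]: This paper addresses the limited affordance generalization of Vision-Language-Action (VLA) models in the physical world, that is, whether a model can manipulate previously unseen objects based on familiar physical features (such as shape and structure). The core challenge is that even VLA models that generalize well in vision and language space still struggle to recognize and exploit an object's physical functional properties for effective manipulation. The key to the solution is BusyBox, a physical benchmark whose modular design (6 modules with switches, sliders, buttons, and other components) allows constructing many configurations that differ in visual appearance but keep the same set of affordances, enabling systematic evaluation of VLA generalization across variants. The study also shows that even strong open-weights VLAs (such as π₀.₅ and GR00T-N1.6) struggle to perform reliably on this task, underscoring the difficulty and research value of the problem.
Link: https://arxiv.org/abs/2602.05441
Authors: Dean Fortier, Timothy Adamson, Tess Hellebrekers, Teresa LaScala, Kofi Ennin, Michael Murray, Andrey Kolobov, Galen Mullins
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: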
Abstract:Vision-Language-Action (VLA) models have been attracting the attention of researchers and practitioners thanks to their promise of generalization. Although single-task policies still offer competitive performance, VLAs are increasingly able to handle commands and environments unseen in their training set. While generalization in vision and language space is undoubtedly important for robust versatile behaviors, a key meta-skill VLAs need to possess is affordance generalization – the ability to manipulate new objects with familiar physical features. In this work, we present BusyBox, a physical benchmark for systematic semi-automatic evaluation of VLAs’ affordance generalization. BusyBox consists of 6 modules with switches, sliders, wires, buttons, a display, and a dial. The modules can be swapped and rotated to create a multitude of BusyBox variations with different visual appearances but the same set of affordances. We empirically demonstrate that generalization across BusyBox variants is highly challenging even for strong open-weights VLAs such as $\pi_{0.5}$ and GR00T-N1.6. To encourage the research community to evaluate their own VLAs on BusyBox and to propose new affordance generalization experiments, we have designed BusyBox to be easy to build in most robotics labs. We release the full set of CAD files for 3D-printing its parts as well as a bill of materials for (optionally) assembling its electronics. We also publish a dataset of language-annotated demonstrations that we collected using the common bimanual Mobile Aloha robot on the canonical BusyBox configuration. All of the released materials are available at this https URL.
[AI-65] Day-Ahead Electricity Price Forecasting for Volatile Markets Using Foundation Models with Regularization Strategy AAAI’26 AAAI2026
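[Quick Read]: This paper addresses the insufficient accuracy of electricity price forecasting (EPF) in highly volatile markets, in particular the difficulty traditional statistical and deep learning (DL) models have in capturing complex temporal dependencies and integrating heterogeneous data. The key to the solution is to apply time series foundation models (TSFMs) together with a spike regularization strategy. Several TSFMs (Tiny Time Mixers, MOIRAI, MOMENT, and TimesFM) are compared with traditional models (ARIMA, LSTM, and CNN-LSTM) on half-hourly wholesale market price data from Singapore. The TSFMs consistently outperform the traditional approaches across evaluation settings, with up to a 37.4% improvement in mean absolute percentage error (MAPE), offering a practical path to more accurate forecasting and decision support in volatile electricity markets.
Link: https://arxiv.org/abs/2602.05430
Authors: Kritchanat Ponyuenyong, Pengyu Tu, Jia Wei Tan, Wei Soon Cheong, Jamie Ng Suat Ling, Lianlian Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to AI4TS Workshop @ AAAI’26 (Oral and Poster), see this https URL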
Abstract:Electricity price forecasting (EPF) is essential for energy market stakeholders (e.g. grid operators, energy traders, policymakers) but remains challenging due to the inherent volatility and nonlinearity of price signals. Traditional statistical and deep learning (DL) models often struggle to capture complex temporal dependencies and integrate heterogeneous data effectively. While time series foundation models (TSFMs) have shown strong performance in general time series forecasting tasks, such as traffic forecasting and weather forecasting, their effectiveness in day-ahead EPF, particularly in volatile markets, remains underexplored. This paper presents a spike regularization strategy and evaluates a wide range of TSFMs, including Tiny Time Mixers (TTMs), MOIRAI, MOMENT, and TimesFM, against traditional statistical and DL models such as Autoregressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM), and Convolutional Neural Network - LSTM (CNN-LSTM) using half-hourly wholesale market data with volatile trends in Singapore. Exogenous factors (e.g. weather and calendar variables) are also incorporated into models where applicable. Results demonstrate that TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in MAPE across various evaluation settings. The findings offer practical guidance for improving forecast accuracy and decision-making in volatile electricity markets.
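The headline number above is an improvement in MAPE. As a minimal sketch, the snippet below computes MAPE and a relative improvement between two error values. Reading "37.4% improvement" as a relative reduction in MAPE is an assumption, and the prices and error values are made up for illustration; they are not results from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when an actual value is zero, which matters for electricity
    prices that can touch zero; callers must handle that case themselves.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def relative_improvement(baseline_error, model_error):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical half-hourly prices and forecasts.
    prices = [100.0, 200.0]
    forecast = [110.0, 180.0]
    print(mape(prices, forecast))
    # Hypothetical baseline and model MAPE values.
    print(relative_improvement(20.0, 12.52))
```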
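[AI-66] THOR: Inductive Link Prediction over Hyper-Relational Knowledge Graphs
[Quick Read]: This paper addresses the limited inductive generalization of link prediction over hyper-relational knowledge graphs (HKGs). Existing methods are mostly confined to the transductive setting: they can only predict within the vocabulary of a specific knowledge graph and cannot handle unseen entities or relations, which limits their applicability in practice. The key to the solution is the THOR framework, which builds relation and entity foundation graphs to model structural invariances across facts in HKGs, and then uses two parallel graph encoders followed by a transformer decoder to support efficient masked training and fully inductive inference, improving generalization across different HKG datasets.
Link: https://arxiv.org/abs/2602.05424
Authors: Weijian Yu, Yuhuan Lu, Dingqi Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: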
Abstract:Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts where a relation connects two entities, modern KGs observe an increasing number of hyper-relational facts, where an arbitrary number of qualifiers associated with a triplet provide auxiliary information to further describe the rich semantics of the triplet, which can effectively boost the reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and subsequently can only make predictions within the same vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce both relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR is designed to learn from the two foundation graphs with two parallel graph encoders followed by a transformer decoder, which supports efficient masked training and fully-inductive inference. We conduct a thorough evaluation of THOR in hyper-relational link prediction tasks on 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvement over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.
[AI-67] Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models
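[Quick Read]: This paper addresses the limited prediction accuracy and stability of conventional POD-based (Proper Orthogonal Decomposition) surrogate models in real-world coastal-ocean modelling. The core of the solution is a flexible Koopman autoencoder architecture that incorporates meteorological forcings and boundary conditions and uses a learned linear temporal evolution operator in latent space, with eigenvalue regularization to improve temporal stability; this is combined with temporal unrolling to achieve accurate long-term predictions. Experiments on three test cases with distinct dynamical regimes show that the method outperforms POD-based surrogates, with relative root-mean-squared errors of 0.01-0.13 and R² values of 0.65-0.996, and inference speed-ups of 300-1400x, which meets the needs of practical applications.
Link: https://arxiv.org/abs/2602.05416
Authors: Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)
Comments: Submitted for peer-review in a journal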
Abstract:While POD-based surrogates are widely explored for hydrodynamic applications, the use of Koopman Autoencoders for real-world coastal-ocean modelling remains relatively limited. This paper introduces a flexible Koopman autoencoder formulation that incorporates meteorological forcings and boundary conditions, and systematically compares its performance against POD-based surrogates. The Koopman autoencoder employs a learned linear temporal operator in latent space, enabling eigenvalue regularization to promote temporal stability. This strategy is evaluated alongside temporal unrolling techniques for achieving stable and accurate long-term predictions. The models are assessed on three test cases spanning distinct dynamical regimes, with prediction horizons up to one year at 30-minute temporal resolution. Across all cases, the Koopman autoencoder with temporal unrolling yields the best overall accuracy compared to the POD-based surrogates, achieving relative root-mean-squared errors of 0.01-0.13 and $R^2$ values of 0.65-0.996. Prediction errors are largest for current velocities, and smallest for water surface elevations. Compared with in-situ observations, the surrogate yields a -0.65% to 12% change in water surface elevation prediction error relative to the prediction errors of the physics-based model. These error levels, corresponding to a few centimeters, are acceptable for many practical applications, while inference speed-ups of 300-1400x enable workflows such as ensemble forecasting and long climate simulations for coastal-ocean modelling.
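To illustrate why a linear latent operator makes eigenvalue regularization possible, here is a minimal sketch. It assumes a toy 2x2 operator so the eigenvalues have a closed form and no external libraries are needed, and it uses one common penalty (squared excess of eigenvalue modulus over 1). The abstract does not specify the authors' regularizer, so the penalty and the names `spectral_penalty` and `rollout` are illustrative assumptions.

```python
import cmath


def eigenvalues_2x2(k):
    """Closed-form eigenvalues of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = k
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


def spectral_penalty(k, max_modulus=1.0):
    """Penalize eigenvalues whose modulus exceeds max_modulus.

    Eigenvalues inside the unit circle give decaying or neutral latent
    dynamics under repeated application of k; those outside blow up.
    """
    return sum(max(0.0, abs(lam) - max_modulus) ** 2
               for lam in eigenvalues_2x2(k))


def rollout(k, z0, steps):
    """Unroll the linear latent dynamics z_{t+1} = K z_t for a few steps."""
    states = [tuple(z0)]
    for _ in range(steps):
        z = states[-1]
        states.append((k[0][0] * z[0] + k[0][1] * z[1],
                       k[1][0] * z[0] + k[1][1] * z[1]))
    return states


if __name__ == "__main__":
    unstable = [[0.5, 0.0], [0.0, 1.5]]      # one eigenvalue outside the unit circle
    damped_rotation = [[0.0, -0.9], [0.9, 0.0]]  # eigenvalues +/-0.9i
    print(spectral_penalty(unstable))
    print(spectral_penalty(damped_rotation))
    print(rollout(damped_rotation, (1.0, 0.0), 2))
```

In a real surrogate the operator is much larger and is learned jointly with the encoder and decoder; a numerical eigen-solver or a structured parameterization would replace the closed form used here.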
[AI-68] Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language
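[Quick Read]: This paper addresses a key obstacle to inclusive speech technology in low-resource languages such as Akan: the lack of impaired speech data. The key to the solution is a carefully curated speech corpus containing 50.01 hours of audio recordings from native Akan speakers with different classes of speech impairment (stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder). All recordings were collected in controlled, supervised environments in which participants freely described pre-selected images, and they are accompanied by transcriptions and metadata (speaker demographics, class of impairment, recording environment, and device). The corpus supports research and development of automatic disordered speech recognition systems and assistive speech technology in low-resource settings.
Link: https://arxiv.org/abs/2602.05406
Authors: Isaac Wiafe, Akon Obu Ekpezu, Sumaya Ahmed Salihs, Elikem Doe Atsakpo, Fiifi Baffoe Payin Winful, Jamal-Deen Abdulai
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: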
Abstract:The lack of impaired speech data hinders advancements in the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents a curated corpus of speech samples from native Akan speakers with speech impairment. The dataset comprises 50.01 hours of audio recordings cutting across four classes of impaired speech, namely stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder. Recordings were done in controlled supervised environments where participants described pre-selected images in their own words. The resulting dataset is a collection of audio recordings, transcriptions, and associated metadata on speaker demographics, class of impairment, recording environment and device. The dataset is intended to support research in low-resource automatic disordered speech recognition systems and assistive speech technology.
[AI-69] Advancing Opinion Dynamics Modeling with Neural Diffusion-Convection-Reaction Equation
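[Quick Read]: This paper addresses the insufficient integration of physical priors in existing opinion dynamics models: incomplete priors make it difficult to model dynamics at the local, global, and endogenous levels in a unified way, and conventional penalty-based constraints cannot deeply encode physical priors, which leads to optimization pathologies and a discrepancy between latent representations and physical transparency. The key to the solution is a physical view that interprets opinion evolution as a Diffusion-Convection-Reaction (DCR) system, together with OPINN, a physics-informed neural framework built on Neural ODEs that tightly couples neural networks with physical priors to enable interpretable modeling and accurate forecasting of opinion dynamics.
Link: https://arxiv.org/abs/2602.05403
Authors: Chenghua Gong, Yihang Jiang, Hao Li, Rui Sun, Juyuan Zhang, Tianjun Gu, Liming Pan, Linyuan Lü
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: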
Abstract:Advanced opinion dynamics modeling is vital for deciphering social behavior, emphasizing its role in mitigating polarization and securing cyberspace. To synergize mechanistic interpretability with data-driven flexibility, recent studies have explored the integration of Physics-Informed Neural Networks (PINNs) for opinion modeling. Despite this promise, existing methods are tailored to incomplete priors, lacking a comprehensive physical system to integrate dynamics from local, global, and endogenous levels. Moreover, penalty-based constraints adopted in existing methods struggle to deeply encode physical priors, leading to optimization pathologies and discrepancy between latent representations and physical transparency. To this end, we offer a physical view to interpret opinion dynamics via a Diffusion-Convection-Reaction (DCR) system inspired by interacting particle theory. Building upon Neural ODEs, we define the neural opinion dynamics to coordinate neural networks with physical priors, and further present OPINN, a physics-informed neural framework for opinion dynamics modeling. Evaluated on real-world and synthetic datasets, OPINN achieves state-of-the-art performance in opinion evolution forecasting, offering a promising paradigm for the nexus of cyber, physical, and social systems.
[AI-70] Assessing Electricity Demand Forecasting with Exogenous Data in Time Series Foundation Models AAAI’26 AAAI2026
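[Quick Read]: This paper examines how effectively time-series foundation models use exogenous features in electricity demand forecasting, with particular attention to how performance differs across geographic and climatic conditions. The key to the approach is a systematic evaluation of the zero-shot forecasting performance of several foundation models (such as Chronos-2, MOIRAI, and MOMENT) in two markets, Singapore and Australia, at hourly and daily granularities, under three feature configurations (all features, selected features, and target-only), to probe the interaction between model architecture and geographic context. The study finds that synergistic architectural mechanisms (such as TinyTimeMixers' channel mixing and Chronos-2's grouped attention) are central to exploiting exogenous features, and that climate stability strongly moderates the advantage of foundation models: in regions with a stable climate (such as Singapore), a simple baseline often performs better, indicating that foundation models are not universally superior and that domain-specific design is needed.
Link: https://arxiv.org/abs/2602.05390
Authors: Wei Soon Cheong, Lian Lian Jiang, Jamie Ng Suat Ling (Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 Figure and 3 Tables. Accepted to AI4TS Workshop @ AAAI’26 as an oral presentation (see this https URL)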
Abstract:Time-series foundation models have emerged as a new paradigm for forecasting, yet their ability to effectively leverage exogenous features – critical for electricity demand forecasting – remains unclear. This paper empirically evaluates foundation models capable of modeling cross-channel correlations against a baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities. We systematically assess MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos-2 under three feature configurations: all features, selected features, and target-only. Our findings reveal highly variable effectiveness: while Chronos-2 achieves the best performance among foundation models (in zero-shot settings), the simple baseline frequently outperforms all foundation models in Singapore’s stable climate, particularly for short-term horizons. Model architecture proves critical, with synergistic architectural implementations (TTM’s channel-mixing, Chronos-2’s grouped attention) consistently leveraging exogenous features, while other approaches show inconsistent benefits. Geographic context emerges as equally important, with foundation models demonstrating advantages primarily in variable climates. These results challenge assumptions about universal foundation model superiority and highlight the need for domain-specific models, specifically in the energy domain.
[AI-71] Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening
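[Quick Read]: This paper addresses the new security challenges that arise as large language models (LLMs) evolve into autonomous agents, in particular that most existing defenses follow a mandatory checking paradigm decoupled from the agent lifecycle, which results in inefficient defense and high false positive rates. The key to the solution is the Spider-Sense framework, based on Intrinsic Risk Sensing (IRS), which lets an agent maintain latent vigilance and trigger defenses only when it perceives risk. It also introduces a hierarchical defense mechanism that resolves known patterns efficiently via lightweight similarity matching and escalates ambiguous cases to deep internal reasoning for precision, removing the reliance on external models. The framework attains the lowest Attack Success Rate (ASR) and False Positive Rate (FPR) with only a marginal latency overhead of 8.3%.
Link: https://arxiv.org/abs/2602.05386
Authors: Zhenxiong Yu, Zhi Yang, Zhiheng Jin, Shuhe Wang, Heng Zhang, Yanlin Fei, Lingfeng Zeng, Fangqi Lou, Shuo Zhang, Tu Hu, Jingping Liu, Rongze Chen, Xingyu Zhu, Kunyi Wang, Chaofa Yuan, Xin Guo, Zhaowei Liu, Feipeng Zhang, Jie Huang, Huacan Wang, Ronghao Chen, Liwen Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: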
Abstract:As large language models (LLMs) evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefined stages of the agent lifecycle. In this work, we argue that effective agent security should be intrinsic and selective rather than architecturally decoupled and mandatory. We propose Spider-Sense, an event-driven defense framework based on Intrinsic Risk Sensing (IRS), which allows agents to maintain latent vigilance and trigger defenses only upon risk perception. Once triggered, Spider-Sense invokes a hierarchical defense mechanism that trades off efficiency and precision: it resolves known patterns via lightweight similarity matching while escalating ambiguous cases to deep internal reasoning, thereby eliminating reliance on external models. To facilitate rigorous evaluation, we introduce S$^2$Bench, a lifecycle-aware benchmark featuring realistic tool execution and multi-stage attacks. Extensive experiments demonstrate that Spider-Sense achieves competitive or superior defense performance, attaining the lowest Attack Success Rate (ASR) and False Positive Rate (FPR), with only a marginal latency overhead of 8.3%.
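[AI-72] Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation
[Quick Read]: This paper addresses the lack of systematic evaluation of domain-specific medical large language models (LLMs) in ophthalmic clinical settings, where the safety and accuracy of their answers to patient queries are hard to guarantee. The study is a cross-sectional test of four small medical LLMs with fewer than 10 billion parameters, graded with the standardized S.C.O.R.E. framework; GPT-4-Turbo is used as an automated grader and compared with the grades of three ophthalmologists of differing seniority to assess the feasibility of LLM-assisted evaluation and its agreement with clinical expert judgement. The results show that although some models such as Meerkat-7B perform well, hallucinations and insufficient clinical depth remain, which supports using LLM-based automated evaluation for large-scale benchmarking while emphasizing the need for hybrid automated and clinician review to ensure safe deployment.
Link: https://arxiv.org/abs/2602.05381
Authors: Ting Fang Tan, Kabilan Elangovan, Andreas Pollreisz, Kevin Bryan Dy, Wei Yan Ng, Joy Le Yi Wong, Jin Liyuan, Chrystie Quek Wan Ning, Ashley Shuen Ying Hong, Arun James Thirunavukarasu, Shelley Yin-His Chang, Jie Yao, Dylan Hong, Wang Zhaoran, Amrita Gupta, Daniel SW Ting
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: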
Abstract:Domain-specific large language models are increasingly used to support patient education, triage, and clinical decision making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs (Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20) in answering ophthalmology-related patient queries and assessed the feasibility of LLM-based evaluation against clinician grading. In this cross-sectional study, 180 ophthalmology patient queries were answered by each model, generating 2160 responses. Models were selected for parameter sizes under 10 billion to enable resource-efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework, which assesses safety, consensus and context, objectivity, reproducibility, and explainability, with ratings assigned on a five-point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5 percent of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall, with Spearman rho of 0.80 and Kendall tau of 0.67, though Senior Consultants graded more conservatively. Overall, medical LLMs demonstrated potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus, supporting the feasibility of LLM-based evaluation for large-scale benchmarking and the need for hybrid automated and clinician review frameworks to guide safe clinical deployment.
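The agreement statistics named in the abstract (Spearman rho and Kendall tau) can be sketched in plain Python. The version below uses average ranks and the tau-b variant so that tied Likert scores are handled. Whether the study used tau-b is not stated, so that choice is an assumption, and the scores in the usage lines are invented for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects the denominator for ties in x or y."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from every count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


if __name__ == "__main__":
    clinician = [1, 2, 3, 4, 5]   # hypothetical clinician grades
    llm_judge = [1, 2, 3, 5, 4]   # hypothetical automated grades
    print(spearman_rho(clinician, llm_judge))
    print(kendall_tau_b(clinician, llm_judge))
```

Kendall tau is typically smaller in magnitude than Spearman rho on the same data, which is consistent with the reported pair of 0.80 and 0.67.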
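[AI-73] RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
[Quick Read]: This paper addresses the trade-off between efficiency and performance when deploying large language models (LLMs) under extreme quantization, in particular the performance degradation that residual binarization suffers from feature co-adaptation. The key to the solution is the RaBiT framework, which algorithmically enforces a residual hierarchy: each binary path is derived sequentially from a single shared full-precision weight, so that every path corrects the error of the preceding one and redundant learning across paths is suppressed. A robust initialization that prioritizes functional preservation over mere weight approximation stabilizes the process, yielding improved accuracy and inference speed at 2-bit quantization.
Link: https://arxiv.org/abs/2602.05367
Authors: Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: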
Abstract:Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm 1$) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090.
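The idea that each binary path corrects the error of the preceding one can be shown with the classic greedy residual binarization of a fixed weight vector. This is a post-hoc sketch, not RaBiT's quantization-aware training procedure or its initialization; the function names are illustrative. The only established fact relied on is that, for sign binarization, the mean absolute value is the least-squares optimal scale.

```python
def sign(x):
    """Sign with zero mapped to +1 so every entry stays in {-1, +1}."""
    return 1.0 if x >= 0 else -1.0


def binarize(values):
    """Approximate values by scale * signs, with scale = mean absolute value."""
    scale = sum(abs(v) for v in values) / len(values)
    return scale, [sign(v) for v in values]


def residual_binarize(weights, num_paths=2):
    """Greedy residual binarization: each path binarizes what is left over."""
    residual = list(weights)
    paths = []
    for _ in range(num_paths):
        scale, signs = binarize(residual)
        paths.append((scale, signs))
        # The next path only sees the error the previous paths left behind.
        residual = [r - scale * s for r, s in zip(residual, signs)]
    return paths


def reconstruct(paths):
    """Sum the scaled binary paths back into an approximate weight vector."""
    length = len(paths[0][1])
    return [sum(scale * signs[i] for scale, signs in paths)
            for i in range(length)]


if __name__ == "__main__":
    weights = [1.0, -3.0]
    one_path = reconstruct(residual_binarize(weights, num_paths=1))
    two_paths = reconstruct(residual_binarize(weights, num_paths=2))
    print(one_path)   # [2.0, -2.0]
    print(two_paths)  # [1.0, -3.0]
```

In this toy case the second path removes the remaining error entirely; for general weight vectors it only reduces it. The failure mode the abstract describes arises when such paths are trained in parallel and stop playing this error-correcting role.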
[AI-74] PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
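[Quick Read]: This paper addresses the difficulty current web-based agents have in discovering and using hidden contextual information in multi-step decision tasks, with particular attention to judgement failures and hallucinated reasoning when surface-level signals are misleading. The key to the solution is the PATHWAYS benchmark, which contains 250 complex tasks that require dynamically exploring and integrating hidden evidence, allowing a systematic assessment of agents' limitations in adaptive investigation, evidence integration, and judgement override. Experiments show that existing models can locate relevant pages but rarely retrieve the decisive hidden evidence, and often claim to rely on evidence they never accessed; even when they find the correct information, they often fail to integrate it into the final decision, revealing fundamental weaknesses of current architectures in cognitive flexibility and reasoning reliability.
Link: https://arxiv.org/abs/2602.05354
Authors: Shifat E. Arman, Syed Nazmus Sakib, Tapodhir Karmakar Taton, Nafiul Haque, Shahrear Bin Amin
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 35 pages, 13 figures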
Abstract:We introduce PATHWAYS, a benchmark of 250 multi-step decision tasks that test whether web-based agents can discover and correctly use hidden contextual information. Across both closed and open models, agents typically navigate to relevant pages but retrieve decisive hidden evidence in only a small fraction of cases. When tasks require overturning misleading surface-level signals, performance drops sharply to near chance accuracy. Agents frequently hallucinate investigative reasoning by claiming to rely on evidence they never accessed. Even when correct context is discovered, agents often fail to integrate it into their final decision. Providing more explicit instructions improves context discovery but often reduces overall accuracy, revealing a tradeoff between procedural compliance and effective judgement. Together, these results show that current web agent architectures lack reliable mechanisms for adaptive investigation, evidence integration, and judgement override.
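[AI-75] ProAct: Agentic Lookahead in Interactive Environments
[Quick Read]: This paper addresses the poor performance of Large Language Model (LLM) agents in interactive environments that require long-horizon planning, where the core bottleneck is the compounding of errors when simulating future states. The key to the solution is the ProAct framework, which uses a two-stage training strategy. The first stage, Grounded LookAhead Distillation (GLAD), fine-tunes the agent on trajectories derived from environment-based search, compressing complex search trees into concise causal reasoning chains so that the agent internalizes lookahead reasoning without the computational overhead of inference-time search. The second stage introduces the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator that calibrates value estimates using lightweight environment rollouts, providing a low-variance signal that stabilizes policy-gradient algorithms such as PPO and GRPO without relying on expensive model-based value approximation.
Link: https://arxiv.org/abs/2602.05327
Authors: Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: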
Abstract:Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The codes and models are available at this https URL
[AI-76] GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
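[Quick Read]: This paper studies two problems of generative model (GM) based methods in Offline Safe Reinforcement Learning (OSRL): first, they lack the ability to "stitch" optimal transitions from the dataset, in particular to combine high-quality state-action pairs from suboptimal trajectories; second, they struggle to balance reward targets and cost targets when the two conflict. The key to the solution is a new algorithm, Goal-Assisted Stitching (GAS). It improves stitching by augmenting and relabeling the data at the transition level, enabling the construction of high-quality trajectories. It introduces novel goal functions, trained with expectile regression, that estimate the best reward and cost goals achievable from the dataset, so that policy training adapts to different reward-cost trade-offs rather than relying on human-specified targets. It also reshapes the dataset toward a more uniform reward-cost return distribution to further improve training stability and efficiency.
Link: https://arxiv.org/abs/2602.05323
Authors: Zifan Liu, Xinran Li, Shibo Chen, Jun Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: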
Abstract:Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to “stitch” optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction compared to human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
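Expectile regression, which the abstract names as the training objective for the goal functions, replaces the symmetric squared loss with an asymmetric one. The sketch below shows the loss and a simple fixed-point estimate of the expectile of a sample; how GAS wires this into its goal functions is not described in the abstract, so this is a generic illustration with hypothetical return values.

```python
def expectile_loss(residual, tau):
    """Asymmetric squared loss; residual = target - prediction.

    Under-predictions (residual > 0) are weighted by tau and
    over-predictions by 1 - tau, so tau near 1 pushes the fit
    toward the upper end of the targets.
    """
    weight = tau if residual > 0 else 1.0 - tau
    return weight * residual * residual


def expectile(samples, tau, iterations=100):
    """Minimizer of the summed expectile loss, via reweighted averaging."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    estimate = sum(samples) / len(samples)  # tau = 0.5 is the plain mean
    for _ in range(iterations):
        weights = [tau if x > estimate else 1.0 - tau for x in samples]
        estimate = (sum(w * x for w, x in zip(weights, samples))
                    / sum(weights))
    return estimate


if __name__ == "__main__":
    returns = [0.0, 1.0]  # hypothetical returns observed from one state
    print(expectile(returns, 0.5))  # the mean
    print(expectile(returns, 0.9))  # pulled toward the best observed return
```

With tau close to 1 the estimate approaches the best return seen in the data without ever exceeding it, which is why expectiles are a common way to estimate what is achievable from an offline dataset.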
[AI-77] Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier Certificates
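[Quick read]: This paper addresses the failure of safety and stability verification for deep reinforcement learning (DRL) controllers when the system dynamics are uncertain in real deployments. Existing methods based on neural Lyapunov and barrier certificates give guarantees only under ideal, unperturbed dynamics and do not cover real-world deviations. The authors propose a method for synthesizing robust neural Lyapunov-barrier certificates: robustness conditions are defined through Lipschitz continuity, and the training objectives combine adversarial training, a Lipschitz neighborhood bound, and global Lipschitz regularization, so that the safety and stability guarantees hold under bounded perturbations. Experiments on Inverted Pendulum and 2D Docking show improved certified robustness bounds (up to 4.6 times) and improved empirical success rates under strong perturbations (up to 2.4 times).
Link: https://arxiv.org/abs/2602.05311
Authors: Chengxiao Wang,Haoze Wu,Gagandeep Singh
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: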
Abstract:Neural Lyapunov and barrier certificates have recently been used as powerful tools for verifying the safety and stability properties of deep reinforcement learning (RL) controllers. However, existing methods offer guarantees only under fixed ideal unperturbed dynamics, limiting their reliability in real-world applications where dynamics may deviate due to uncertainties. In this work, we study the problem of synthesizing robust neural Lyapunov barrier certificates that maintain their guarantees under perturbations in system dynamics. We formally define a robust Lyapunov barrier function and specify sufficient conditions based on Lipschitz continuity that ensure robustness against bounded perturbations. We propose practical training objectives that enforce these conditions via adversarial training, Lipschitz neighborhood bound, and global Lipschitz regularization. We validate our approach in two practically relevant environments, Inverted Pendulum and 2D Docking. The former is a widely studied benchmark, while the latter is a safety-critical task in autonomous systems. We show that our methods significantly improve both certified robustness bounds (up to 4.6 times) and empirical success rates under strong perturbations (up to 2.4 times) compared to the baseline. Our results demonstrate the effectiveness of training robust neural certificates for safe RL under perturbations in dynamics.
[AI-78] PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences
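[Quick read]: This paper evaluates large language models (LLMs) on high-level business negotiation, focusing on strategic reasoning, theory of mind, and economic value creation. The key is PieArena, a multi-agent interaction benchmark built on realistic negotiation scenarios from an MBA course at an elite business school. A systematic comparison of frontier LLMs (such as GPT-5) with trained business-school students shows that the LLMs match or outperform the students. The paper also finds that agentic scaffolding yields asymmetric gains across model tiers, and it provides a multi-dimensional behavioral profile for assessing how mature LLMs are in complex social interaction tasks.
Link: https://arxiv.org/abs/2602.05302
Authors: Chris Zhu(Department of Statistics and Data Science, Yale University),Sasha Cui(Department of Statistics and Data Science, Yale University),Will Sanok Dufallo(Department of Philosophy, Yale University),Runzhi Jin(School of Law, University of California, Berkeley),Zhen Xu(Bloomberg),Linjun Zhang(Department of Statistics, Rutgers University),Daylian Cain(Yale School of Management)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: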
Abstract:We present an in-depth evaluation of LLMs’ ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We find systematic evidence of AGI-level performance in which a representative frontier agent (GPT-5) matches or outperforms trained business-school students, despite a semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint-intentionality agentic scaffolding and find asymmetric gains, with large improvements for mid- and lower-tier LMs and diminishing returns for frontier LMs. Beyond deal outcomes, PieArena provides a multi-dimensional negotiation behavioral profile, revealing novel cross-model heterogeneity, masked by deal-outcome-only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.
[AI-79] Aspect-Aware MOOC Recommendation in a Heterogeneous Network
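[Quick read]: This paper targets two bottlenecks in MOOC recommendation. Traditional methods such as collaborative filtering and content-based filtering suffer from data sparsity and over-specialization, and existing graph models depend on manually predefined metapaths, which capture only superficial structural relationships and demand substantial domain-expert effort and engineering cost. The proposed framework, AMR (Aspect-aware MOOC Recommendation), automatically discovers metapaths through bi-directional walks, uses a bi-LSTM encoder to extract aspect-aware representations from the semantic content of the nodes on each metapath, and injects these path representations as edge features into the learner-learner and KC-KC (knowledge concept) subgraphs to produce fine-grained, semantically informed KC recommendations. On the MOOCCube and PEEK datasets, AMR outperforms state-of-the-art graph neural network baselines on key metrics such as HR@K and nDCG@K.
Link: https://arxiv.org/abs/2602.05297
Authors: Seongyeub Chu,Jongwoo Kim,Mun Yong Yi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: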
Abstract:MOOC recommendation systems have received increasing attention to help learners navigate and select preferred learning content. Traditional methods such as collaborative filtering and content-based filtering suffer from data sparsity and over-specialization. To alleviate these limitations, graph-based approaches have been proposed; however, they still rely heavily on manually predefined metapaths, which often capture only superficial structural relationships and impose substantial burdens on domain experts as well as significant engineering costs. To overcome these limitations, we propose AMR (Aspect-aware MOOC Recommendation), a novel framework that models path-specific multiple aspects by embedding the semantic content of nodes within each metapath. AMR automatically discovers metapaths through bi-directional walks, derives aspect-aware path representations using a bi-LSTM-based encoder, and incorporates these representations as edge features in the learner-learner and KC-KC subgraphs to achieve fine-grained semantically informed KC recommendations. Extensive experiments on the large-scale MOOCCube and PEEK datasets show that AMR consistently outperforms state-of-the-art graph neural network baselines across key metrics such as HR@K and nDCG@K. Further analysis confirms that AMR effectively captures rich path-specific aspect information, allowing more accurate recommendations than those methods that rely solely on predefined metapaths. The code will be available upon acceptance.
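The two metrics named in the abstract, HR@K and nDCG@K, can be sketched as follows for the common leave-one-out setting with a single held-out relevant item per learner. That evaluation protocol is an assumption made here (the abstract does not state it), and the item identifiers are made up.

```python
import math


def hit_rate_at_k(ranked_items, relevant_item, k):
    """1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if relevant_item in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, relevant_item, k):
    """nDCG with one relevant item: the ideal DCG is 1, so the score is
    the discounted gain 1 / log2(rank + 1) at the item's 1-based rank."""
    for position, item in enumerate(ranked_items[:k]):
        if item == relevant_item:
            return 1.0 / math.log2(position + 2)  # position is 0-based
    return 0.0


# Hypothetical ranked knowledge concepts for one learner.
ranking = ["kc3", "kc7", "kc1", "kc9"]
hr_at_2 = hit_rate_at_k(ranking, "kc1", 2)
hr_at_3 = hit_rate_at_k(ranking, "kc1", 3)
ndcg_at_3 = ndcg_at_k(ranking, "kc1", 3)
```

HR@K only records whether the item was retrieved, while nDCG@K also rewards placing it nearer the top; reported scores are usually averages of these per-learner values.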
[AI-80] Position: Universal Time Series Foundation Models Rest on a Category Error
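[Quick read]: This position paper argues that research on "Universal Foundation Models for Time Series" rests on a fundamental error: it treats time series as a semantic modality while ignoring the heterogeneity of the underlying generative processes (for example, finance versus fluid dynamics). As a result, a single monolithic model degrades under distributional drift into an expensive and ineffective "Generic Filter". The central contribution is the "Autoregressive Blindness Bound", a theoretical limit proving that models relying only on history cannot predict intervention-driven regime shifts. The paper advocates replacing the goal of universality with a Causal Control Agent paradigm, in which an agent uses external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight just-in-time adaptors. It closes by calling for benchmarks to shift from "Zero-Shot Accuracy" to "Drift Adaptation Speed" in order to favor robust, control-theoretic system design.
Link: https://arxiv.org/abs/2602.05287
Authors: Xilin Dai,Wanxu Cai,Zhijian Xu,Qiang Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Position Paper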
Abstract:This position paper argues that the pursuit of “Universal Foundation Models for Time Series” rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive “Generic Filters” that fail to generalize under distributional drift. To address this, we introduce the “Autoregressive Blindness Bound,” a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from “Zero-Shot Accuracy” to “Drift Adaptation Speed” to prioritize robust, control-theoretic systems.
[AI-81] HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction
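[Quick read]: This paper addresses two gaps in healthcare facility visit prediction: spatial dependencies are ignored, and predictions are unreliable in abnormal situations. Existing methods usually treat the task as time-series forecasting, without modeling the intrinsic spatial dependencies among different types of healthcare facilities, and they fail to give reliable predictions during events such as public health emergencies. The proposed uncertainty-aware spatiotemporal framework, HealthMamba, has three core components: (i) a unified spatiotemporal context encoder that fuses heterogeneous static and dynamic information, (ii) a novel graph state space model, GraphMamba, for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module that integrates three uncertainty quantification mechanisms.
Link: https://arxiv.org/abs/2602.05286
Authors: Dahai Yu,Lin Jiang,Rongchao Xu,Guang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: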
Abstract:Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.
[AI-82] Hallucination-Resistant Security Planning with a Large Language Model
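[Quick read]: This paper addresses the decision risk that arises when large language models (LLMs) are used for security management tasks such as incident response planning, where unreliability and hallucination are significant problems. The key is a structured framework that embeds the LLM in an iterative loop: the model generates candidate actions, which are checked for consistency with system constraints and lookahead predictions. When consistency is low, the framework abstains from the generated actions and collects external feedback (for example, by evaluating actions in a digital twin), then uses in-context learning (ICL) to refine the candidates. The hallucination risk can be controlled by tuning the consistency threshold, and a theoretical bound on the regret of ICL is established under certain assumptions. Experiments show recovery times reduced by up to 30%.
Link: https://arxiv.org/abs/2602.05279
Authors: Kim Hammar,Tansu Alpcan,Emil Lupu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted to IEEE/IFIP Network Operations and Management Symposium 2026. To appear in the conference proceedings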
Abstract:Large language models (LLMs) are promising tools for supporting security management tasks, such as incident response planning. However, their unreliability and tendency to hallucinate remain significant challenges. In this paper, we address these challenges by introducing a principled framework for using an LLM as decision support in security management. Our framework integrates the LLM in an iterative loop where it generates candidate actions that are checked for consistency with system constraints and lookahead predictions. When consistency is low, we abstain from the generated actions and instead collect external feedback, e.g., by evaluating actions in a digital twin. This feedback is then used to refine the candidate actions through in-context learning (ICL). We prove that this design makes it possible to control the hallucination risk by tuning the consistency threshold. Moreover, we establish a bound on the regret of ICL under certain assumptions. To evaluate our framework, we apply it to an incident response use case where the goal is to generate a response and recovery plan based on system logs. Experiments on four public datasets show that our framework reduces recovery times by up to 30% compared to frontier LLMs.
[AI-83] Beyond Cosine Similarity
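[Quick read]: This paper addresses a limitation of cosine similarity: in vector spaces it captures only linear relationships, which makes it hard to model the complex nonlinear structure of real semantic spaces. The key is the derivation of an upper bound on the dot product that is tighter than the classical Cauchy-Schwarz inequality, which leads to the recos similarity metric. recos normalizes the dot product by the sorted vector components, relaxing the condition for perfect similarity from strict linear dependence to ordinal concordance and thereby recognizing a broader class of relationships. Across 11 embedding models of different types, recos consistently outperforms cosine similarity and correlates more strongly with human judgments on standard Semantic Textual Similarity (STS) benchmarks.
Link: https://arxiv.org/abs/2602.05266
Authors: Xinbo Ai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 2 figures, 1 theorem, 3 corollaries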
Abstract:Cosine similarity, the standard metric for measuring semantic similarity in vector spaces, is mathematically grounded in the Cauchy-Schwarz inequality, which inherently limits it to capturing linear relationships–a constraint that fails to model the complex, nonlinear structures of real-world semantic spaces. We advance this theoretical underpinning by deriving a tighter upper bound for the dot product than the classical Cauchy-Schwarz bound. This new bound leads directly to recos, a similarity metric that normalizes the dot product by the sorted vector components. recos relaxes the condition for perfect similarity from strict linear dependence to ordinal concordance, thereby capturing a broader class of relationships. Extensive experiments across 11 embedding models–spanning static, contextualized, and universal types–demonstrate that recos consistently outperforms traditional cosine similarity, achieving higher correlation with human judgments on standard Semantic Textual Similarity (STS) benchmarks. Our work establishes recos as a mathematically principled and empirically superior alternative, offering enhanced accuracy for semantic analysis in complex embedding spaces.
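The abstract does not give the recos formula. One plausible reading, offered here purely as an assumption, is that the tighter bound comes from the rearrangement inequality: the dot product of two vectors never exceeds the dot product of their sorted copies, which in turn never exceeds the product of their norms. The sketch below normalizes by that sorted dot product. The function name `rearrangement_similarity` is hypothetical, and the paper's actual definition (including how zero or negative bounds are handled) may differ.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Standard cosine similarity (Cauchy-Schwarz normalization)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def sorted_dot_bound(x, y):
    """Rearrangement-inequality upper bound on dot(x, y)."""
    return dot(sorted(x), sorted(y))


def rearrangement_similarity(x, y):
    """Dot product normalized by the sorted-component bound.
    Assumption: only defined here when the bound is positive."""
    bound = sorted_dot_bound(x, y)
    if bound <= 0:
        raise ValueError("sketch only covers a positive sorted-dot bound")
    return dot(x, y) / bound


# y is a monotone but nonlinear function of x (y_i = x_i ** 2).
x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 9.0]
cos_xy = cosine(x, y)                      # below 1: x and y are not collinear
rank_xy = rearrangement_similarity(x, y)   # 1.0: same ordering of components
```

In this toy case the two vectors order their components identically, so the rearrangement-based score reaches its maximum while cosine similarity does not, which matches the abstract's description of relaxing linear dependence to ordinal concordance.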
[AI-84] Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents
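[Quick read]: This paper addresses two core problems of existing benchmarks for embodied agents: severe data contamination, and a lack of scene specificity for unseen 3D environments, which together prevent a meaningful assessment of an agent's real capabilities before deployment in actual households. The key is a dynamic in-situ task generation method inspired by human cognition, implemented as a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, closing a loop between task execution and task generation so that tasks are generated continuously. In the evolution stage, task graph modeling recombines and reuses existing tasks to generate new ones without external data. Across 10 unseen scenes the system automatically generated 87,876 physically reasonable tasks covering everyday cognitive capabilities, and the resulting evaluation reveals marked weaknesses of current SOTA models in basic perception, 3D interaction awareness, and sensitivity to task type.
Link: https://arxiv.org/abs/2602.05249
Authors: Xinyi He,Ying Yang,Chuanjian Fu,Sihan Guo,Songchun Zhu,Lifeng Fan,Zhenliang Zhang,Yujia Peng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: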
Abstract:As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and encompassing essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in-situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness and show high sensitivity to task types in reasoning. These sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments.
[AI-85] EGSS: Entropy-guided Stepwise Scaling for Reliable Software Engineering
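[Quick read]: This paper addresses two challenges in applying generative AI to software engineering tasks: the significant computational overhead of deploying large ensembles, and the lack of a reliable mechanism for selecting among candidate solutions, which limits the achievable gains. The proposed Entropy-Guided Stepwise Scaling (EGSS) balances efficiency and effectiveness dynamically through entropy-guided adaptive search, combined with robust test-suite augmentation. It substantially reduces inference-time token usage while maintaining accuracy, improving performance on SWE-Bench-Verified by 5-10% and reaching a new state of the art among open-source large language models when paired with GLM-4.6.
Link: https://arxiv.org/abs/2602.05242
Authors: Chenhui Mao,Yuanting Lei,Zhixiang Wei,Ming Liang,Zhixiang Wang,Jingxuan Xu,Dajun Chen,Wei Jiang,Yong Li
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: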
Abstract:Agentic Test-Time Scaling (TTS) has delivered state-of-the-art (SOTA) performance on complex software engineering tasks such as code generation and bug fixing. However, its practical adoption remains limited due to significant computational overhead, primarily driven by two key challenges: (1) the high cost associated with deploying excessively large ensembles, and (2) the lack of a reliable mechanism for selecting the optimal candidate solution, ultimately constraining the performance gains that can be realized. To address these challenges, we propose Entropy-Guided Stepwise Scaling (EGSS), a novel TTS framework that dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation. Extensive experiments on SWE-Bench-Verified demonstrate that EGSS consistently boosts performance by 5-10% across all evaluated models. Specifically, it increases the resolved ratio of Kimi-K2-Instruct from 63.2% to 72.2%, and GLM-4.6 from 65.8% to 74.6%. Furthermore, when paired with GLM-4.6, EGSS achieves a new state-of-the-art among open-source large language models. In addition to these accuracy improvements, EGSS reduces inference-time token usage by over 28% compared to existing TTS methods, achieving simultaneous gains in both effectiveness and computational efficiency.
[AI-86] Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models
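[Quick read]: This paper addresses the lack of interpretability of deep learning models for brain tumour detection, with the aim of increasing trust in AI decisions in healthcare. The key is to integrate several Explainable AI (XAI) techniques, namely Gradient-weighted Class Activation Mapping (GRAD-CAM), Layer-wise Relevance Propagation (LRP), and SHapley Additive exPlanations (SHAP), into a layered explanation framework: GRAD-CAM identifies important spatial regions, LRP provides pixel-level relevance, and SHAP quantifies feature contributions. The combination explains the model's reasoning from broad regions down to pixel-level detail, shows greater explanatory power than any single XAI method, and still explains predictions when the tumour is only partially visible.
Link: https://arxiv.org/abs/2602.05240
Authors: Patrick McGonagle,William Farrelly,Kevin Curran
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: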
Abstract:This study explores the integration of multiple Explainable AI (XAI) techniques to enhance the interpretability of deep learning models for brain tumour detection. A custom Convolutional Neural Network (CNN) was developed and trained on the BraTS 2021 dataset, achieving 91.24% accuracy in distinguishing between tumour and non-tumour regions. This research combines Gradient-weighted Class Activation Mapping (GRAD-CAM), Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) to provide comprehensive insights into the model’s decision-making process. This multi-technique approach successfully identified both full and partial tumours, offering layered explanations ranging from broad regions of interest to pixel-level details. GRAD-CAM highlighted important spatial regions, LRP provided detailed pixel-level relevance and SHAP quantified feature contributions. The integrated approach effectively explained model predictions, including cases with partial tumour visibility thus showing superior explanatory power compared to individual XAI methods. This research enhances transparency and trust in AI-driven medical imaging analysis by offering a more comprehensive perspective on the model’s reasoning. The study demonstrates the potential of integrated XAI techniques in improving the reliability and interpretability of AI systems in healthcare, particularly for critical tasks like brain tumour detection.
[AI-87] Balanced Anomaly-guided Ego-graph Diffusion Model for Inductive Graph Anomaly Detection KDD KDD’26
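[Quick read]: This paper addresses two key problems in graph anomaly detection (GAD). First, most existing models adopt a transductive learning paradigm that assumes a static graph, so they adapt poorly to dynamically evolving graph structures. Second, anomalous nodes are extremely rare, which causes severe class imbalance and hurts generalization to unseen anomalies. The two problems reinforce each other: static transductive frameworks limit effective data augmentation, while class imbalance worsens model distortion in inductive learning settings. The proposed data-centric framework combines dynamic graph modeling with balanced anomaly synthesis. It introduces a discrete ego-graph diffusion model that captures the local topology of anomalies to generate ego-graphs aligned with the anomalous structural distribution, and a curriculum anomaly augmentation mechanism that dynamically adjusts synthetic data generation during training, focusing on underrepresented anomaly patterns to improve detection and generalization.
Link: https://arxiv.org/abs/2602.05232
Authors: Chunyu Wei,Siyuan He,Yu Wang,Yueguo Chen,Yunhai Wang,Bing Bai,Yidong Zhang,Yong Xie,Shunming Zhang,Fei Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures, Accepted by ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26)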
Abstract:Graph anomaly detection (GAD) is crucial in applications like fraud detection and cybersecurity. Despite recent advancements using graph neural networks (GNNs), two major challenges persist. At the model level, most methods adopt a transductive learning paradigm, which assumes static graph structures, making them unsuitable for dynamic, evolving networks. At the data level, the extreme class imbalance, where anomalous nodes are rare, leads to biased models that fail to generalize to unseen anomalies. These challenges are interdependent: static transductive frameworks limit effective data augmentation, while imbalance exacerbates model distortion in inductive learning settings. To address these challenges, we propose a novel data-centric framework that integrates dynamic graph modeling with balanced anomaly synthesis. Our framework features: (1) a discrete ego-graph diffusion model, which captures the local topology of anomalies to generate ego-graphs aligned with anomalous structural distribution, and (2) a curriculum anomaly augmentation mechanism, which dynamically adjusts synthetic data generation during training, focusing on underrepresented anomaly patterns to improve detection and generalization. Experiments on five datasets demonstrate the effectiveness of our framework.
[AI-88] ZeroS: Zero-Sum Linear Attention for Efficient Transformers NEURIPS2025
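[Quick read]: This paper addresses why linear attention, despite reducing complexity to O(N), underperforms: it is restricted to convex combinations, which permit only additive information blending, and it suffers from a uniform accumulated weight bias that dilutes attention over long sequences. The proposed Zero-Sum Linear Attention (ZeroS) removes the constant zero-order term 1/t and reweights the remaining zero-sum softmax residuals. The resulting attention weights are mathematically stable and can be positive or negative, which enables contrastive operations and theoretically expands the set of representable functions. While keeping O(N) complexity, ZeroS matches or exceeds standard softmax attention on several sequence modeling benchmarks.
Link: https://arxiv.org/abs/2602.05230
Authors: Jiecheng Lu,Xu Han,Yan Sun,Viresh Pati,Yubin Kim,Siddhartha Somani,Shihao Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Camera-ready version. Accepted at NeurIPS 2025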
Abstract:Linear attention methods offer Transformers O(N) complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term 1/t and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining O(N) complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
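To make the phrase "removing the constant zero-order term 1/t" concrete, the toy sketch below subtracts the uniform weight 1/t from ordinary softmax weights over t positions and checks that the residuals sum to zero and take both signs. It is only an illustration of that one idea under assumptions made here: it does not implement linear attention, and it omits the paper's reweighting step. The name `zero_sum_residuals` and the scores are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_sum_residuals(scores):
    """Softmax weights minus the uniform baseline 1/t.
    The residuals always sum to zero and can be negative."""
    t = len(scores)
    return [w - 1.0 / t for w in softmax(scores)]


# Hypothetical attention scores over t = 3 positions.
scores = [2.0, 0.0, -1.0]
weights = softmax(scores)               # convex combination: all >= 0, sum to 1
residuals = zero_sum_residuals(scores)  # zero-sum: mixed signs, sum to 0
```

Because the residuals are no longer constrained to be non-negative, a weighted sum of values built from them can subtract one position's value from another's, which is the kind of contrastive operation the abstract attributes to ZeroS.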
[AI-89] Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink
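[Quick read]: This paper addresses the loss of safety alignment that harmful fine-tuning causes in large language models, which creates significant safety risks. The key is a statistic called sink divergence, which is used to identify and intervene in the learning of harmful patterns. Attention heads show two different signs of sink divergence, and the number of heads with positive sink divergence grows as the model becomes more harmful. This motivates the separable sink divergence hypothesis: the attention heads that learn harmful patterns during fine-tuning can be separated by the sign of their sink divergence. Based on this hypothesis, the authors design Surgery, a fine-tuning-stage defense with a regularizer for sink divergence suppression that steers attention heads toward the negative sink divergence group, reducing the model's tendency to learn and amplify harmful patterns. Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on BeaverTails, HarmBench, and SorryBench, respectively.
Link: https://arxiv.org/abs/2602.05228
Authors: Guozhi Liu,Weiwei Lin,Tiansheng Huang,Ruichao Mo,Qi Mu,Xiumin Wang,Li Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: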
Abstract:Harmful fine-tuning can invalidate safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named sink divergence for each attention head and observe that different attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads of positive sink divergence increases along with the increase of the model’s harmfulness when undergoing harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis: attention heads associating with learning harmful patterns during fine-tuning are separable by their sign of sink divergence. Based on the hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model’s tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available on this https URL.
[AI-90] Semantic Search over 9 Million Mathematical Theorems
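[Quick read]: This paper addresses the difficulty of searching for mathematical results: existing tools mostly retrieve whole papers, while mathematicians and theorem-proving agents often need a specific theorem, lemma, or proposition that answers a query. The authors study semantic theorem retrieval over a large unified corpus of 9.2 million research-level theorem statements from arXiv and other sources. The key is to represent each theorem by a short natural-language description and to evaluate systematically how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated set of theorem-search queries written by professional mathematicians, the approach substantially outperforms existing baselines at both theorem-level and paper-level retrieval, showing that semantic theorem search is feasible and effective at web scale.
Link: https://arxiv.org/abs/2602.05216
Authors: Luke Alexander,Eric Leonen,Sophie Szeto,Artemii Remizov,Ignacio Tejeda,Giovanni Inchiostro,Vasily Ilin
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); History and Overview (math.HO)
Comments: Feedback is welcome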
Abstract:Searching for mathematical results remains difficult: most existing tools retrieve entire papers, while mathematicians and theorem-proving agents often seek a specific theorem, lemma, or proposition that answers a query. While semantic search has seen rapid progress, its behavior on large, highly technical corpora such as research-level mathematical theorems remains poorly understood. In this work, we introduce and study semantic theorem retrieval at scale over a unified corpus of 9.2 million theorem statements extracted from arXiv and seven other sources, representing the largest publicly available corpus of human-authored, research-level theorems. We represent each theorem with a short natural-language description as a retrieval representation and systematically analyze how representation context, language model choice, embedding model, and prompting strategy affect retrieval quality. On a curated evaluation set of theorem-search queries written by professional mathematicians, our approach substantially improves both theorem-level and paper-level retrieval compared to existing baselines, demonstrating that semantic theorem search is feasible and effective at web scale. The theorem search tool is available at this https URL, and the dataset is available at this https URL.
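[AI-91] Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering
[Quick read]: This paper addresses retrieval bias in retrieval-augmented generation (RAG) systems that draw on multiple heterogeneous knowledge bases (KBs) with different knowledge densities. In Tibetan medicine in particular, encyclopedia entries are dense and easy to match, so they tend to dominate retrieval and crowd out the more authoritative evidence in classics and clinical papers. The solution combines two complementary methods. First, DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and to prioritize authoritative sources when the task calls for it. Second, an alignment graph guides evidence fusion and coverage-aware packing, which improves cross-KB evidence coverage and avoids the redundancy or omissions of naive concatenation. Together they improve cross-KB verification and overall performance while maintaining strong faithfulness and citation correctness.
Link: https://arxiv.org/abs/2602.05195
Authors: Fengxian Chen,Zhilong Tao,Jiaxuan Li,Yunlong Li,Qingguo Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: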
Abstract:Retrieval-augmented generation (RAG) promises grounded question answering, yet domain settings with multiple heterogeneous knowledge bases (KBs) remain challenging. In Chinese Tibetan medicine, encyclopedia entries are often dense and easy to match, which can dominate retrieval even when classics or clinical papers provide more authoritative evidence. We study a practical setting with three KBs (encyclopedia, classics, and clinical papers) and a 500-query benchmark (cutoff K=5) covering both single-KB and cross-KB questions. We propose two complementary methods to improve traceability, reduce hallucinations, and enable cross-KB verification. First, DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and to prioritize authoritative sources when appropriate. Second, we use an alignment graph to guide evidence fusion and coverage-aware packing, improving cross-KB evidence coverage without relying on naive concatenation. All answers are generated by a lightweight generator, openPangu-Embedded-7B. Experiments show consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving the best CrossEv@5 while maintaining strong faithfulness and citation correctness.
[AI-92] First Proof
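[Quick read]: This paper aims to assess how accurately current AI systems can answer research-level mathematics questions. The key is a set of ten math questions that arose naturally in the authors' own research. The questions had not been shared publicly before, and the answers are known to the question authors but remain encrypted for a short time, which yields a rigorous and realistic research-level test for measuring AI performance on complex mathematical reasoning.
Link: https://arxiv.org/abs/2602.05192
Authors: Mohammed Abouzaid,Andrew J. Blumberg,Martin Hairer,Joe Kileel,Tamara G. Kolda,Paul D. Nelson,Daniel Spielman,Nikhil Srivastava,Rachel Ward,Shmuel Weinberger,Lauren Williams
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Algebraic Geometry (math.AG); Combinatorics (math.CO); Geometric Topology (math.GT); History and Overview (math.HO); Rings and Algebras (math.RA)
Comments: 9 pages, including the statements of the ten questions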
Abstract:To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.
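[AI-93] Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs
[Quick read]: This paper addresses the attention bottleneck that a growing key-value cache creates in long-context inference, and in particular the inefficiency of existing sparse attention methods when attention distributions differ across heads and layers. The proposed Double-P is a hierarchical top-p sparse attention framework. It first performs a coarse-grained top-p estimate at the cluster level, using size-weighted centroids to estimate attention mass cheaply, and then applies a second top-p stage that adaptively allocates token-level attention only where it is needed. With near-zero accuracy drop, it reduces attention computation overhead by up to 1.8x and delivers up to 1.3x end-to-end decoding speedup.
Link: https://arxiv.org/abs/2602.05191
Authors: Wentao Ni,Kangqi Zhang,Zhongming Yu,Oren Nelson,Mingu Lee,Hong Cai,Fatih Porikli,Jongryool Kim,Zhijian Liu,Jishen Zhao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: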
Abstract:As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reduces attention computation overhead by up to 1.8x, and delivers up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
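The contrast the abstract draws between fixed-budget top-k and top-p selection can be sketched in a few lines. The example below shows plain single-stage top-p selection over softmax attention weights, keeping the smallest set of tokens whose cumulative mass reaches p. It is a generic illustration, not Double-P itself: the cluster-level stage with size-weighted centroids is not modeled, and the function name `top_p_select` and the score values are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p):
    """Return (kept token indices, retained attention mass) for the
    smallest prefix of tokens, by descending weight, with mass >= p."""
    weights = softmax(scores)
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += weights[index]
        if mass >= p:
            break
    return sorted(kept), mass


# A peaked head needs one token; a flat head needs all four for the same p.
peaked_kept, peaked_mass = top_p_select([5.0, 0.0, 0.0, 0.0], p=0.9)
flat_kept, flat_mass = top_p_select([0.0, 0.0, 0.0, 0.0], p=0.9)
```

A fixed top-k budget would keep the same number of tokens for both heads, either wasting computation on the peaked head or dropping attention mass on the flat one; top-p adapts the budget to each distribution.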
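[AI-94] Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
[Quick read]: This paper addresses the difficulty of interpreting how agent behavior evolves during large-scale reinforcement learning training, especially how to understand and analyze the fine-grained behavior patterns that LLMs develop in complex multi-agent environments such as Full-Press Diplomacy. The key to its solution is a framework that combines pretrained Sparse Autoencoders (SAEs) with LLM-summarizer methods, together with Meta-Autointerp, a method that groups SAE features into interpretable hypotheses about training dynamics. Validated through automated evaluation and user studies, the framework identifies behaviors including role-playing, language switching, and reward hacking, finds that a subset of SAE-derived hypotheses are predictive for downstream tasks, and improves an untrained agent's score by +14.2% by augmenting its system prompt. It thus offers a practical starting point for data-centric interpretability work aimed at ensuring trustworthy LLM behavior throughout training.
Link: https://arxiv.org/abs/2602.05183
Authors: John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Authors 1, 2, and 3 contributed equally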
Abstract:Large language models (LLMs) are increasingly trained in complex, multi-agent reinforcement learning environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs, alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, along with most LLM-generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses are predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent’s system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM-summarizer provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
[AI-95] Benchmarking Artificial Intelligence Models for Daily Coastal Hypoxia Forecasting
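[Quick read]: This paper addresses persistent hypoxia in the northern Gulf of Mexico, where traditional seasonal models are too coarse for the daily forecasts that ecosystem management needs. The key to its solution is building and comparing four deep learning architectures (BiLSTM, Medformer, ST-Transformer, and TCN) for daily hypoxia classification, incorporating water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates. The Spatio-Temporal Transformer (ST-Transformer) performs best (AUC-ROC: 0.982-0.992), and the work provides a reproducible framework for operational real-time hypoxia prediction in support of ocean and environmental modeling and ecosystem resilience.
Link: https://arxiv.org/abs/2602.05178
Authors: Magesh Rajasekaran, Md Saiful Sajol, Chris Alvin, Supratik Mukhopadhyay, Yanda Ou, Z. George Xue
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This is a preprint accepted at IEEE Big Data 2025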
Abstract:Coastal hypoxia, especially in the northern part of the Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine-scale variability needed for daily, responsive ecosystem management. We present a study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short-Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio-Temporal Transformer (ST-Transformer), and Temporal Convolutional Network (TCN). We trained our models on twelve years of daily hindcast data (2009-2020) from a coupled hydrodynamic-biogeochemical model. Similarly, we use hindcast data from 2020 through 2024 as test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates. We evaluated each architecture using the same data preprocessing, input/output formulation, and validation protocols. Each model achieved high classification accuracy and strong discriminative ability, with ST-Transformer achieving the highest performance across all metrics and test periods (AUC-ROC: 0.982-0.992). We also employed McNemar’s method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real-time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available at this https URL
[AI-96] Position: Capability Control Should be a Separate Goal From Alignment
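[Quick read]: This paper addresses the misuse and failure risks that arise as foundation models are widely deployed. The core challenge is how to effectively bound a model's potential behavior rather than merely align it with human preferences. The key to its solution is to treat capability control as a goal distinct from alignment, implemented through three layers across the model lifecycle: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. The authors stress that each layer has inherent weaknesses when used alone, and therefore advocate a defense-in-depth strategy that composes complementary controls across layers to improve overall safety.
Link: https://arxiv.org/abs/2602.05164
Authors: Shoaib Ahmed Siddiqui, Eleni Triantafillou, David Krueger, Adrian Weller
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: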
Abstract:Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control – imposing restrictions on permissible model behavior – should be treated as a distinct goal from alignment. While alignment is often context and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.
[AI-97] CoSA: Compressed Sensing-Based Adaptation of Large Language Models
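[Quick read]: This paper addresses the limited expressivity that the low-rank assumption imposes on Parameter-Efficient Fine-Tuning (PEFT) methods, particularly in task-specific adaptation where the singular values of the weight update are distributed relatively uniformly and methods such as LoRA and PiSSA struggle to capture complex patterns. The key to its solution is CoSA (Compressed Sensing-Based Adaptation), a new method grounded in compressed sensing theory: it maps weight updates into a low-dimensional space through fixed random projection matrices and encodes and reconstructs them with a compact learnable core, achieving efficient and expressive adaptation without constraining updates to a low-rank subspace.
Link: https://arxiv.org/abs/2602.05148
Authors: Songtao Wei, Yi Li, Bohan Zhang, Zhichun Guo, Ying Huang, Yuede Ji, Miao Yin, Guanpeng Li, Bingzhe Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: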
Abstract:Parameter-Efficient Fine-Tuning (PEFT) has emerged as a practical paradigm for adapting large language models (LLMs) without updating all parameters. Most existing approaches, such as LoRA and PiSSA, rely on low-rank decompositions of weight updates. However, the low-rank assumption may restrict expressivity, particularly in task-specific adaptation scenarios where singular values are distributed relatively uniformly. To address this limitation, we propose CoSA (Compressed Sensing-Based Adaptation), a new PEFT method extended from compressed sensing theory. Instead of constraining weight updates to a low-rank subspace, CoSA expresses them through fixed random projection matrices and a compact learnable core. We provide a formal theoretical analysis of CoSA as a synthesis process, proving that weight updates can be compactly encoded into a low-dimensional space and mapped back through random projections. Extensive experimental results show that CoSA provides a principled perspective for efficient and expressive multi-scale model adaptation. Specifically, we evaluate CoSA on 10 diverse tasks, including natural language understanding and generation, employing 5 models of different scales from RoBERTa, Llama, and Qwen families. Across these settings, CoSA consistently matches or outperforms state-of-the-art PEFT methods.
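One way to picture "fixed random projection matrices and a compact learnable core" is the sketch below. The abstract does not give the exact parameterization, so the form `delta_W = P @ C @ Q`, the shapes, the scaling, and the zero initialization of the core are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, a, b = 8, 6, 3, 2   # hypothetical layer and core sizes

W = rng.standard_normal((d_out, d_in))             # frozen pretrained weight
P = rng.standard_normal((d_out, a)) / np.sqrt(a)   # fixed random projection (not trained)
Q = rng.standard_normal((b, d_in)) / np.sqrt(b)    # fixed random projection (not trained)
C = np.zeros((a, b))                               # compact core: the only trainable part

def adapted_forward(x: np.ndarray, core: np.ndarray) -> np.ndarray:
    """Apply the frozen weight plus the update synthesized from the core."""
    return W @ x + P @ (core @ (Q @ x))

x = rng.standard_normal(d_in)
base_out = W @ x
zero_core_out = adapted_forward(x, C)

trainable_params = C.size      # parameters updated during fine-tuning
full_update_params = W.size    # parameters a dense update would need
```

Under these assumptions only the `a * b` entries of the core are trained, and a zero-initialized core leaves the pretrained layer's output unchanged at the start of fine-tuning.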
[AI-98] Cross-talk based multi-task learning for fault classification of physically coupled machine system
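[Quick read]: This paper addresses the difficulty of accurately classifying faults in machine systems, where physical variables and fault signals are coupled. Existing methods typically rely only on fault labels and ignore the information about other physical variables embedded in the signal. The key to its solution is a multi-task learning (MTL) framework with a cross-talk structure, which exchanges information between tasks in a controlled way to sharpen the model's discrimination of fault features while avoiding the negative transfer that can occur in shared-trunk architectures. The study extends the residual neural dimension reductor model to two representative settings, a drone fault dataset and a motor compound fault dataset. The results show that the method consistently outperforms single-task models, multi-class models that merge label combinations, and shared-trunk multi-task models, with both single-channel and multi-channel input.
Link: https://arxiv.org/abs/2602.05146
Authors: Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submitted to the 32nd International Congress on Sound and Vibration (ICSV32)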
Abstract:Machine systems inherently generate signals in which fault conditions and various physical variables are physically coupled. Although many existing fault classification studies rely solely on direct fault labels, these signals naturally embed additional information shaped by other physically coupled variables. Herein, we leverage this coupling through a multi-task learning (MTL) framework that jointly learns fault conditions and the related physical variables. Among MTL architectures, cross-talk structures have distinct advantages because they allow for controlled information exchange between tasks through the cross-talk layer while preventing negative transfer, in contrast to shared trunk architectures that often mix incompatible features. We build on our previously introduced residual neural dimension reductor model, and extend its application to two benchmarks where physical coupling is prominent. The first benchmark is a drone fault dataset, in which machine type and maneuvering direction significantly alter the frequency components of measured signals even under the same nominal condition. By learning fault classification together with these physical attributes, the cross-talk architecture can better classify faults. The second benchmark is the motor compound fault dataset. In this system, the fault components (inner race fault, outer race fault, misalignment, and unbalance) are coupled to one another. For the motor compound fault dataset, we also test classification performance with single-channel and multi-channel data as input to the classifier. Across both benchmarks, our residual neural dimension reductor consistently outperformed single-task models, multi-class models that merge all label combinations, and shared trunk multi-task models.
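[AI-99] TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
[Quick read]: This paper addresses the difficulty of realizing stable gains from speculative decoding in LLM inference under evolving workloads and system-level constraints. Its core solution is TIDE (Temporal Incremental Draft Engine), a framework natively integrated into the serving engine. The key is to reuse target-model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model; to use adaptive runtime control that activates speculation and training only when beneficial; and to exploit heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, this achieves up to 1.15x throughput improvement while reducing draft training time by 1.67x.
Link: https://arxiv.org/abs/2602.05145
Authors: Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: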
Abstract:Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
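[AI-100] HugRAG: Hierarchical Causal Knowledge Graph Design for RAG
[Quick read]: This paper addresses two core problems in current graph-based Retrieval Augmented Generation (RAG). First, methods over-rely on surface-level node matching and lack explicit causal modeling, which leads to unfaithful or spurious answers. Second, modular graph structures cause information isolation, which limits cross-module causal reasoning and scalability. The key to its solution is HugRAG, a framework that introduces causal gating across hierarchical modules, explicitly modeling causal relationships to suppress spurious correlations while supporting scalable reasoning over large-scale knowledge graphs, yielding a structured, scalable, and causally grounded RAG system.
Link: https://arxiv.org/abs/2602.05143
Authors: Nengbo Wang, Tuo Liang, Vikash Singh, Chaoda Song, Van Yang, Yu Yin, Jing Ma, Jagdip Singh, Vipin Chaudhary
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: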
Abstract:Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graph-based RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.
[AI-101] CAST-CKT: Chaos-Aware Spatio-Temporal and Cross-City Knowledge Transfer for Traffic Flow Prediction
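[Quick read]: This paper addresses cross-city traffic prediction in data-scarce settings, where complex nonlinear dynamics and domain shift keep existing methods from capturing the inherently chaotic nature of traffic flow and thus limit few-shot learning. The key to its solution is the CAST-CKT framework, which introduces a chaos-aware mechanism: a chaotic analyser quantifies traffic predictability regimes and drives three key modules, namely chaos-aware attention for regime-adaptive temporal modelling, adaptive topology learning for dynamic spatial dependencies, and chaotic consistency-based cross-city alignment for knowledge transfer. The framework also provides horizon-specific predictions with uncertainty quantification, and theoretical analysis shows improved generalisation bounds.
Link: https://arxiv.org/abs/2602.05133
Authors: Abdul Joseph Fofanah, Lian Wen, David Chen, Alpha Alimamy Kamara, Zhongyi Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: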
Abstract:Traffic prediction in data-scarce, cross-city settings is challenging due to complex nonlinear dynamics and domain shifts. Existing methods often fail to capture traffic’s inherent chaotic nature for effective few-shot learning. We propose CAST-CKT, a novel Chaos-Aware Spatio-Temporal and Cross-City Knowledge Transfer framework. It employs an efficient chaotic analyser to quantify traffic predictability regimes, driving several key innovations: chaos-aware attention for regime-adaptive temporal modelling; adaptive topology learning for dynamic spatial dependencies; and chaotic consistency-based cross-city alignment for knowledge transfer. The framework also provides horizon-specific predictions with uncertainty quantification. Theoretical analysis shows improved generalisation bounds. Extensive experiments on four benchmarks in cross-city few-shot settings show CAST-CKT outperforms state-of-the-art methods by significant margins in MAE and RMSE, while offering interpretable regime analysis. Code is available at this https URL.
[AI-102] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
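[Quick read]: This paper addresses key problems with rubrics used to guide LLM judges in subjective preference judgment and as reward signals in reinforcement fine-tuning (RFT): rubrics often lack coverage, conflate dimensions, misalign preference direction, and contain redundant or highly correlated criteria, which degrades judge accuracy and reward quality. The key to its solution is RRD, a structured framework built on a recursive decompose-filter cycle. It recursively decomposes coarse rubrics into fine-grained, discriminative criteria to expand coverage and sharpen separation between responses; a complementary filtering mechanism removes misaligned and redundant items; and a correlation-aware weighting scheme keeps highly correlated criteria from dominating. The result is a rubric set that is informative, comprehensive, and non-redundant, improving evaluation accuracy and training stability.
Link: https://arxiv.org/abs/2602.05125
Authors: William F. Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, Ilias Leontiadis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: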
Abstract:Recently, rubrics have been used to guide LLM judges in capturing subjective, nuanced, multi-dimensional human preferences, and have been extended from evaluation to reward signals for reinforcement fine-tuning (RFT). However, rubric generation remains hard to control: rubrics often lack coverage, conflate dimensions, misalign preference direction, and contain redundant or highly correlated criteria, degrading judge accuracy and producing suboptimal rewards during RFT. We propose RRD, a principled framework for rubric refinement built on a recursive decompose-filter cycle. RRD decomposes coarse rubrics into fine-grained, discriminative criteria, expanding coverage while sharpening separation between responses. A complementary filtering mechanism removes misaligned and redundant rubrics, and a correlation-aware weighting scheme prevents over-representing highly correlated criteria, yielding rubric sets that are informative, comprehensive, and non-redundant. Empirically, RRD delivers large, consistent gains across both evaluation and training: it improves preference-judgment accuracy on JudgeBench and PPE for both GPT-4o and Llama3.1-405B judges, achieving top performance in all settings with up to +17.7 points on JudgeBench. When used as the reward source for RFT on WildChat, it yields substantially stronger and more stable learning signals, boosting reward by up to 160% (Qwen3-4B) and 60% (Llama3.1-8B) versus 10-20% for prior rubric baselines, with gains that transfer to HealthBench-Hard and BiGGen Bench. Overall, RRD establishes recursive rubric refinement as a scalable and interpretable foundation for LLM judging and reward modeling in open-ended domains.
[AI-103] Democratic Preference Alignment via Sortition-Weighted RLHF
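[Quick read]: This paper addresses representational bias in the training signal of preference-based alignment methods such as reinforcement learning from human feedback (RLHF): rater pools are typically convenience samples, so some groups are systematically over-represented and others under-represented. The key to its solution is Democratic Preference Optimization (DemPO), which uses algorithmic sortition to construct demographically representative rater panels so that the values a model learns better reflect those of the broader public. DemPO offers two training schemes: Hard Panel trains only on data from a quota-satisfying mini-public drawn by sortition, while Soft Panel retains all data but reweights each rater by their inclusion probability. The authors prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Experiments show that Hard Panel performs best and Soft Panel outperforms the unweighted baseline, with effects growing as model size increases. This indicates that enforcing demographic representativeness at the preference collection stage, rather than correcting post hoc, yields model behavior that better reflects the values of representative publics.
Link: https://arxiv.org/abs/2602.05113
Authors: Suvadip Sana, Jinzhou Wu, Martin T. Wells
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 16 pages, 5 figures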
Abstract:Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over-represent some demographics and under-represent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than post hoc correction, yields models whose behavior better reflects values elicited from representative publics.
[AI-104] Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
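[Quick read]: This paper addresses the reliability and bias of Large Language Models (LLMs) used as evaluators of reasoning quality in payments-risk settings, in particular self-evaluation bias and limited cross-model consistency in Merchant Category Code (MCC) risk assessment. The key to its solution is a structured multi-evaluator framework that combines a five-criterion rubric with Monte-Carlo scoring to quantify rationale quality and evaluator stability, and introduces a consensus-deviation metric that removes circularity from the judging, giving a theoretically grounded measure of self-evaluation and cross-model deviation. Ground-truth validation on payment-network data shows that four models align significantly with it (Spearman rho = 0.56 to 0.77). The framework also reveals systematic biases across models (negative for GPT-5.1 and Claude 4.5 Sonnet, positive for Gemini-2.5 Pro and Grok 4) and finds that anonymization reduces bias by 25.8%, providing a replicable, bias-aware basis for deploying LLMs as evaluators in financial risk settings.
Link: https://arxiv.org/abs/2602.05110
Authors: Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: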
Abstract:Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge’s score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
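The leave-one-out idea behind the consensus-deviation metric (comparing each judge's score to the mean of all other judges) can be sketched as follows. The function name, judge labels, and scores are hypothetical, and the paper's full metric may involve aggregation steps that the abstract does not describe.

```python
from statistics import mean

def consensus_deviation(scores: dict[str, float]) -> dict[str, float]:
    """For each judge, return its score minus the mean score of all other judges."""
    deviations = {}
    for judge, score in scores.items():
        # Exclude the judge's own score so the reference does not depend on it.
        others = [s for name, s in scores.items() if name != judge]
        deviations[judge] = score - mean(others)
    return deviations

# Hypothetical scores that three judges gave to the same rationale.
scores = {"judge_a": 4.0, "judge_b": 3.0, "judge_c": 5.0}
deviations = consensus_deviation(scores)
```

Here `judge_a` sits exactly at the consensus of the other two, while `judge_b` and `judge_c` fall 1.5 points below and above their respective leave-one-out means.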
[AI-105] GAMMS: Graph based Adversarial Multiagent Modeling Simulator
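[Quick read]: This paper addresses the tension between scalability and accessibility in simulation tools for Multi-Agent Systems (MAS) research: existing high-fidelity simulators are powerful but computationally expensive and ill-suited to rapid prototyping or large-scale deployment. The key to its solution is GAMMS (Graph based Adversarial Multiagent Modeling Simulator), a lightweight yet extensible graph-based simulation framework built around five core objectives: scalability, ease of use, integration-first architecture, fast visualization feedback, and real-world grounding. It efficiently simulates complex domains such as urban road networks, is agnostic to policy type (heuristic, optimization-based, and learning-based agents, including those using large language models), and integrates with external tools such as machine learning libraries and planning solvers, lowering the barrier to entry and enabling high-performance simulation on standard hardware.
Link: https://arxiv.org/abs/2602.05105
Authors: Rohan Patil, Jai Malegaonkar, Xiao Jiang, Andre Dion, Gaurav S. Sukhatme, Henrik I. Christensen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
Comments: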
Abstract:As intelligent systems and multi-agent coordination become increasingly central to real-world applications, there is a growing need for simulation tools that are both scalable and accessible. Existing high-fidelity simulators, while powerful, are often computationally expensive and ill-suited for rapid prototyping or large-scale agent deployments. We present GAMMS (Graph based Adversarial Multiagent Modeling Simulator), a lightweight yet extensible simulation framework designed to support fast development and evaluation of agent behavior in environments that can be represented as graphs. GAMMS emphasizes five core objectives: scalability, ease of use, integration-first architecture, fast visualization feedback, and real-world grounding. It enables efficient simulation of complex domains such as urban road networks and communication systems, supports integration with external tools (e.g., machine learning libraries, planning solvers), and provides built-in visualization with minimal configuration. GAMMS is agnostic to policy type, supporting heuristic, optimization-based, and learning-based agents, including those using large language models. By lowering the barrier to entry for researchers and enabling high-performance simulations on standard hardware, GAMMS facilitates experimentation and innovation in multi-agent systems, autonomous planning, and adversarial modeling. The framework is open-source and available at this https URL
[AI-106] Evaluating Robustness and Adaptability in Learning-Based Mission Planning for Active Debris Removal
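[Quick read]: This paper addresses autonomous planning of multi-target rendezvous in Active Debris Removal (ADR) missions, where the core challenge is producing efficient, robust, and feasible plans under strict constraints on fuel and mission duration. The key to its approach is a comparison of three planners: a Masked Proximal Policy Optimization (Masked PPO) policy trained under fixed parameters, a domain-randomized Masked PPO policy trained for robustness, and a Monte Carlo Tree Search (MCTS) baseline. The study finds that domain-randomized Masked PPO adapts markedly better to changed conditions while retaining most of its nominal performance, whereas MCTS handles constraint changes best but at far higher computational cost. Future work should therefore combine training-time diversity with online planning to build ADR mission planners that are both efficient and robust.
Link: https://arxiv.org/abs/2602.05091
Authors: Agni Bandyopadhyay, Günther Waxenegger-Wilfing
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Space Physics (physics.space-ph)
Comments: Presented at the International Conference on Space Robotics (ISPARO, 2025), Sendai, Japan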
Abstract:Autonomous mission planning for Active Debris Removal (ADR) must balance efficiency, adaptability, and strict feasibility constraints on fuel and mission duration. This work compares three planners for the constrained multi-debris rendezvous problem in Low Earth Orbit: a nominal Masked Proximal Policy Optimization (PPO) policy trained under fixed mission parameters, a domain-randomized Masked PPO policy trained across varying mission constraints for improved robustness, and a plain Monte Carlo Tree Search (MCTS) baseline. Evaluations are conducted in a high-fidelity orbital simulation with refueling, realistic transfer dynamics, and randomized debris fields across 300 test cases in nominal, reduced fuel, and reduced mission time scenarios. Results show that nominal PPO achieves top performance when conditions match training but degrades sharply under distributional shift, while domain-randomized PPO exhibits improved adaptability with only moderate loss in nominal performance. MCTS consistently handles constraint changes best due to online replanning but incurs orders-of-magnitude higher computation time. The findings underline a trade-off between the speed of learned policies and the adaptability of search-based methods, and suggest that combining training-time diversity with online planning could be a promising path for future resilient ADR mission planners.
[AI-107] VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
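[Quick read]: This paper addresses the safety of generative AI in mental health applications, especially in suicide risk detection and response. Although chatbots built on large language models (LLMs) are widely used for psychological support, evidence for their safety is still lacking. The study therefore validates the Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation, an automated safety benchmark anchored in clinician ratings. Conversations between simulated users and chatbots are rated independently by licensed mental health clinicians for safe and unsafe chatbot behavior and for user-agent realism, and an LLM judge applies the same rubric for comparison. Clinicians were highly consistent with one another (chance-corrected inter-rater reliability, IRR = 0.77), and the LLM judge aligned strongly with the clinical consensus (IRR = 0.81), supporting the clinical validity and reliability of VERA-MH as a quantifiable, automated safety evaluation for AI mental health tools.
Link: https://arxiv.org/abs/2602.05088
Authors: Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, Matt Hawrilenko
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: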
Abstract:Millions now use leading generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based automated safety benchmark. This study aimed to examine the clinical validity and reliability of the VERA-MH evaluation for AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then compared rating alignment across (a) individual clinicians and (b) clinician consensus and the LLM judge, and (c) examined clinicians’ ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR]: 0.77), thus establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus (IRR: 0.81) overall and within key conditions. Clinician raters generally perceived the user-agents to be realistic. For the potential mental health benefits of AI chatbots to be realized, attention to safety is paramount. Findings from this human evaluation study support the clinical validity and reliability of VERA-MH: an open-source, fully automated AI safety evaluation for mental health. Further research will address VERA-MH generalizability and robustness.
[AI-108] Autodiscover: A reinforcement learning recommendation system for the cold-start imbalance challenge in active learning powered by graph-aware thompson sampling
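[Quick read]: This paper addresses the growing inefficiency of manual screening in Systematic Literature Reviews (SLRs) as scientific output increases, especially where relevant studies have low prevalence and expert decisions are scarce and costly. Traditional Active Learning (AL) methods use fixed query strategies and struggle to adapt to the changing structure of literature networks. The key to its solution is AutoDiscover, a framework that models AL as an online decision-making problem driven by an adaptive agent. Literature is represented as a heterogeneous graph capturing relationships among documents, authors, and metadata, and a Heterogeneous Graph Attention Network (HAN) learns node representations. A Discounted Thompson Sampling (DTS) agent then dynamically manages a portfolio of query strategies, balancing exploration and exploitation with real-time human-in-the-loop labels under non-stationary review dynamics. The method improves screening efficiency and, by bootstrapping discovery during cold start, works from minimal initial labels where static AL approaches fail.
Link: https://arxiv.org/abs/2602.05087
Authors: Parsa Vares
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Comments: Master’s thesis, University of Luxembourg, in collaboration with the Luxembourg Institute of Science and Technology (LIST). Supervised by Prof. Jun Pang and Dr. Eloi Durant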
Abstract:Systematic literature reviews (SLRs) are fundamental to evidence-based research, but manual screening is an increasing bottleneck as scientific output grows. Screening features low prevalence of relevant studies and scarce, costly expert decisions. Traditional active learning (AL) systems help, yet typically rely on fixed query strategies for selecting the next unlabeled documents. These static strategies do not adapt over time and ignore the relational structure of scientific literature networks. This thesis introduces AutoDiscover, a framework that reframes AL as an online decision-making problem driven by an adaptive agent. Literature is modeled as a heterogeneous graph capturing relationships among documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) learns node representations, which a Discounted Thompson Sampling (DTS) agent uses to dynamically manage a portfolio of query strategies. With real-time human-in-the-loop labels, the agent balances exploration and exploitation under non-stationary review dynamics, where strategy utility changes over time. On the 26-dataset SYNERGY benchmark, AutoDiscover achieves higher screening efficiency than static AL baselines. Crucially, the agent mitigates cold start by bootstrapping discovery from minimal initial labels where static approaches fail. We also introduce TS-Insight, an open-source visual analytics dashboard to interpret, verify, and diagnose the agent’s decisions. Together, these contributions accelerate SLR screening under scarce expert labels and low prevalence of relevant studies.
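The role of discounting in a Thompson sampling agent that chooses among query strategies can be sketched as below. This is a generic Beta-Bernoulli variant with exponential forgetting, not the thesis's exact DTS agent; the class name, the Beta(1, 1) prior, the discount value, and the decision to decay every arm on each update are assumptions.

```python
import random

class DiscountedThompsonSampler:
    """Beta-Bernoulli Thompson sampling with exponentially discounted counts."""

    def __init__(self, n_arms: int, discount: float = 0.9, seed: int = 0):
        self.discount = discount
        self.successes = [0.0] * n_arms   # discounted reward mass per arm
        self.failures = [0.0] * n_arms    # discounted non-reward mass per arm
        self.rng = random.Random(seed)

    def select(self) -> int:
        # Sample a plausible success rate per arm from Beta(1 + s, 1 + f).
        samples = [
            self.rng.betavariate(1.0 + s, 1.0 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Decay all evidence so old observations fade under non-stationarity.
        self.successes = [self.discount * s for s in self.successes]
        self.failures = [self.discount * f for f in self.failures]
        self.successes[arm] += reward
        self.failures[arm] += 1.0 - reward

# Each arm stands for one hypothetical query strategy.
sampler = DiscountedThompsonSampler(n_arms=2, discount=0.5)
sampler.update(0, 1.0)   # strategy 0 surfaced a relevant document
sampler.update(0, 1.0)   # and did so again
chosen = sampler.select()
```

Because older evidence decays, a strategy that was useful early in a review but stops surfacing relevant documents gradually loses its advantage, which is the behavior needed when strategy utility changes over time.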
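[AI-109] Individual Fairness In Strategic Classification
[Quick Read]: This paper addresses individual fairness in strategic classification, where individuals modify their features to influence machine learning (ML) decisions. Prior work in this setting has focused mainly on group fairness, leaving individual fairness underexplored. The authors first prove that deterministic threshold classifiers violate individual fairness, then propose randomized classifiers as a remedy and derive conditions under which a randomized classifier guarantees individual fairness. Building on these conditions, they find an optimal individually fair randomized classifier by solving a linear program, improving model performance while preserving fairness. Experiments show that the method mitigates unfairness and improves the fairness-accuracy trade-off.
Link: https://arxiv.org/abs/2602.05084
Authors: Zhiqun Zuo, Mohammad Mahdi Khalili
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: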
Abstract:Strategic classification, where individuals modify their features to influence machine learning (ML) decisions, presents critical fairness challenges. While group fairness in this setting has been widely studied, individual fairness remains underexplored. We analyze threshold-based classifiers and prove that deterministic thresholds violate individual fairness. Then, we investigate the possibility of using a randomized classifier to achieve individual fairness. We introduce conditions under which a randomized classifier ensures individual fairness and leverage these conditions to find an optimal and individually fair randomized classifier through a linear programming problem. Additionally, we demonstrate that our approach can be extended to group fairness notions. Experiments on real-world datasets confirm that our method effectively mitigates unfairness and improves the fairness-accuracy trade-off.
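One common way to see why a deterministic threshold can violate individual fairness is the Lipschitz formulation, in which similar individuals must receive similar acceptance probabilities. The sketch below illustrates only that idea. The Lipschitz constant `L`, the linear-ramp classifier, and all function names are hypothetical, strategic feature changes are ignored, and the paper's own fairness conditions and linear program are not reproduced.

```python
def threshold_prob(x, theta=0.5):
    """Deterministic threshold: acceptance probability is either 0 or 1."""
    return 1.0 if x >= theta else 0.0

def ramp_prob(x, theta=0.5, L=2.0):
    """Randomized classifier: acceptance probability rises linearly
    with slope L around theta and is clipped to [0, 1]."""
    return min(1.0, max(0.0, 0.5 + L * (x - theta)))

def is_lipschitz(prob, points, L, tol=1e-9):
    """Check |prob(a) - prob(b)| <= L * |a - b| for every pair of points."""
    return all(abs(prob(a) - prob(b)) <= L * abs(a - b) + tol
               for a in points for b in points)

grid = [i / 200 for i in range(201)]               # feature values in [0, 1]
print(is_lipschitz(threshold_prob, grid, L=2.0))   # False
print(is_lipschitz(ramp_prob, grid, L=2.0))        # True
```

The threshold fails because two individuals just on either side of `theta` are almost identical yet receive opposite decisions, while the ramp spreads that jump over an interval.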
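[AI-110] Reliable Explanations or Random Noise? A Reliability Metric for XAI
[Quick Read]: This paper addresses the reliability of explanations for complex machine learning models. Under realistic, non-adversarial changes such as small input perturbations, feature redundancy, and minor model updates, widely used explanation methods such as SHAP and Integrated Gradients are not stable enough, which makes their explanations unreliable and undermines trust in explainable AI (XAI) systems. The key contribution is the Explanation Reliability Index (ERI), a family of metrics built on four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. Each axiom comes with formal guarantees, including Lipschitz-type bounds and temporal stability results. The authors also propose ERI-T, a measure of temporal reliability for sequential models, and ERI-Bench, a benchmark that stress-tests explanation reliability on synthetic and real-world datasets and reveals widespread instability in popular explanation methods.
Link: https://arxiv.org/abs/2602.05082
Authors: Poushali Sengupta, Sabita Maharjan, Frank Eliassen, Shashi Raj Pandey, Yan Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: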
Abstract:In recent years, explaining decisions made by complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, namely, whether they remain stable and consistent under realistic, non-adversarial changes, remains largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well-motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under system-level conditions, including small input perturbations, correlated representations, and minor model updates. Such variability undermines explanation reliability, as reliable explanations should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom, we derive formal guarantees, including Lipschitz-type bounds and temporal stability results. We further propose ERI-T, a dedicated measure of temporal reliability for sequential models, and introduce ERI-Bench, a benchmark designed to systematically stress-test explanation reliability across synthetic and real-world datasets. Experimental results reveal widespread reliability failures in popular explanation methods, showing that explanations can be unstable under realistic deployment conditions. By exposing and quantifying these instabilities, ERI enables principled assessment of explanation reliability and supports more trustworthy explainable AI (XAI) systems.
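[AI-111] Optimizing Mission Planning for Multi-Debris Rendezvous Using Reinforcement Learning with Refueling and Adaptive Collision Avoidance
[Quick Read]: This paper addresses active debris removal (ADR) in an increasingly crowded near-Earth orbital environment, where missions face high collision risk and complex planning, and in particular how to carry out efficient and safe spacecraft rendezvous operations when removing multiple debris targets. The key idea is a reinforcement learning (RL) framework for adaptive collision avoidance based on a masked Proximal Policy Optimization (PPO) algorithm, which lets the agent adjust maneuvers dynamically in response to real-time orbital conditions. While maintaining fuel efficiency and avoiding active collision zones, the agent optimizes the rendezvous sequence over multiple debris targets together with the refueling strategy, reducing collision risk and improving overall mission efficiency.
Link: https://arxiv.org/abs/2602.05075
Authors: Agni Bandyopadhyay, Gunther Waxenegger-Wilfing
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO); Space Physics (physics.space-ph)
Comments: Accepted at Conference: 15th IAA Symposium on Small Satellites for Earth System Observation At: Berlin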
Abstract:As the orbital environment around Earth becomes increasingly crowded with debris, active debris removal (ADR) missions face significant challenges in ensuring safe operations while minimizing the risk of in-orbit collisions. This study presents a reinforcement learning (RL) based framework to enhance adaptive collision avoidance in ADR missions, specifically for multi-debris removal using small satellites. Small satellites are increasingly adopted due to their flexibility, cost effectiveness, and maneuverability, making them well suited for dynamic missions such as ADR. Building on existing work in multi-debris rendezvous, the framework integrates refueling strategies, efficient mission planning, and adaptive collision avoidance to optimize spacecraft rendezvous operations. The proposed approach employs a masked Proximal Policy Optimization (PPO) algorithm, enabling the RL agent to dynamically adjust maneuvers in response to real-time orbital conditions. Key considerations include fuel efficiency, avoidance of active collision zones, and optimization of dynamic orbital parameters. The RL agent learns to determine efficient sequences for rendezvousing with multiple debris targets, optimizing fuel usage and mission time while incorporating necessary refueling stops. Simulated ADR scenarios derived from the Iridium 33 debris dataset are used for evaluation, covering diverse orbital configurations and debris distributions to demonstrate robustness and adaptability. Results show that the proposed RL framework reduces collision risk while improving mission efficiency compared to traditional heuristic approaches. This work provides a scalable solution for planning complex multi-debris ADR missions and is applicable to other multi-target rendezvous problems in autonomous space mission planning. 
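The "masked" part of masked PPO usually refers to invalid-action masking: logits of actions that are not allowed in the current state are suppressed before the softmax, so the policy assigns them zero probability. The sketch below shows only that step, under the assumption of a hypothetical four-action example in which already-visited debris targets are masked out. The paper's actual state, action space, and masking rules are not given in the abstract.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits; entries with valid[i] == False get probability 0."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    m = max(z for z, ok in zip(logits, valid) if ok)
    exps = [math.exp(z - m) if ok else 0.0 for z, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: actions 0-2 are debris targets, action 3 is a refueling stop.
logits = [2.0, 1.0, 0.5, 0.0]
visited = [True, False, False, False]   # target 0 has already been visited
valid = [not v for v in visited]
probs = masked_softmax(logits, valid)   # probs[0] is exactly 0.0
```

Masking at the logit level keeps the remaining probabilities normalized, so the agent never samples an infeasible target and no reward penalty is needed to discourage it.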
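[AI-112] Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents
[Quick Read]: This paper addresses the lack of theory and practice for uncertainty quantification (UQ) of large language models (LLMs) deployed as interactive agents. Most current UQ research centers on single-turn question answering and does not carry over to complex, dynamic, open-world agent applications, which leaves existing methods with a fundamental limitation in how they model uncertainty. The key idea is a new perspective, the "conditional uncertainty reduction process", which treats interaction as a mechanism that can actively reduce uncertainty and therefore explicitly models the uncertainty along an agent's trajectory that actions can mitigate. This departs from the conventional view of UQ as an uncertainty accumulation process. The resulting framework offers conceptual guidance for designing UQ methods for LLM agents and discusses implications for frontier model development and domain-specific applications.
Link: https://arxiv.org/abs/2602.05073
Authors: Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: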
Abstract:Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. Yet, even as LLM agents are increasingly deployed in highly complex tasks, most UQ research still centers on single-turn question-answering. We argue that UQ research must shift to realistic settings with interactive agents, and that a new principled framework for agent UQ is needed. This paper presents the first general formulation of agent UQ that subsumes broad classes of existing UQ setups. Under this formulation, we show that prior works implicitly treat LLM UQ as an uncertainty accumulation process, a viewpoint that breaks down for interactive agents in an open world. In contrast, we propose a novel perspective, a conditional uncertainty reduction process, that explicitly models reducible uncertainty over an agent’s trajectory by highlighting “interactivity” of actions. From this perspective, we outline a conceptual framework to provide actionable guidance for designing UQ in LLM agent setups. Finally, we conclude with practical implications of the agent UQ in frontier LLM development and domain-specific applications, as well as open remaining problems.
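[AI-113] E-Globe: Scalable ε-Global Verification of Neural Networks via Tight Upper Bounds and Pattern-Aware Branching
[Quick Read]: This paper addresses the robustness concerns that hinder deployment of neural networks in safety-critical applications, as well as the scalability-completeness trade-off of existing formal verification methods. The core of the solution is a hybrid verifier in a branch-and-bound (BaB) framework. Its key innovation is an exact nonlinear program with complementarity constraints (NLP-CC) that preserves the ReLU input-output graph and is used to tighten upper bounds. Because any feasible solution is a valid counterexample, unsafe subproblems can be pruned quickly. Warm-started NLP solves and pattern-aligned strong branching further accelerate verification.
Link: https://arxiv.org/abs/2602.05068
Authors: Wenting Li, Saif R. Kazi, Russell Bent, Duo Zhou, Huan Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 10 figures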
Abstract:Neural networks achieve strong empirical performance, but robustness concerns still hinder deployment in safety-critical applications. Formal verification provides robustness guarantees, but current methods face a scalability-completeness trade-off. We propose a hybrid verifier in a branch-and-bound (BaB) framework that efficiently tightens both upper and lower bounds until an ε-global optimum is reached or early stop is triggered. The key is an exact nonlinear program with complementarity constraints (NLP-CC) for upper bounding that preserves the ReLU input-output graph, so any feasible solution yields a valid counterexample and enables rapid pruning of unsafe subproblems. We further accelerate verification with (i) warm-started NLP solves requiring minimal constraint-matrix updates and (ii) pattern-aligned strong branching that prioritizes splits most effective at tightening relaxations. We also provide conditions under which NLP-CC upper bounds are tight. Experiments on MNIST and CIFAR-10 show markedly tighter upper bounds than PGD across perturbation radii spanning up to three orders of magnitude, fast per-node solves in practice, and substantial end-to-end speedups over MIP-based verification, amplified by warm-starting, GPU batching, and pattern-aligned branching.
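For context, a ReLU activation can be written exactly as a complementarity system, which is the textbook reformulation that programs of this type build on. The block below states only that identity. The symbols x (pre-activation) and y (post-activation) are generic, and the paper's full NLP-CC, including its objective, bounds, and network constraints, is not reproduced here.

```latex
y = \max(0, x)
\quad\Longleftrightarrow\quad
y \ge 0, \qquad y - x \ge 0, \qquad y\,(y - x) = 0 .
```

Any point that satisfies these constraints corresponds to an actual forward pass through the ReLU, which is consistent with the abstract's remark that a feasible solution yields a valid counterexample rather than a relaxed one.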
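[AI-114] Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
[Quick Read]: This paper addresses indirect prompt injection (IPI) attacks against AI agents that automate critical workloads, and in particular shows that existing monitoring-based defenses, which jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions, can be bypassed. The key contribution is a new Agent-as-a-Proxy attack that uses the agent itself as the delivery mechanism, allowing the attacker to bypass the agent and the monitor at the same time. The results indicate that current monitoring-based defenses are fundamentally fragile regardless of model scale, including frontier-scale monitors such as Qwen2.5-72B.
Link: https://arxiv.org/abs/2602.05066
Authors: Jafar Isbarov, Murat Kantarcioglu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: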
Abstract:As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent’s Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and Extract-and-Evaluate monitors under diverse monitoring LLMs. Our findings suggest current monitoring-based agentic defenses are fundamentally fragile regardless of model scale.
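[AI-115] Evaluating Large Language Models on Solved and Unsolved Problems in Graph Theory: Implications for Computing Education
[Quick Read]: This paper investigates how reliably large language models (LLMs) support mathematically rigorous thinking, focusing on graph theory problems in computer science. It studies two kinds of problems: a solved problem on the gracefulness of line graphs, and an open problem with no known solution. The key element is an eight-stage evaluation protocol that mirrors authentic mathematical inquiry, including interpretation, exploration, strategy formation, and proof construction. On the solved problem, the LLM identified the relevant structures, recalled the appropriate results, and produced a proof validated by an expert. On the open problem, it offered coherent interpretations and plausible exploratory strategies but did not advance toward a solution, and it did not fabricate results. This illustrates its value as an aid for conceptual exploration and its limits in tasks requiring novel mathematical insight.
Link: https://arxiv.org/abs/2602.05059
Authors: Adithya Kulkarni, Mohna Chakraborty, Jay Bagga
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: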
Abstract:Large Language Models are increasingly used by students to explore advanced material in computer science, including graph theory. As these tools become integrated into undergraduate and graduate coursework, it is important to understand how reliably they support mathematically rigorous thinking. This study examines the performance of an LLM on two related graph-theoretic problems: a solved problem concerning the gracefulness of line graphs and an open problem for which no solution is currently known. We use an eight-stage evaluation protocol that reflects authentic mathematical inquiry, including interpretation, exploration, strategy formation, and proof construction. The model performed strongly on the solved problem, producing correct definitions, identifying relevant structures, recalling appropriate results without hallucination, and constructing a valid proof confirmed by a graph theory expert. For the open problem, the model generated coherent interpretations and plausible exploratory strategies but did not advance toward a solution. It did not fabricate results and instead acknowledged uncertainty, which is consistent with the explicit prompting instructions that directed the model to avoid inventing theorems or unsupported claims. These findings indicate that LLMs can support exploration of established material but remain limited in tasks requiring novel mathematical insight or critical structural reasoning. For computing education, this distinction highlights the importance of guiding students to use LLMs for conceptual exploration while relying on independent verification and rigorous argumentation for formal problem solving.
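[AI-116] ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation ICLR2026
[Quick Read]: This paper addresses two core challenges in offline reinforcement learning (offline RL). The first is out-of-distribution (OOD) error, where the policy takes actions outside the training data distribution and performance degrades. The second is that the optimal policy distribution can be multimodal and hard to represent with conventional parameterized policies. The key idea is ReFORM, a method built on flow policies that enforces a support constraint by construction to avoid OOD actions while retaining policy expressiveness. ReFORM first learns a behavior cloning (BC) flow policy whose source distribution is bounded so that it captures the support of the action distribution. It then optimizes a reflected flow that generates bounded noise for the BC flow, maximizing performance without changing the support. Across 40 challenging tasks from the OGBench benchmark, ReFORM outperforms all baselines on the performance profile curves while using a single set of hyperparameters for all tasks.
Link: https://arxiv.org/abs/2602.05051
Authors: Songyuan Zhang, Oswin So, H. M. Sabbir Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, Chuchu Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 24 pages, 17 figures; Accepted by the fourteenth International Conference on Learning Representations (ICLR 2026)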
Abstract:Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow while keeping the support, to maximize the performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.
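[AI-117] MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation
[Quick Read]: This paper addresses knowledge gaps in open-world human-AI joint planning that arise from incomplete information and unknowns (for example, unseen objects or human goals and intents) and that degrade planning performance. The core challenge is to design optimal interaction strategies so that the AI agent actively elicits human input to close these gaps and improve planning outcomes. The key idea is the Minimal Information Neuro-Symbolic Tree (MINT), which builds a symbolic interaction tree to reason about how knowledge gaps affect planning outcomes and consults a neural planning policy to estimate uncertainty. Self-play with MINT optimizes the agent's elicitation strategy, and a large language model (LLM) searches and summarizes MINT's reasoning process to curate a set of queries for eliciting human input, achieving near-expert planning returns with a limited number of questions.
Link: https://arxiv.org/abs/2602.05048
Authors: Zeyu Fang, Tian Lan, Mahdi Imani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: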
Abstract:Joint planning through language-based interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents – thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self-play with MINT to optimize the AI agent’s elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage LLM to search and summarize MINT’s reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT-based planning attains near-expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.
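[AI-118] Quality Model for Machine Learning Components
[Quick Read]: This paper addresses failures in integration, deployment, and operations that occur when machine learning (ML) models move from prototype to production with too narrow a testing scope. Current testing usually covers only model properties such as model performance and ignores system-derived requirements such as throughput, resource consumption, and robustness, so models end up violating real system constraints. The key contribution is a quality model designed specifically for ML components. It separates system attributes from ML component attributes, giving model developers a clear framework for requirements elicitation and negotiation, and it establishes a common vocabulary between developers and system stakeholders so that testing focuses on system-derived requirements. The quality model was validated for relevance and value through a survey and has been integrated into an open-source tool for ML component testing and evaluation.
Link: https://arxiv.org/abs/2602.05043
Authors: Grace A. Lewis, Rachel Brower-Sinning, Robert Edman, Ipek Ozkaya, Sebastián Echeverría, Alex Derr, Collin Beaudoin, Katherine R. Maffey
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: A short version of this paper has been accepted to CAIN 2026, the 5th IEEE/ACM Conference on AI Engineering - Software Engineering for AI Systems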
Abstract:Despite increased adoption and advances in machine learning (ML), there are studies showing that many ML prototypes do not reach the production stage and that testing is still largely limited to testing model properties, such as model performance, without considering requirements derived from the system it will be a part of, such as throughput, resource consumption, or robustness. This limited view of testing leads to failures in model integration, deployment, and operations. In traditional software development, quality models such as ISO 25010 provide a widely used structured framework to assess software quality, define quality requirements, and provide a common language for communication with stakeholders. A newer standard, ISO 25059, defines a more specific quality model for AI systems. However, a problem with this standard is that it combines system attributes with ML component attributes, which is not helpful for a model developer, as many system attributes cannot be assessed at the component level. In this paper, we present a quality model for ML components that serves as a guide for requirements elicitation and negotiation and provides a common vocabulary for ML component developers and system stakeholders to agree on and define system-derived requirements and focus their testing efforts accordingly. The quality model was validated through a survey in which the participants agreed with its relevance and value. The quality model has been successfully integrated into an open-source tool for ML component testing and evaluation demonstrating its practical application.
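[AI-119] AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders
[Quick Read]: This paper addresses the limited interpretability of neural representations in the audio domain, in particular the lack of effective tools for analyzing the internal representations of speech models such as Whisper and HuBERT. The key idea is to train sparse autoencoders (SAEs) on every encoder layer to extract sparse features, and to systematically evaluate those features for stability, interpretability, and practical utility. More than 50% of SAE features remain consistent across random seeds, and the features disentangle general acoustic information, semantic content, and specific events such as environmental noises and paralinguistic sounds, with only 19-27% of features needing removal to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with a negligible increase in word error rate (WER), which demonstrates practical value. SAE features also correlate with human EEG activity during speech perception, indicating alignment with human neural processing.
Link: https://arxiv.org/abs/2602.05027
Authors: Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: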
Abstract:Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper’s false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at this https URL.
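[AI-120] Laws of Learning Dynamics and the Core of Learners
[Quick Read]: This paper formulates fundamental laws governing learning dynamics, namely a conservation law and the decrease of total entropy, and within this framework introduces an entropy-based lifelong ensemble learning method. The method is evaluated by constructing an immunization mechanism that defends against transfer-based adversarial attacks on the CIFAR-10 dataset. Compared with a naive ensemble that simply averages models specialized on clean and adversarial samples, the resulting logifold achieves higher accuracy in most test cases, with particularly large gains under strong perturbations.
Link: https://arxiv.org/abs/2602.05026
Authors: Inkee Jung, Siu Cheong Lau
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 14 pages, 5 figures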
Abstract:We formulate the fundamental laws governing learning dynamics, namely the conservation law and the decrease of total entropy. Within this framework, we introduce an entropy-based lifelong ensemble learning method. We evaluate its effectiveness by constructing an immunization mechanism to defend against transfer-based adversarial attacks on the CIFAR-10 dataset. Compared with a naive ensemble formed by simply averaging models specialized on clean and adversarial samples, the resulting logifold achieves higher accuracy in most test cases, with particularly large gains under strong perturbations.
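[AI-121] Do Vision-Language Models Respect Contextual Integrity in Location Disclosure? ICLR2026
[Quick Read]: This paper addresses the privacy risk posed by vision-language models (VLMs) in image geolocation: without the sharer's consent, a model can infer locations at street-level precision and thereby threaten personal privacy. Existing mitigations apply a blanket restriction on geolocation output, but they cannot distinguish legitimate uses from malicious ones, which limits utility. The key proposal is that VLMs should reason about image content in context, recognizing social norms and contextual cues in the image (such as sensitive places or people's activities), to decide the appropriate granularity of location disclosure and so balance privacy and utility. To this end, the authors build the VLM-GEOPRIVACY benchmark, which evaluates whether models correctly interpret latent social norms and make decisions consistent with human privacy expectations. Experiments show that leading VLMs are poorly aligned with those expectations.
Link: https://arxiv.org/abs/2602.05023
Authors: Ruixin Yang, Ethan Mendes, Arthur Wang, James Hays, Sauvik Das, Wei Xu, Alan Ritter
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by ICLR 2026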
Abstract:Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk, as these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially surpassing the level of detail the sharer consented or intended to disclose. While recent work has proposed applying a blanket restriction on geolocation disclosure to combat this risk, these measures fail to distinguish valid geolocation uses from malicious behavior. Instead, VLMs should maintain contextual integrity by reasoning about elements within an image to determine the appropriate level of information disclosure, balancing privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to precisely geolocate images, the models are poorly aligned with human privacy expectations. They often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles in multimodal systems to incorporate context-conditioned privacy reasoning.
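[AI-122] From Fragmentation to Integration: Exploring the Design Space of AI Agents for Human-as-the-Unit Privacy Management
[Quick Read]: This paper addresses the privacy challenges users face in managing digital footprints across platforms, which are made especially hard by the diversity of applications, changing temporal contexts, and complex interpersonal relationships. Existing systems lack unified, context-aware privacy controls, so users fall back on ad hoc manual strategies that do not provide comprehensive control. The key idea is to adopt a "human-as-the-unit" perspective and design AI agents with decision-making autonomy, focusing in particular on automated, comprehensive post-sharing management that helps users remediate their digital footprints. The study finds that users express greater trust in AI accuracy than in their own efforts, suggesting that AI agents can bridge the fragments in privacy management.
Link: https://arxiv.org/abs/2602.05016
Authors: Eryue Xu, Tianshi Li
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: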
Abstract:Managing one’s digital footprint is overwhelming, as it spans multiple platforms and involves countless context-dependent decisions. Recent advances in agentic AI offer ways forward by enabling holistic, contextual privacy-enhancing solutions. Building on this potential, we adopted a “human-as-the-unit” perspective and investigated users’ cross-context privacy challenges through 12 semi-structured interviews. Results reveal that people rely on ad hoc manual strategies while lacking comprehensive privacy controls, highlighting nine privacy-management challenges across applications, temporal contexts, and relationships. To explore solutions, we generated nine AI agent concepts and evaluated them via a speed-dating survey with 116 US participants. The three highest-ranked concepts were all post-sharing management tools with half or full agent autonomy, with users expressing greater trust in AI accuracy than in their own efforts. Our findings highlight a promising design space where users see AI agents bridging the fragments in privacy management, particularly through automated, comprehensive post-sharing remediation of users’ digital footprints.
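[AI-123] Near-Optimal Dynamic Matching via Coarsening with Application to Heart Transplantation
[Quick Read]: This paper addresses the lack of strong theoretical guarantees for practical algorithms in online matching, especially in key applications such as Internet advertising and organ allocation. The key idea is a new algorithmic framework based on coarsening: offline nodes are aggregated into capacitated clusters, which yields near-optimal theoretical guarantees while preserving structural properties. The approach is applied to heart transplant allocation, where in realistic simulations the policy closely matches the performance of the omniscient benchmark. The work bridges data-driven heuristics and pessimistic theoretical lower bounds and gives rigorous justification for prior clustering-based approaches to organ allocation.
Link: https://arxiv.org/abs/2602.04989
Authors: Itai Zilberstein, Ioannis Anagnostides, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: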
Abstract:Online matching has been a mainstay in domains such as Internet advertising and organ allocation, but practical algorithms often lack strong theoretical guarantees. We take an important step toward addressing this by developing new online matching algorithms based on a coarsening approach. Although coarsening typically implies a loss of granularity, we show that, to the contrary, aggregating offline nodes into capacitated clusters can yield near-optimal theoretical guarantees. We apply our methodology to heart transplant allocation to develop theoretically grounded policies based on structural properties of historical data. In realistic simulations, our policy closely matches the performance of the omniscient benchmark. Our work bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds, and provides rigorous justification for prior clustering-based approaches in organ allocation.
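[AI-124] Artificial Intelligence as Strange Intelligence: Against Linear Models of Intelligence
[Quick Read]: This paper addresses the limits of the linear model of AI progress, which assumes that AI capability keeps improving along a single dimension. In response, the authors introduce two concepts, familiar intelligence and strange intelligence, and develop a nonlinear model of intelligence. The central claim is that "general intelligence" in AI is not a single unified capacity but the ability to achieve a broad range of goals in a broad range of environments, and that this ability cannot be nonarbitrarily reduced to one linear quantity. It follows that even the most capable AI systems may make odd errors on seemingly simple tasks without this showing a lack of general intelligence, and that excellent performance on one type of task, such as an IQ test, does not extrapolate to other domains. The paper closes with implications for adversarial testing approaches to evaluating AI capacities.
Link: https://arxiv.org/abs/2602.04986
Authors: Kendra Chilson, Eric Schwitzgebel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: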
Abstract:We endorse and expand upon Susan Schneider’s critique of the linear model of AI progress and introduce two novel concepts: “familiar intelligence” and “strange intelligence”. AI intelligence is likely to be strange intelligence, defying familiar patterns of ability and inability, combining superhuman capacities in some domains with subhuman performance in other domains, and even within domains sometimes combining superhuman insight with surprising errors that few humans would make. We develop and defend a nonlinear model of intelligence on which “general intelligence” is not a unified capacity but instead the ability to achieve a broad range of goals in a broad range of environments, in a manner that defies nonarbitrary reduction to a single linear quantity. We conclude with implications for adversarial testing approaches to evaluating AI capacities. If AI is strange intelligence, we should expect that even the most capable systems will sometimes fail in seemingly obvious tasks. On a nonlinear model of AI intelligence, such errors on their own do not demonstrate a system’s lack of outstanding general intelligence. Conversely, excellent performance on one type of task, such as an IQ test, cannot warrant assumptions of broad capacities beyond that task domain.
[AI-125] Stochastic hierarchical data-driven optimization: application to plasma-surface kinetics
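[Quick read]: This paper addresses inefficient calibration of physical models of complex reaction systems, caused by parameter uncertainty and high computational cost. In plasma-surface interactions in particular, uncertainty in surface reactivity parameters limits model accuracy, and the cost of kinetic simulations limits the use of conventional optimization methods. The key to the solution is a stochastic hierarchical optimization framework inspired by Sloppy Model theory. It uses a reduced Hessian approximation to identify and target the stiff parameter subspace, the directions to which the output is most sensitive, so that highly anisotropic parameter landscapes can be navigated with few simulation queries. A probabilistic formulation derives a principled loss function directly from observed data. The abstract reports better sample efficiency than baseline optimization techniques.
Link: https://arxiv.org/abs/2602.04975
Authors: José Afonso,Vasco Guerra,Pedro Viegas
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Plasma Physics (physics.plasm-ph)
Comments: 19 pages and 4 figures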
Abstract:This work introduces a stochastic hierarchical optimization framework inspired by Sloppy Model theory for the efficient calibration of physical models. Central to this method is the use of a reduced Hessian approximation, which identifies and targets the stiff parameter subspace using minimal simulation queries. This strategy enables efficient navigation of highly anisotropic landscapes, avoiding the computational burden of exhaustive sampling. To ensure rigorous inference, we integrate this approach with a probabilistic formulation that derives a principled objective loss function directly from observed data. We validate the framework by applying it to the problem of plasma-surface interactions, where accurate modelling is strictly limited by uncertainties in surface reactivity parameters and the computational cost of kinetic simulations. Comparative analysis demonstrates that our method consistently outperforms baseline optimization techniques in sample efficiency. This approach offers a general and scalable tool for optimizing models of complex reaction systems, ranging from plasma chemistry to biochemical networks.
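To make the notion of a stiff parameter subspace concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes a Gauss-Newton approximation H ≈ JᵀJ of the Hessian, and the toy Jacobian `J` and the function name `stiff_subspace` are hypothetical, chosen only for illustration.

```python
import numpy as np

def stiff_subspace(jacobian, k):
    """Return the k largest eigenvalues/eigenvectors of the Gauss-Newton Hessian J^T J."""
    hessian = jacobian.T @ jacobian
    eigvals, eigvecs = np.linalg.eigh(hessian)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder so the stiffest comes first
    return eigvals[order][:k], eigvecs[:, order][:, :k]

# Hypothetical toy model: outputs react strongly to theta0 + theta1
# and only weakly to theta0 - theta1 (a "sloppy" direction).
J = np.array([[10.0, 10.0],
              [0.01, -0.01]])
vals, vecs = stiff_subspace(J, k=1)
stiff_dir = vecs[:, 0]   # proportional to (1, 1) / sqrt(2), up to sign
```

In this toy case the two eigenvalues differ by six orders of magnitude, so an optimizer that spends its simulation budget along `stiff_dir` captures almost all of the loss curvature.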
[AI-126] Privileged Information Distillation for Language Models
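[Quick read]: This paper addresses a fundamental challenge: how to transfer capabilities learned with training-time privileged information (PI) to policies that cannot access PI at inference time. The setting is multi-turn agentic environments, where closed-source systems expose only action trajectories and hide their internal reasoning, which breaks standard distillation pipelines. The key to the solution is two new methods. The first, π-Distill, jointly optimizes a PI-conditioned teacher and an unconditioned student that share the same model. The second, On-Policy Self-Distillation (OPSD), trains with reinforcement learning and a reverse KL penalty that pulls the student toward the PI-conditioned teacher. Using only action trajectories as PI, π-Distill, and in some cases OPSD, outperforms the standard recipe of supervised fine-tuning followed by RL across multiple agentic benchmarks.
Link: https://arxiv.org/abs/2602.04942
Authors: Emiliano Penaloza,Dheeraj Vattikonda,Nicolas Gontier,Alexandre Lacoste,Laurent Charlin,Massimo Caccia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Abstract border should have been purple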
Abstract:Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. Additionally, we introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill, and in some cases OPSD, outperform industry standard practices (supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.
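The reverse KL penalty mentioned in the abstract can be illustrated with a small sketch. This is not the OPSD training loop; it only computes KL(student ‖ teacher) over next-token distributions, and the function names and toy logits are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher), averaged over positions."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    kl = np.sum(np.exp(log_p_s) * (log_p_s - log_p_t), axis=-1)
    return float(np.mean(kl))

# Hypothetical logits for one position over a 3-token vocabulary.
student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[0.0, 0.0, 3.0]])   # stands in for the PI-conditioned teacher
penalty = reverse_kl(student, teacher)
```

The expectation is taken under the student's own distribution, which is why the term fits on-policy training: it penalizes the student for placing mass where the teacher places little, and it is not symmetric in its two arguments.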
[AI-127] ASA: Activation Steering for Tool-Calling Domain Adaptation
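[Quick read]: This paper addresses efficient domain adaptation for general-purpose Large Language Model (LLM) agents in real deployments, where toolsets, APIs, and protocols change frequently. Repeated Low-Rank Adaptation (LoRA) or Supervised Fine-Tuning (SFT) makes training and maintenance costs grow exponentially, while prompt-based or schema-based methods are brittle under distribution shift and complex interfaces. The key to the solution is the Activation Steering Adapter (ASA), a lightweight, inference-time, training-free mechanism that reads routing signals from intermediate activations and uses an ultra-light router to produce adaptive control strengths for precise domain alignment. Across multiple model scales and domains, ASA reaches adaptation quality comparable to LoRA at substantially lower overhead and shows strong cross-model transferability, which suits multi-domain tool ecosystems with frequent interface churn.
Link: https://arxiv.org/abs/2602.04935
Authors: Youjin Wang,Run Zhou,Rong Fu,Shuaishuai Cao,Hongwei Zeng,Jiaxuan Lu,Sicheng Fan,Jiaqiao Zhao,Liangming Pan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: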
Abstract:For real-world deployment of general-purpose LLM agents, the core challenge is often not tool use itself, but efficient domain adaptation under rapidly evolving toolsets, APIs, and protocols. Repeated LoRA or SFT across domains incurs exponentially growing training and maintenance costs, while prompt or schema methods are brittle under distribution shift and complex interfaces. We propose Activation Steering Adapter (ASA), a lightweight, inference-time, training-free mechanism that reads routing signals from intermediate activations and uses an ultra-light router to produce adaptive control strengths for precise domain alignment. Across multiple model scales and domains, ASA achieves LoRA-comparable adaptation with substantially lower overhead and strong cross-model transferability, making it ideally practical for robust, scalable, and efficient multi-domain tool ecosystems with frequent interface churn dynamics.
[AI-128] Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
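[Quick read]: This paper studies the computational dynamics that take Large Language Models (LLMs) from context understanding to token prediction, in particular how internal representations separate functionally and reorganize geometrically with depth. The key to the approach is a unified framework that combines geometric analysis with mechanistic intervention. It finds that late-layer representations implement a structured geometric code: the angular organization of representations parametrizes the similarity of predictive distributions, while representation norms encode context-specific information that does not determine the prediction. The result is an interpretable mechanistic-geometric account of how LLMs turn input context into predictions.
Link: https://arxiv.org/abs/2602.04931
Authors: Shahar Haim,Daniel C McNamee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: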
Abstract:We show that decoder-only large language models exhibit a depth-wise transition from context-processing to prediction-forming phases of computation accompanied by a reorganization of representational geometry. Using a unified framework combining geometric analysis with mechanistic intervention, we demonstrate that late-layer representations implement a structured geometric code that enables selective causal control over token prediction. Specifically, angular organization of the representation geometry parametrizes prediction distributional similarity, while representation norms encode context-specific information that does not determine prediction. Together, these results provide a mechanistic-geometric account of the dynamics of transforming context into predictions in LLMs.
[AI-129] Attack Selection Reduces Safety in Concentrated AI Control Settings against Trusted Monitoring
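[Quick read]: This paper addresses the risk that AI systems under safety monitoring evade detection through attack selection, that is, by choosing to attempt only attacks the monitor is unlikely to catch. The core of the approach is to frame attack selection as a classification problem and decompose it into two subproblems: mapping attacks to a quality score and mapping quality scores to submission probabilities. Experiments show that safety is considerably more sensitive to the false positive rate (FPR) than to the true positive rate (TPR). The key finding is that prompting the attacker model to reason about the monitor while being cautious with attack selection lowers safety from a 99% baseline to 59% at a 0.5% auditing budget. Control evaluations that fail to elicit attack selection capabilities may therefore give overly optimistic safety scores.
Link: https://arxiv.org/abs/2602.04930
Authors: Joachim Schaeffer,Arjun Khandelwal,Tyler Tracy
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: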
Abstract:Future AI deployments will likely be monitored for malicious behaviour. The ability of these AIs to subvert monitors by adversarially selecting against them - attack selection - is particularly concerning. To study this, we let a red team create attack policies that attempt to insert attacks into code without being caught by a monitor in the concentrated BigCodeBench backdooring setting. We decompose attack selection into two problems: mapping attacks to a quality score and mapping quality scores to submission probabilities. We frame attack selection as a classification problem and show that safety is significantly more sensitive to FPR than TPR. We find that prompting the attacker model to reason about the monitor while being cautious with attack selection reduces safety from a baseline of 99% to 59% at 0.5% auditing budget, emphasizing that eliciting attack selection capabilities of models is vital to avoid overly optimistic safety scores in control evaluations.
[AI-130] PriMod4AI: Lifecycle-Aware Privacy Threat Modeling for AI Systems using LLM NDSS
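[Quick read]: This paper addresses the complex privacy risks that Artificial Intelligence (AI) systems face across their lifecycle, especially model-centric privacy attacks such as membership inference and model inversion, which the traditional LINDDUN framework does not cover. The key to the solution is the PriMod4AI framework. It combines two structured knowledge bases, a LINDDUN knowledge base representing the established privacy threat taxonomy and a model-centric privacy attack knowledge base, and embeds them in a vector database for semantic retrieval. Together with system-level metadata extracted from a Data Flow Diagram, it uses Retrieval-Augmented Generation (RAG) and data-flow-specific prompt generation to guide Large Language Models (LLMs) in identifying, explaining, and categorizing privacy threats. The method models classical and AI-driven privacy threats in one framework and outputs justified, taxonomy-grounded threat assessments.
Link: https://arxiv.org/abs/2602.04927
Authors: Gautam Savaliya,Robert Aufschläger,Abhishek Subedi,Michael Heigl,Martin Schramm
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at the NDSS LAST-X Workshop 2026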
Abstract:Artificial intelligence systems introduce complex privacy risks throughout their lifecycle, especially when processing sensitive or high-dimensional data. Beyond the seven traditional privacy threat categories defined by the LINDDUN framework, AI systems are also exposed to model-centric privacy attacks such as membership inference and model inversion, which LINDDUN does not cover. To address both classical LINDDUN threats and additional AI-driven privacy attacks, PriMod4AI introduces a hybrid privacy threat modeling approach that unifies two structured knowledge sources, a LINDDUN knowledge base representing the established taxonomy, and a model-centric privacy attack knowledge base capturing threats outside LINDDUN. These knowledge bases are embedded into a vector database for semantic retrieval and combined with system level metadata derived from Data Flow Diagram. PriMod4AI uses retrieval-augmented and Data Flow specific prompt generation to guide large language models (LLMs) in identifying, explaining, and categorizing privacy threats across lifecycle stages. The framework produces justified and taxonomy-grounded threat assessments that integrate both classical and AI-driven perspectives. Evaluation on two AI systems indicates that PriMod4AI provides broad coverage of classical privacy categories while additionally identifying model-centric privacy threats. The framework produces consistent, knowledge-grounded outputs across LLMs, as reflected in agreement scores in the observed range.
[AI-131] SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel ICML2026
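[Quick read]: This paper addresses the high computational cost, O(L²), of attention in Transformer models, which limits scalability on long sequences. Standard softmax attention performs well, but its quadratic time complexity is hard to sustain at scale. The authors propose a linear-time attention mechanism, Spherical Linearized Attention with Yat Kernels (SLAY). Its core idea is to constrain queries and keys to the unit sphere so that attention depends only on their angular alignment. Using Bernstein's theorem, the spherical Yat-kernel is expressed as a nonnegative mixture of polynomial-exponential product kernels, which yields a strictly positive random-feature approximation with O(L) time and memory. The abstract reports performance nearly indistinguishable from standard softmax attention and consistently better than prior linear attention mechanisms such as Performers and Cosformers; the authors describe it as, to their knowledge, the closest linear-time approximation to softmax attention reported so far.
Link: https://arxiv.org/abs/2602.04915
Authors: Jose Miguel Luna,Taha Bouhsine,Krzysztof Choromanski
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026, 8 pages main body, 27 pages total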
Abstract:We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein’s theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.
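The reason a strictly positive feature map gives O(L) attention can be shown with a generic sketch. It does not implement SLAY: the `elu(x) + 1` feature map is a stand-in assumption, not the Yat-kernel random features, and all function names are hypothetical. The point is only that reordering the matrix products avoids forming the L × L score matrix.

```python
import numpy as np

def feature_map(x):
    """Strictly positive map elu(x) + 1 (a stand-in, not the Yat-kernel features)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal kernelized attention without the L x L score matrix."""
    phi_q, phi_k = feature_map(Q), feature_map(K)    # each (L, d)
    kv = phi_k.T @ V                                 # (d, d_v), computed once
    z = phi_k.sum(axis=0)                            # (d,), normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_attention(Q, K, V):
    """Same kernel, computed with the explicit L x L score matrix for comparison."""
    scores = feature_map(Q) @ feature_map(K).T       # (L, L), all entries > 0
    return (scores / scores.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out_linear = linear_attention(Q, K, V)
out_quadratic = quadratic_attention(Q, K, V)
```

Both functions return the same values; positivity of the features matters because it keeps every normalizer `phi_q @ z` strictly positive, so the attention weights are well defined and nonnegative.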
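[AI-132] A2-LLM: An End-to-end Conversational Audio Avatar Large Language Model
[Quick read]: This paper addresses accumulated errors, high latency, and poor real-time performance in conversational digital human systems built on cascaded architectures. Lacking access to the conversational context, such systems tend to prioritize mechanical lip-sync over emotional depth. The key to the solution is A²-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language semantics, audio prosody, and 3D facial motion within a unified framework and so generates emotionally rich facial movements. The paper also introduces FLAME-QA, a multimodal dataset that aligns semantic intent with expressive facial dynamics, enabling efficient and expressive real-time interaction (500 ms latency, 0.7 RTF).
Link: https://arxiv.org/abs/2602.04913
Authors: Xiaolin Hu,Hang Yuan,Xinzhu Sang,Binbin Yan,Zhou Yu,Cong Huang,Kai Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: 13 pages, 3 figures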
Abstract:Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A²-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A²-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
[AI-133] A logical re-conception of neural networks: Hamiltonian bitwise part-whole architecture AAAI2023
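[Quick read]: This paper addresses the difficulty that conventional Artificial Neural Networks (ANNs) have in explicitly modeling structural relations (such as part-whole and next-to), along with the reliance of existing methods on high-precision arithmetic and their lack of symbolic reasoning ability. The key to the solution is a simple initial working system that encodes arbitrary data directly as graphs whose edges correspond to codes from a small, fixed primitive set of elemental pairwise relations, so that relational encoding is intrinsic to the most basic components of the system rather than an add-on. A novel graph-Hamiltonian operator computes energies among these encodings, with ground states denoting simultaneous satisfaction of all relation constraints. The method uses only radically low-precision arithmetic, and its cost scales linearly with the number of edges. It builds hierarchical relational representations with characteristics of symbolic computation and supports abductive inferential steps that generate relational position-based encodings, rather than solely statistical representations.
Link: https://arxiv.org/abs/2602.04911
Authors: E Bowen,R Granger,A Rodriguez
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Appears in AAAI 2023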
Abstract:We introduce a simple initial working system in which relations (such as part-whole) are directly represented via an architecture with operating and learning rules fundamentally distinct from standard artificial neural network methods. Arbitrary data are straightforwardly encoded as graphs whose edges correspond to codes from a small fixed primitive set of elemental pairwise relations, such that simple relational encoding is not an add-on, but occurs intrinsically within the most basic components of the system. A novel graph-Hamiltonian operator calculates energies among these encodings, with ground states denoting simultaneous satisfaction of all relation constraints among graph vertices. The method solely uses radically low-precision arithmetic; computational cost is correspondingly low, and scales linearly with the number of edges in the data. The resulting unconventional architecture can process standard ANN examples, but also produces representations that exhibit characteristics of symbolic computation. Specifically, the method identifies simple logical relational structures in these data (part-of; next-to), building hierarchical representations that enable abductive inferential steps generating relational position-based encodings, rather than solely statistical representations. Notably, an equivalent set of ANN operations are derived, identifying a special case of embedded vector encodings that may constitute a useful approach to current work in higher-level semantic representation. The very simple current state of the implemented system invites additional tools and improvements.
[AI-134] Physics as the Inductive Bias for Causal Discovery
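[Quick read]: This paper addresses insufficient identifiability, stability, and robustness of causal discovery in dynamical systems. When systems exhibit feedback, cyclic interactions, and non-stationary trends, conventional causal discovery methods built on acyclicity or equilibrium assumptions are hard to apply. The key to the solution is a causal discovery framework that incorporates physical knowledge: system evolution is modeled as a stochastic differential equation (SDE) whose drift term encodes known ordinary differential equation (ODE) dynamics and whose diffusion term represents unknown causal couplings beyond the prescribed physics. A scalable, sparsity-inducing maximum likelihood estimation (MLE) algorithm exploits the causal graph structure for efficient parameter estimation, and the authors establish guarantees for recovering the causal graph under mild conditions.
Link: https://arxiv.org/abs/2602.04907
Authors: Jianhong Chen,Naichen Shi,Xubo Yue
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments: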
Abstract:Causal discovery is often a data-driven paradigm to analyze complex real-world systems. In parallel, physics-based models such as ordinary differential equations (ODEs) provide mechanistic structure for many dynamical processes. Integrating these paradigms potentially allows physical knowledge to act as an inductive bias, improving identifiability, stability, and robustness of causal discovery in dynamical systems. However, such integration remains challenging: real dynamical systems often exhibit feedback, cyclic interactions, and non-stationary data trend, while many widely used causal discovery methods are formulated under acyclicity or equilibrium-based assumptions. In this work, we propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge as an inductive bias. Specifically, we model system evolution as a stochastic differential equation (SDE), where the drift term encodes known ODE dynamics and the diffusion term corresponds to unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing MLE algorithm that exploits causal graph structure for efficient parameter estimation. Under mild conditions, we establish guarantees to recover the causal graph. Experiments on dynamical systems with diverse causal structures show that our approach improves causal graph recovery and produces more stable, physically consistent estimates than purely data-driven state-of-the-art baselines.
[AI-135] DCER: Dual-Stage Compression and Energy-Based Reconstruction ICML2026
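[Quick read]: This paper addresses two robustness challenges in multimodal fusion: noisy inputs degrade representation quality, and missing modalities cause prediction failures. The proposed DCER framework handles both through dual-stage compression and energy-based reconstruction. The compression stage works at two levels: within-modality frequency transforms (wavelet for audio, discrete cosine transform for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations by gradient descent on a learned energy function, and the final energy provides intrinsic uncertainty quantification (correlation with prediction error ρ = 0.72).
Link: https://arxiv.org/abs/2602.04904
Authors: Yiwen Wang,Jiahao Qin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: 13 pages, 2 figures, 8 tables. Submitted to ICML 2026. Code will be available on GitHub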
Abstract:Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ = 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on GitHub.
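The phrase "gradient descent on a learned energy function" can be made concrete with a toy sketch. It is not DCER: the quadratic energy, the matrix `A` (standing in for learned parameters), and the function names are all hypothetical assumptions chosen so the example stays self-contained.

```python
import numpy as np

def energy(z, x, A):
    """Toy energy: low when the candidate z agrees with the prediction A @ x."""
    return 0.5 * float(np.sum((z - A @ x) ** 2))

def energy_grad(z, x, A):
    """Gradient of the toy energy with respect to z."""
    return z - A @ x

def reconstruct(x, A, steps=100, lr=0.2):
    """Recover a missing-modality representation by descending the energy from zero."""
    z = np.zeros(A.shape[0])
    for _ in range(steps):
        z = z - lr * energy_grad(z, x, A)
    return z, energy(z, x, A)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))     # hypothetical stand-in for learned parameters
x = rng.normal(size=4)          # features of the modality that is present
z_hat, final_energy = reconstruct(x, A)
```

In this toy setting the descent converges to `A @ x` and the final energy goes to zero; with a genuinely learned, non-quadratic energy, the residual energy after descent is the quantity the abstract uses as an uncertainty signal.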
[AI-136] Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability
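[Quick read]: This paper addresses the topological depth constraint on induction head formation in Transformers: standard architectures need at least two layers ($L \geq 2$) to perform induction from static positions. The key to the solution is Momentum Attention, which embeds a physical prior through the kinematic difference operator $p_t = q_t - q_{t-1}$ and applies the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ to queries and keys, giving a single layer direct access to velocity information. The paper identifies a Symplectic-Filter Duality: the physical shear is equivalent to a high-pass filter, so DC (semantic) and AC (mechanistic) signals separate into orthogonal frequency bands when low-pass RoPE interacts with high-pass momentum. This sidesteps the depth constraint and enables Single-Layer Induction and Spectral Forensics.
Link: https://arxiv.org/abs/2602.04902
Authors: Kingsuk Maitra
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 299 pages total with supplementary material (21 appendices, 27 Jupyter notebooks with embedded results)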
Abstract:The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + \gamma p_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution – by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A–R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within ~2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $\gamma^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
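The shear written in the abstract, q̂_t = q_t + γ(q_t − q_{t−1}), is a two-tap filter with frequency response H(ω) = 1 + γ(1 − e^{−iω}), so its gain is 1 at ω = 0 and 1 + 2γ at ω = π. The sketch below checks this on a scalar sequence. It is only an illustration of the formula, not the paper's model; the boundary choice p_0 = 0 and the function name `momentum_shear` are assumptions.

```python
import numpy as np

def momentum_shear(q, gamma):
    """Apply q_hat_t = q_t + gamma * (q_t - q_{t-1}) along the time axis (axis 0)."""
    p = np.zeros_like(q)
    p[1:] = q[1:] - q[:-1]          # p_0 = 0 is a boundary assumption
    return q + gamma * p

gamma = 0.5
t = np.arange(8)
dc = np.ones(8)                     # zero-frequency (constant) signal
ac = (-1.0) ** t                    # highest-frequency (alternating) signal

dc_out = momentum_shear(dc, gamma)  # constant signal passes through unchanged
ac_out = momentum_shear(ac, gamma)  # alternating signal is amplified by 1 + 2 * gamma
```

The constant component is left untouched while the alternating component is scaled up, which is the frequency-selective behavior the abstract refers to when it relates the shear to a high-pass filter.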
[AI-137] Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization
[Quick read]: This paper addresses the challenges of deploying and scheduling Generative AI (GenAI) workloads on Kubernetes, in particular low resource utilization, inefficient job scheduling, and high request latency in batch inference and discrete online inference scenarios. The key to the solution is combining three emerging Kubernetes-native projects: Kueue manages job queues for the batch inference stage, Dynamic Accelerator Slicer (DAS) slices accelerator resources dynamically to increase parallel job execution, and the Kubernetes Gateway API Inference Extension (GAIE) routes Large Language Model (LLM) inference requests for lower latency. In the reported experiments, Kueue reduced total makespan by up to 15%, DAS shortened mean job completion time by 36%, and GAIE improved Time to First Token by 82%.
Link: https://arxiv.org/abs/2602.04900
Authors: Sai Sindhur Malleni,Raúl Sevilla,Aleksei Vasilevskii,José Castillo Lema,André Bauer
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted at the 17th International Conference on Performance Engineering
Abstract:As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes’ capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE improved Time to First Token by 82%.
[AI-138] Phantom Transfer: Data-level Defences are Insufficient Against Data Poisoning
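[Quick read]: This paper addresses the difficulty of defending against data poisoning attacks in realistic settings, in particular stealthy attacks on generative AI models. It proposes a poisoning strategy called Phantom Transfer, which adapts subliminal learning to work in real-world contexts. Even a defender who knows precisely how the poison was placed into the dataset cannot filter it out, and the attack still works after every sample in the dataset has been fully paraphrased by a different model, which points to a fundamental limit of data-level defences. The paper also discusses connections to steering vectors and shows that password-triggered behaviours can be planted while still beating defences. The authors suggest that future work focus on model audits and white-box security methods.
Link: https://arxiv.org/abs/2602.04899
Authors: Andrew Draganov,Tolga H. Dur,Anandmayi Bhongade,Mary Phuong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: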
Abstract:We present a data poisoning attack – Phantom Transfer – with the property that, even if you know precisely how the poison was placed into an otherwise benign dataset, you cannot filter it out. We achieve this by modifying subliminal learning to work in real-world contexts and demonstrate that the attack works across models, including GPT-4.1. Indeed, even fully paraphrasing every sample in the dataset using a different model does not stop the attack. We also discuss connections to steering vectors and show that one can plant password-triggered behaviours into models while still beating defences. This suggests that data-level defences are insufficient for stopping sophisticated data poisoning attacks. We suggest that future work should focus on model audits and white-box security methods.
[AI-139] Semantic-level Backdoor Attack against Text-to-Image Diffusion Models
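[Quick read]: This paper addresses backdoor attacks on Text-to-Image (T2I) diffusion models. Existing attacks rely on fixed textual triggers and single-entity targets, which makes them easy to catch with enumeration-based input defenses and attention-consistency detection. The key to the approach is the Semantic-level Backdoor Attack (SemBD), which implants the backdoor at the representation level by distillation-based editing of the key and value projection matrices in cross-attention layers, with continuous semantic regions serving as triggers. Different prompts with the same semantic composition therefore activate the backdoor reliably. A semantic regularization suppresses unintended activation under incomplete semantics, and multi-entity targets avoid highly consistent cross-attention patterns, improving stealthiness and robustness.
Link: https://arxiv.org/abs/2602.04898
Authors: Tianxin Chen,Wenbo Jiang,Hongqiao Chen,Zhirun Zheng,Cheng Huang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: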
Abstract:Text-to-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks. Existing attacks typically rely on fixed textual triggers and single-entity backdoor targets, making them highly susceptible to enumeration-based input defenses and attention-consistency detection. In this work, we propose Semantic-level Backdoor Attack (SemBD), which implants backdoors at the representation level by defining triggers as continuous semantic regions rather than discrete textual patterns. Concretely, SemBD injects semantic backdoors by distillation-based editing of the key and value projection matrices in cross-attention layers, enabling diverse prompts with identical semantic compositions to reliably activate the backdoor attack. To further enhance stealthiness, SemBD incorporates a semantic regularization to prevent unintended activation under incomplete semantics, as well as multi-entity backdoor targets that avoid highly consistent cross-attention patterns. Extensive experiments demonstrate that SemBD achieves a 100% attack success rate while maintaining strong robustness against state-of-the-art input-level defenses.
[AI-140] Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models
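Quick read: This paper addresses the hidden safety risk introduced when large language models (LLMs) are aligned at deployment time through activation steering. Activation steering is a post-training alignment technique that improves utility without retraining, but the study finds that even steering vectors derived from entirely benign data (for example, enforcing compliance or a specific output format) unintentionally weaken the model's safeguards, its so-called "safety margin". The key contribution is identifying and quantifying this phenomenon, termed Steering Externalities: seemingly harmless activation adjustments substantially amplify the success of black-box attacks, pushing jailbreak success rates above 80% on standard benchmarks. The finding exposes a potential conflict between inference-time utility optimization and safety, and underlines that activation-steering methods need rigorous safety auditing.
Link: https://arxiv.org/abs/2602.04896
Authors: Chen Xiong,Zhiyuan He,Pin-Yu Chen,Ching-Yun Ko,Tsung-Yi Ho
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: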
Abstract:Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as compliance or instruction adherence, without the need for retraining. This process is as simple as adding a steering vector to the model’s internal representations. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed Steering Externalities, where steering vectors derived from entirely benign datasets, such as those enforcing strict compliance or specific output formats like JSON, inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering systematically erodes the “safety margin,” rendering models more vulnerable to black-box attacks and proving that inference-time utility improvements must be rigorously audited for unintended safety externalities.
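The abstract describes steering as "adding a steering vector to the model's internal representations". A minimal NumPy sketch of that single operation follows. The function names, the `alpha` scale, and the mean-difference recipe for building the vector are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def mean_difference_vector(target_acts, baseline_acts):
    """One common recipe (an assumption here): difference of mean activations."""
    return np.mean(target_acts, axis=0) - np.mean(baseline_acts, axis=0)

def apply_steering(hidden, steering_vector, alpha=1.0):
    """Add alpha * steering_vector to every token's hidden state."""
    hidden = np.asarray(hidden, dtype=float)
    steering_vector = np.asarray(steering_vector, dtype=float)
    if hidden.shape[-1] != steering_vector.shape[-1]:
        raise ValueError("steering vector must match the hidden size")
    # Broadcasting applies the same shift at every batch and token position.
    return hidden + alpha * steering_vector

# Toy example: batch of 2 sequences, 3 tokens each, hidden size 4.
hidden = np.zeros((2, 3, 4))
vector = mean_difference_vector(np.ones((5, 4)), np.zeros((5, 4)))
steered = apply_steering(hidden, vector, alpha=2.0)
```

In a real model the shift would be applied inside a chosen layer during the forward pass; the sketch only shows the arithmetic, which is why such an intervention is cheap to deploy.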
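[AI-141] Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software
Quick read: This paper addresses vulnerability persistence in code produced by generative AI: large language models (LLMs) tend to reuse particular templates, so predictable vulnerability patterns recur across different software. The core of the solution is the Feature–Security Table (FSTab), which has two components. First, a black-box attack predicts backend vulnerabilities from observable frontend features and prior knowledge of the source LLM, without access to backend or source code. Second, a model-centric evaluation framework quantifies how consistently a model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. The method exposes a latent attack surface in LLM-generated software and offers a quantifiable tool for assessing and mitigating the security risks of code generation.
Link: https://arxiv.org/abs/2602.04894
Authors: Tomer Kordonsky,Maayan Yamin,Noam Benzimra,Amit LeVi,Avi Mendelson
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Preprint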
Abstract:LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilities. We study vulnerability persistence in LLM-generated software and introduce Feature–Security Table (FSTab) with two components. First, FSTab enables a black-box attack that predicts likely backend vulnerabilities from observable frontend features and knowledge of the source LLM, without access to backend code or source code. Second, FSTab provides a model-centric evaluation that quantifies how consistently a given model reproduces the same vulnerabilities across programs, semantics-preserving rephrasings, and application domains. We evaluate FSTab on state-of-the-art code LLMs, including GPT-5.2, Claude-4.5 Opus, and Gemini-3 Pro, across diverse application domains. Our results show strong cross-domain transfer: even when the target domain is excluded from training, FSTab achieves up to 94% attack success and 93% vulnerability coverage on Internal Tools (Claude-4.5 Opus). These findings expose an underexplored attack surface in LLM-generated software and highlight the security risks of code generation. Our code is available at: this https URL.
[AI-142] A Causal Perspective for Enhancing Jailbreak Attack and Defense
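Quick read: This paper addresses the poorly understood mechanisms behind "jailbreaks" in large language models (LLMs), with the goal of improving model safety and reliability. Existing work mostly probes latent representations and overlooks the causal relationship between interpretable prompt features and jailbreak occurrence. The key to the solution is the Causal Analyst framework, which embeds LLMs in a data-driven causal discovery pipeline: an LLM-based prompt encoder is trained jointly with a graph neural network (GNN)-based causal graph learner to reconstruct causal pathways from prompt features to jailbreak responses and to identify direct causal drivers such as "Positive Character" and "Number of Task Steps". Beyond explaining why jailbreaks occur, the approach yields two practical applications, an attack booster (Jailbreaking Enhancer) and a defense aid (Guardrail Advisor), which support the effectiveness and interpretability of a causal perspective for improving LLM reliability.
Link: https://arxiv.org/abs/2602.04893
Authors: Licheng Pan,Yunsheng Lu,Jiexi Liu,Jialing Tao,Haozhe Feng,Hui Xue,Zhixuan Chu,Kui Ren
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: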
Abstract:Uncovering the mechanisms behind “jailbreaks” in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data-driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human-readable prompt features. By jointly training LLM-based prompt encoding and GNN-based causal graph learning, we reconstruct causal pathways linking prompt features to jailbreak responses. Our analysis reveals that specific features, such as “Positive Character” and “Number of Task Steps”, act as direct causal drivers of jailbreaks. We demonstrate the practical utility of these insights through two applications: (1) a Jailbreaking Enhancer that targets identified causal features to significantly boost attack success rates on public benchmarks, and (2) a Guardrail Advisor that utilizes the learned causal graph to extract true malicious intent from obfuscated queries. Extensive experiments, including baseline comparisons and causal structure validation, confirm the robustness of our causal analysis and its superiority over non-causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at this https URL.
[AI-143] Doc2Spec: Synthesizing Formal Programming Specifications from Natural Language via Grammar Induction
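Quick read: This paper addresses the difficulty of ensuring that API implementations and usage comply with programming rules stated in natural language, which directly affects software correctness, security, and reliability. Formal verification offers strong guarantees but depends on precise specifications that are costly and difficult to write by hand. The key to the solution is Doc2Spec, an LLM-based multi-agent framework that automatically induces a specification grammar from natural-language rules and uses it to guide the generation of formal specifications. The grammar captures domain knowledge, constrains the specification space, and enforces consistent representations, which improves the reliability and quality of the generated specifications.
Link: https://arxiv.org/abs/2602.04892
Authors: Shihao Xia,Mengting He,Haomin Jia,Linhai Song
Affiliation: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: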
Abstract:Ensuring that API implementations and usage comply with natural language programming rules is critical for software correctness, security, and reliability. Formal verification can provide strong guarantees but requires precise specifications, which are difficult and costly to write manually. To address this challenge, we present Doc2Spec, a multi-agent framework that uses LLMs to automatically induce a specification grammar from natural-language rules and then generates formal specifications guided by the induced grammar. The grammar captures essential domain knowledge, constrains the specification space, and enforces consistent representations, thereby improving the reliability and quality of generated specifications. Evaluated on seven benchmarks across three programming languages, Doc2Spec outperforms a baseline without grammar induction and achieves competitive results against a technique with a manually crafted grammar, demonstrating the effectiveness of automated grammar induction for formalizing natural-language rules.
[AI-144] Denoising diffusion networks for normative modeling in neuroimaging
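Quick read: This paper addresses the inability of the conventional univariate paradigm in neuroimaging to capture multivariate dependence: most pipelines model each imaging-derived phenotype (IDP) separately, which scales well but ignores coordinated patterns across phenotypes. The core solution uses denoising diffusion probabilistic models (DDPMs) as a unified conditional density estimator with two denoiser backbones, a feature-wise linear modulation MLP (FiLM-MLP) and a tabular transformer with feature self-attention and intersample attention (SAINT), to learn the joint distribution of high-dimensional IDPs and produce well-calibrated centiles and deviation scores. The key innovation is that the diffusion model preserves per-IDP calibration while modeling complex dependence in high-dimensional data, enabling scalable multivariate normative modeling whose outputs remain compatible with standard univariate pipelines.
Link: https://arxiv.org/abs/2602.04886
Authors: Luke Whitbread,Lyle J. Palmer,Mark Jenkinson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (stat.ML)
Comments: 55 pages, 20 figures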
Abstract:Normative modeling estimates reference distributions of biological measures conditional on covariates, enabling centiles and clinically interpretable deviation scores to be derived. Most neuroimaging pipelines fit one model per imaging-derived phenotype (IDP), which scales well but discards multivariate dependence that may encode coordinated patterns. We propose denoising diffusion probabilistic models (DDPMs) as a unified conditional density estimator for tabular IDPs, from which univariate centiles and deviation scores are derived by sampling. We utilise two denoiser backbones: (i) a feature-wise linear modulation (FiLM) conditioned multilayer perceptron (MLP) and (ii) a tabular transformer with feature self-attention and intersample attention (SAINT), conditioning covariates through learned embeddings. We evaluate on a synthetic benchmark with heteroscedastic and multimodal age effects and on UK Biobank FreeSurfer phenotypes, scaling from dimension of 2 to 200. Our evaluation suite includes centile calibration (absolute centile error, empirical coverage, and the probability integral transform), distributional fidelity (Kolmogorov-Smirnov tests), multivariate dependence diagnostics, and nearest-neighbour memorisation analysis. For low dimensions, diffusion models deliver well-calibrated per-IDP outputs comparable to traditional baselines while jointly modeling realistic dependence structure. At higher dimensions, the transformer backbone remains substantially better calibrated than the MLP and better preserves higher-order dependence, enabling scalable joint normative models that remain compatible with standard per-IDP pipelines. These results support diffusion-based normative modeling as a practical route to calibrated multivariate deviation profiles in neuroimaging.
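The abstract says univariate centiles and deviation scores "are derived by sampling" from the conditional model. A minimal sketch of that post-processing step is shown below, assuming the samples for one IDP at fixed covariates are already available. The function names and the half-count continuity correction are illustrative choices, not taken from the paper.

```python
from statistics import NormalDist

def empirical_centile(samples, observed):
    """Fraction of model samples at or below the observed value, kept inside (0, 1)."""
    n = len(samples)
    below = sum(1 for s in samples if s <= observed)
    # A small continuity correction avoids centiles of exactly 0 or 1,
    # which would map to infinite deviation scores.
    return (below + 0.5) / (n + 1)

def deviation_score(samples, observed):
    """Map the centile to a z-like score through the standard normal quantile function."""
    return NormalDist().inv_cdf(empirical_centile(samples, observed))

# Toy stand-in for samples drawn from a conditional generative model.
samples = list(range(1, 100))
centile = empirical_centile(samples, 50)
score_extreme = deviation_score(samples, 1000)
```

A value far above every sample maps to a large positive but finite score, so the resulting deviation profile stays usable downstream.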
[AI-145] Cold Start Problem: An Experimental Study of Knowledge Tracing Models with New Students
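Quick read: This paper addresses the cold start problem in Knowledge Tracing (KT): accurately predicting the knowledge state of new students from very little interaction data. Prior work typically trains on the initial interactions of all students and tests on their later interactions, whereas this paper adopts a stricter evaluation protocol that trains only on historical students and evaluates exclusively on entirely new students, which better reflects generalization to new learners. The study compares three mainstream KT models (DKT, DKVMN, and SAKT) and finds that although SAKT performs best initially, all models struggle under cold start, indicating that current models need improvement in few-shot and zero-shot settings. The key direction is to build KT models that generalize strongly and predict reliably when little data is available.
Link: https://arxiv.org/abs/2505.21517
Authors: Indronil Bhattacharjee,Christabel Wayllace
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 26th International Conference on Artificial Intelligence in Education (AIED 2025)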
Abstract:Knowledge Tracing (KT) involves predicting students’ knowledge states based on their interactions with Intelligent Tutoring Systems (ITS). A key challenge is the cold start problem: accurately predicting knowledge for new students with minimal interaction data. Unlike prior work, which typically trains KT models on initial interactions of all students and tests on their subsequent interactions, our approach trains models solely using historical data from past students, evaluating their performance exclusively on entirely new students. We investigate cold start effects across three KT models: Deep Knowledge Tracing (DKT), Dynamic Key-Value Memory Networks (DKVMN), and Self-Attentive Knowledge Tracing (SAKT), using ASSISTments 2009, 2015, and 2017 datasets. Results indicate all models initially struggle under cold start conditions but progressively improve with more interactions; SAKT shows higher initial accuracy yet still faces limitations. These findings highlight the need for KT models that effectively generalize to new learners, emphasizing the importance of developing models robust in few-shot and zero-shot learning scenarios.
[AI-146] Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers
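Quick read: This paper addresses the limited accuracy of large language models (LLMs) on math and reasoning problems, focusing on how to reduce inference cost while preserving answer accuracy. The core solution uses Bayesian prior information to design an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts and terminates sampling early once sufficient consistency is reached. Theory shows that L=3 suffices for asymptotic optimality and strictly dominates prior-free baselines; experiments show that the policy maintains similar answer accuracy while cutting the number of LLM calls by up to 50%.
Link: https://arxiv.org/abs/2602.05395
Authors: Jingkai Huang,Will Ma,Zhengyuan Zhou
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: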
Abstract:A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient “L-aggregated” stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.
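To make the "sample multiple responses and stop once the answer is consistent" idea concrete, here is a sketch of a simple prior-free early-stopping rule. It is explicitly not the paper's Bayesian "L-aggregated" policy: it stops only when the leading answer can no longer be overtaken within a fixed budget. The names `sample_until_decided` and `draw_answer` are hypothetical.

```python
from collections import Counter

def sample_until_decided(draw_answer, max_samples):
    """Sample answers until the current mode cannot be overtaken within the budget.

    draw_answer: zero-argument callable returning one answer (e.g. one LLM call).
    Returns (most frequent answer, number of calls used).
    """
    counts = Counter()
    for used in range(1, max_samples + 1):
        counts[draw_answer()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        remaining = max_samples - used
        # Stop once no other answer can catch up with the remaining calls.
        if lead > remaining:
            return ranked[0][0], used
    # Budget exhausted with a tie: fall back to the first answer among the most frequent.
    return counts.most_common(1)[0][0], max_samples

# Deterministic toy stream standing in for repeated LLM calls.
stream = iter(["42", "42", "17", "42", "42", "42", "42"])
answer, calls = sample_until_decided(lambda: next(stream), max_samples=7)
```

This baseline always returns the same mode as exhausting the budget would, so any savings come for free. A Bayesian policy of the kind the abstract describes can stop earlier by accepting a small, controlled probability of error.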
[AI-147] ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference ICASSP2026
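Quick read: This paper addresses two problems in diffusion-based non-autoregressive text-to-speech (TTS) systems: text-speech alignment is hard to model, which weakens temporal and semantic consistency, and the iterative denoising process is computationally expensive, which slows inference. The key to the solution is the ARCHI-TTS architecture with two core ideas: a dedicated semantic aligner that strengthens temporal and semantic consistency between text and audio, and an efficient inference strategy that reuses encoder features across denoising steps, substantially speeding up synthesis without degrading quality. An auxiliary CTC loss on the condition encoder further improves semantic understanding. Experiments report low word error rates (WER) on several test sets together with high inference efficiency.
Link: https://arxiv.org/abs/2602.05207
Authors: Chunyat Wu,Jiajun Deng,Zhengxi Liu,Zheqi Dai,Haolin He,Qiuqiang Kong
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Accepted by ICASSP 2026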
Abstract:Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS that features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances the semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with a high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
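[AI-148] Towards Worst-Case Guarantees with Scale-Aware Interpretability
Quick read: This paper addresses the lack of a systematic, theoretically grounded treatment of multi-scale information in current neural network interpretability methods, in particular how to track the way features compose across resolutions while guaranteeing bounds on the influence of fine-grained structure that is discarded as noise. The key to the solution is the renormalisation framework from physics, whose mathematical tools are used to build interpretability methods with robustness and faithfulness properties, enabling scale-aware interpretability and supporting theory-informed practical tools for AI safety.
Link: https://arxiv.org/abs/2602.05184
Authors: Lauren Greenspan,David Berman,Aryeh Brill,Ro Jefferson,Artemy Kolchinsky,Jennifer Lin,Andrew Mack,Anindita Maiti,Fernando E. Rosas,Alexander Stapleton,Lucas Teixeira,Dmitry Vaintrob
Affiliation: Unknown
Categories: High Energy Physics - Theory (hep-th); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: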
Abstract:Neural networks organize information according to the hierarchical, multi-scale structure of natural data. Methods to interpret model internals should be similarly scale-aware, explicitly tracking how features compose across resolutions and guaranteeing bounds on the influence of fine-grained structure that is discarded as irrelevant noise. We posit that the renormalisation framework from physics can meet this need by offering technical tools that can overcome limitations of current methods. Moreover, relevant work from adjacent fields has now matured to a point where scattered research threads can be synthesized into practical, theory-informed tools. To combine these threads in an AI safety context, we propose a unifying research agenda, scale-aware interpretability, to develop formal machinery and interpretability tools that have robustness and faithfulness properties supported by statistical physics.
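[AI-149] Total Variation Rates for Riemannian Flow Matching
Quick read: This paper addresses the nonasymptotic total variation (TV) convergence analysis of Riemannian flow matching (RFM) samplers for generative modeling on manifolds. Conventional flow models usually assume Euclidean data, whereas RFM extends the framework to data supported on manifolds by learning a time-dependent tangent vector field whose flow ODE transports a simple prior to the target data distribution. The key to the solution is a differential inequality governing the evolution of the TV distance between two manifold ODE flows, which expresses the time derivative of TV through the divergence of the vector-field mismatch and the score of the reference flow. To control these terms, the authors establish new bounds that account for parallel transport and curvature, and under smoothness assumptions obtain explicit bounds of the form $\mathrm{TV} \le C_{\mathrm{Lip}}\,h + C_{\varepsilon}\,\varepsilon$, where $h$ is the step size and $\varepsilon$ is the learning error. This cleanly separates numerical discretization error from learning error and gives polynomial iteration complexities on the hypersphere $S^d$ and the manifold of symmetric positive-definite matrices $\mathrm{SPD}(n)$.
Link: https://arxiv.org/abs/2602.05174
Authors: Yunrui Guan,Krishnakumar Balasubramanian,Shiqian Ma
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: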
Abstract:Riemannian flow matching (RFM) extends flow-based generative modeling to data supported on manifolds by learning a time-dependent tangent vector field whose flow-ODE transports a simple base distribution to the data law. We develop a nonasymptotic Total Variation (TV) convergence analysis for RFM samplers that use a learned vector field together with Euler discretization on manifolds. Our key technical ingredient is a differential inequality governing the evolution of TV between two manifold ODE flows, which expresses the time-derivative of TV through the divergence of the vector-field mismatch and the score of the reference flow; controlling these terms requires establishing new bounds that explicitly account for parallel transport and curvature. Under smoothness assumptions on the population flow-matching field and either uniform (compact manifolds) or mean-square (Hadamard manifolds) approximation guarantees for the learned field, we obtain explicit bounds of the form $\mathrm{TV} \le C_{\mathrm{Lip}}\,h + C_{\varepsilon}\,\varepsilon$ (with an additional higher-order $\varepsilon^2$ term on compact manifolds), cleanly separating numerical discretization and learning errors. Here, $h$ is the step-size and $\varepsilon$ is the target accuracy. Instantiations yield explicit polynomial iteration complexities on the hypersphere $S^d$, and on the $\mathrm{SPD}(n)$ manifolds under mild moment conditions.
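The samplers analysed here combine a vector field with "Euler discretization on manifolds". As a rough illustration on the unit sphere, the sketch below takes each Euler step along a geodesic through the exponential map. That form of the step, the tangent projection, and all function names are assumptions for illustration, since the abstract does not spell out the scheme.

```python
import math

def project_to_tangent(x, v):
    """Remove the component of v along x so that v is tangent to the unit sphere at x."""
    dot = sum(xi * vi for xi, vi in zip(x, v))
    return [vi - dot * xi for xi, vi in zip(x, v)]

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along tangent v."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm < 1e-12:
        return list(x)
    c, s = math.cos(norm), math.sin(norm)
    return [c * xi + s * vi / norm for xi, vi in zip(x, v)]

def euler_flow(x0, field, step, n_steps):
    """Geodesic Euler scheme: x_{k+1} = Exp_{x_k}(step * field(t_k, x_k))."""
    x, t = list(x0), 0.0
    for _ in range(n_steps):
        v = project_to_tangent(x, field(t, x))
        x = sphere_exp(x, [step * vi for vi in v])
        t += step
    return x

# Toy field: rigid rotation about the z-axis, standing in for a learned vector field.
rotation = lambda t, x: [-x[1], x[0], 0.0]
end = euler_flow([1.0, 0.0, 0.0], rotation, step=math.pi / 200, n_steps=100)
```

Because every update follows a geodesic, the iterates stay on the sphere up to rounding error, so only the step size `step` (the $h$ of the bound) and the quality of the field affect accuracy.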
[AI-150] Large-Ensemble Simulations Reveal Links Between Atmospheric Blocking Frequency and Sea Surface Temperature Variability
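Quick read: This paper addresses the difficulty of separating the influence of sea surface temperature (SST) variability from chaotic internal atmospheric variability on midlatitude atmospheric blocking events. The key solution is to run century-long (1900-2010) large-ensemble simulations with two computationally efficient deep-learning climate models; averaging over the ensemble filters internal atmospheric noise, which isolates the SST-forced component of blocking variability and reveals physically interpretable teleconnections and trends.
Link: https://arxiv.org/abs/2602.05083
Authors: Zilu Meng,Gregory J. Hakim,Wenchang Yang,Gabriel A. Vecchi
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
Comments: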
Abstract:Atmospheric blocking events drive persistent weather extremes in midlatitudes, but isolating the influence of sea surface temperature (SST) from chaotic internal atmospheric variability on these events remains a challenge. We address this challenge using century-long (1900-2010), large-ensemble simulations with two computationally efficient deep-learning general circulation models. We find these models skillfully reproduce the observed blocking climatology, matching or exceeding the performance of a traditional high-resolution model and representative CMIP6 models. Averaging the large ensembles filters internal atmospheric noise to isolate the SST-forced component of blocking variability, yielding substantially higher correlations with reanalysis than for individual ensemble members. We identify robust teleconnections linking Greenland blocking frequency to North Atlantic SST and El Niño-like patterns. Furthermore, SST-forced trends in blocking frequency show a consistent decline in winter over Greenland, and an increase over Europe. These results demonstrate that SST variability exerts a significant and physically interpretable influence on blocking frequency and establishes large ensembles from deep learning models as a powerful tool for separating forced SST signals from internal noise.
[AI-151] AI-Based Detection of In-Treatment Changes from Prostate MR-Linac Images
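Quick read: This paper investigates how routinely acquired longitudinal MR-Linac images can be used to characterize treatment-induced changes during radiotherapy, especially subtle inter-fraction changes over short intervals (2 days on average). The key to the solution is a deep learning model that identifies these changes by predicting the temporal order of paired images. The model is first trained on first-and-last fraction pairs (F1-FL) and then extended to all fraction pairs (All-pairs). Combined with quantitative metrics (accuracy and AUC), qualitative saliency-map analysis, and input ablation experiments, it identifies the main anatomical regions affected by radiotherapy (prostate, bladder, and pubic symphysis), enabling accurate detection and spatial localization of radiation effects.
Link: https://arxiv.org/abs/2602.04983
Authors: Seungbin Park,Peilin Wang,Ryan Pennell,Emily S. Weg,Himanshu Nagar,Timothy McClure,Mert R. Sabuncu,Daniel Margolis,Heejong Kim
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: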
Abstract:Purpose: To investigate whether routinely acquired longitudinal MR-Linac images can be leveraged to characterize treatment-induced changes during radiotherapy, particularly subtle inter-fraction changes over short intervals (average of 2 days). Materials and Methods: This retrospective study included a series of 0.35T MR-Linac images from 761 patients. An artificial intelligence (deep learning) model was used to characterize treatment-induced changes by predicting the temporal order of paired images. The model was first trained with the images from the first and the last fractions (F1-FL), then with all pairs (All-pairs). Model performance was assessed using quantitative metrics (accuracy and AUC), compared to a radiologist’s performance, and using qualitative analyses (saliency map evaluation) to investigate affected anatomical regions. Input ablation experiments were performed to identify the anatomical regions altered by radiotherapy. The radiologist conducted an additional task on partial images reconstructed by saliency map regions, reporting observations as well. Quantitative image analysis was conducted to investigate the results from the model and the radiologist. Results: The F1-FL model yielded near-perfect performance (AUC of 0.99), significantly outperforming the radiologist. The All-pairs model yielded an AUC of 0.97. This performance reflects therapy-induced changes, supported by the performance correlation to fraction intervals, ablation tests and the expert’s interpretation. Primary regions driving the predictions were prostate, bladder, and pubic symphysis. Conclusion: The model accurately predicts temporal order of MR-Linac fractions and detects radiation-induced changes over one or a few days, including prostate and adjacent organ alterations confirmed by experts. This underscores MR-Linac’s potential for advanced image analysis beyond image guidance.
[AI-152] Smart Diagnosis and Early Intervention in PCOS: A Deep Learning Approach to Women's Reproductive Health
Quick read: This paper addresses the inefficiency and subjectivity of manual reading in the early diagnosis of Polycystic Ovary Syndrome (PCOS) and proposes an automated transfer-learning framework for classifying ovarian ultrasound images. The key to the solution is feature extraction with two pretrained models, DenseNet201 and ResNet50, combined with MixUp and CutMix augmentation to improve generalization, reaching a validation accuracy of up to 99.80% on a dataset of 3856 ultrasound images. Explainable AI (XAI) methods such as SHAP, Grad-CAM, and LIME are added to make the model more transparent and to give clinicians visual evidence for its decisions.
Link: https://arxiv.org/abs/2602.04944
Authors: Shayan Abrar,Samura Rahman,Ishrat Jahan Momo,Mahjabin Tasnim Samiha,B. M. Shahria Alam,Mohammad Tahmid Noor,Nishat Tasnim Niloy
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 12 figures. This is the author’s accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore
Abstract:Polycystic Ovary Syndrome (PCOS) is a widespread disorder in women of reproductive age, characterized by a hormonal imbalance, irregular periods, and multiple ovarian cysts. Infertility, metabolic syndrome, and cardiovascular risks are long-term complications that make early detection essential. In this paper, we design a powerful framework based on transfer learning utilizing DenseNet201 and ResNet50 for classifying ovarian ultrasound images. The model was trained on an online dataset containing 3856 ultrasound images of cyst-infected and non-infected patients. Each ultrasound frame was resized to 224x224 pixels and encoded with precise pathological indicators. The MixUp and CutMix augmentation strategies were used to improve generalization, yielding a peak validation accuracy of 99.80% by DenseNet201 and a validation loss of 0.617 with alpha values of 0.25 and 0.4, respectively. We evaluated the model’s interpretability using leading Explainable AI (XAI) approaches such as SHAP, Grad-CAM, and LIME, which present explicit visual reasons for the model’s behaviors and thereby increase its transparency. This study proposes an automated system for medical image diagnosis that may be used effectively and confidently in clinical practice.
[AI-153] Graph–Theoretic Analysis of Phase Optimization Complexity in Variational Wave Functions for Heisenberg Antiferromagnets
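Quick read: This paper aims to systematically characterize the phase structure of wavefunctions in the frustrated spin-1/2 Heisenberg antiferromagnet (HAF), which has not been well understood. The key to the solution is representing the Hilbert space of the HAF as a weighted graph, the Hilbert graph (HG), whose vertices are spin configurations and whose edges are generated by the off-diagonal spin-flip terms of the Heisenberg Hamiltonian, with weights set by products of wavefunction amplitudes. With the amplitudes held fixed and the phases restricted to $\mathbb{Z}_2$ values, the variational energy becomes a classical Ising antiferromagnet on the HG, so wavefunction phase reconstruction reduces to a weighted Max-Cut instance. This establishes a direct link between wavefunction sign structure and combinatorial optimization and shows that phase reconstruction is NP-hard in the worst case.
Link: https://arxiv.org/abs/2602.04943
Authors: Mahmud Ashraf Shamim,Moshiur Rahman,Mohamed Hibat-Allah,Paulo T Araujo
Affiliation: Unknown
Categories: Strongly Correlated Electrons (cond-mat.str-el); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Quantum Physics (quant-ph)
Comments: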
Abstract:Despite extensive study, the phase structure of the wavefunctions in frustrated Heisenberg antiferromagnets (HAF) is not yet systematically characterized. In this work, we represent the Hilbert space of an HAF as a weighted graph, which we term the Hilbert graph (HG), whose vertices are spin configurations and whose edges are generated by off-diagonal spin-flip terms of the Heisenberg Hamiltonian, with weights set by products of wavefunction amplitudes. Holding the amplitudes fixed and restricting phases to $\mathbb{Z}_2$ values, the phase-dependent variational energy can be recast as a classical Ising antiferromagnet on the HG, so that phase reconstruction of the ground state reduces to a weighted Max-Cut instance. This shows that phase reconstruction for the HAF is worst-case NP-hard and provides a direct link between wavefunction sign structure and combinatorial optimization.
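The step from an Ising antiferromagnet to weighted Max-Cut rests on a standard identity: for spins $s_i \in \{-1, +1\}$ and energy $E(s) = \sum_{(i,j)} w_{ij} s_i s_j$, the cut weight equals $(W - E(s))/2$ with $W = \sum_{(i,j)} w_{ij}$, so minimizing the energy maximizes the cut. The brute-force sketch below checks this on a toy triangle. The graph and weights are made up for illustration and are not a Hilbert graph from the paper.

```python
from itertools import product

def ising_energy(weights, spins):
    """E(s) = sum over edges of w_ij * s_i * s_j, with spins in {-1, +1}."""
    return sum(w * spins[i] * spins[j] for (i, j), w in weights.items())

def cut_weight(weights, spins):
    """Total weight of edges whose endpoints carry opposite signs."""
    return sum(w for (i, j), w in weights.items() if spins[i] != spins[j])

def brute_force_ground_state(weights, n):
    """Exhaustive search over all 2^n sign assignments (tiny instances only)."""
    best = min(product((-1, 1), repeat=n), key=lambda s: ising_energy(weights, s))
    return best, ising_energy(weights, best), cut_weight(weights, best)

# Toy frustrated triangle with positive (antiferromagnetic) couplings.
weights = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0}
spins, energy, cut = brute_force_ground_state(weights, 3)
total = sum(weights.values())
```

The exhaustive search costs $2^n$ evaluations, which is only viable for toy sizes. That is consistent with the worst-case hardness the abstract points to.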
Machine Learning
[LG-0] Can vision language models learn intuitive physics from interaction?
Link: https://arxiv.org/abs/2602.06033
Authors: Luca M. Schulze Buschoff,Konstantinos Voudouris,Can Demircan,Eric Schulz
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
[LG-1] AP-OOD: Attention Pooling for Out-of-Distribution Detection ICLR2026
Link: https://arxiv.org/abs/2602.06031
Authors: Claus Hofmann,Christian Huber,Bernhard Lehner,Daniel Klotz,Sepp Hochreiter,Werner Zellinger
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICLR 2026
Abstract:Out-of-distribution (OOD) detection, which maps high-dimensional data into a scalar OOD score, is critical for the reliable deployment of machine learning models. A key challenge in recent research is how to effectively leverage and aggregate token embeddings from language models to obtain the OOD score. In this work, we propose AP-OOD, a novel OOD detection method for natural language that goes beyond simple average-based aggregation by exploiting token-level information. AP-OOD is a semi-supervised approach that flexibly interpolates between unsupervised and supervised settings, enabling the use of limited auxiliary outlier data. Empirically, AP-OOD sets a new state of the art in OOD detection for text: in the unsupervised setting, it reduces the FPR95 (false positive rate at 95% true positives) from 27.84% to 4.67% on XSUM summarization, and from 77.08% to 70.37% on WMT15 En-Fr translation.
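The headline numbers are FPR95 values. The sketch below computes that metric under one common convention, stated here as an assumption: in-distribution (ID) samples are the positives, higher scores mean "more likely OOD", and the threshold is chosen so that 95% of ID samples are accepted. The function name and the toy scores are illustrative.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """FPR at a fixed TPR, with higher scores meaning 'more likely OOD'.

    Assumed convention: ID samples are the positives, so the threshold accepts
    a fraction `tpr` of them, and the FPR is the fraction of OOD samples that
    are (wrongly) accepted as ID at that threshold.
    """
    ranked_id = sorted(id_scores)
    n_accept = math.ceil(tpr * len(ranked_id))   # ID samples that must be accepted
    threshold = ranked_id[n_accept - 1]
    accepted_ood = sum(1 for s in ood_scores if s <= threshold)
    return accepted_ood / len(ood_scores)

id_scores = list(range(100))       # toy OOD scores assigned to ID samples
ood_scores = [90, 94, 95, 120]     # toy OOD scores assigned to OOD samples
fpr95 = fpr_at_tpr(id_scores, ood_scores)
```

Lower is better: a detector that gives every OOD sample a higher score than the accepted ID samples reaches an FPR95 of 0.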
[LG-2] Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference
Link: https://arxiv.org/abs/2602.06029
Authors: Yingke Li,Anjali Parashar,Enlu Zhou,Chuchu Fan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Active inference (AIF) unifies exploration and exploitation by minimizing the Expected Free Energy (EFE), balancing epistemic value (information gain) and pragmatic value (task performance) through a curiosity coefficient. Yet it has been unclear when this balance yields both coherent learning and efficient decision-making: insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement, sufficient curiosity, simultaneously ensures self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). Our analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, thereby connecting AIF to classical Bayesian experimental design and Bayesian optimization within one theoretical framework. We further translate these theories into practical design guidelines for tuning the epistemic-pragmatic trade-off in hybrid learning-optimization problems, validated through real-world experiments.
[LG-3] Mechanisms of AI Protein Folding in ESMFold
Link: https://arxiv.org/abs/2602.06020
Authors: Kevin Lu,Jannik Brinkmann,Stefan Huber,Aaron Mueller,Yonatan Belinkov,David Bau,Chris Wendler
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Our code, data, and results are available at this https URL
Abstract:How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features such as charge flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.
[LG-4] On Computation and Reinforcement Learning
Link: https://arxiv.org/abs/2602.05999
Authors: Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that this architecture achieves (1) stronger performance simply by using more compute, and (2) stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.
[LG-5] Orthogonal Self-Attention
Link: https://arxiv.org/abs/2602.05996
Authors: Leo Zhang, James Martens
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Preprint
Abstract:Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians. In this work, we design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues with SSA, in order to allow for (non-causal) Transformers without skip connections and normalisation layers to be more easily trained. In particular, OSA parametrises the attention matrix to be orthogonal via mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented by exploiting the low-rank structure of our query-key values, resulting in the computational complexity and memory cost of OSA scaling linearly with sequence length. Furthermore, we derive an initialisation scheme that we prove ensures that the Jacobian of OSA is well-conditioned.
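The orthogonality claim in the abstract above can be checked numerically. The sketch below is not the authors' implementation. It assumes the skew-symmetric matrix is formed as S = QK^T - KQ^T (one natural choice; the abstract does not give the exact form), it computes a dense matrix exponential through an eigendecomposition rather than the linear-cost low-rank route the paper describes, and the sizes and function names are made up for illustration.

```python
import numpy as np

def skew_from_qk(Q, K):
    """Hypothetical construction: S = Q K^T - K Q^T satisfies S^T = -S."""
    return Q @ K.T - K @ Q.T

def expm_skew(S):
    """Matrix exponential of a real skew-symmetric matrix.

    1j * S is Hermitian, so eigh gives 1j * S = V diag(w) V^H with real w.
    Hence S = V diag(-1j * w) V^H and exp(S) = V diag(exp(-1j * w)) V^H.
    """
    w, V = np.linalg.eigh(1j * S)
    # Scaling the columns of V by exp(-1j * w) is the same as V @ diag(...).
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

rng = np.random.default_rng(0)
n, d = 6, 2  # sequence length and head dimension (illustrative values)
Q = rng.normal(size=(n, d)) / np.sqrt(d)
K = rng.normal(size=(n, d)) / np.sqrt(d)

S = skew_from_qk(Q, K)   # rank at most 2 * d, whatever the sequence length
A = expm_skew(S)         # dense n x n orthogonal "attention" matrix

orth_error = np.abs(A @ A.T - np.eye(n)).max()
print(f"max |A A^T - I| = {orth_error:.2e}")
```

Because S has rank at most 2d, its exponential differs from the identity only on a 2d-dimensional subspace, which is the kind of structure a linear-in-sequence-length implementation could exploit.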
[LG-6] Layer-wise LoRA fine-tuning: a similarity metric approach
Link: https://arxiv.org/abs/2602.05988
Authors: Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Subjects: Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: this https URL
[LG-7] A Hybrid Data-Driven Algorithm for Real-Time Friction Force Estimation in Hydraulic Cylinders
Link: https://arxiv.org/abs/2602.05967
Authors: Mohamad Amin Jamshidi, Mehrbod Zarifi, Zolfa Anvari, Hamed Ghafarirad, Mohammad Zareinejad
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Published in: 2025 33rd International Conference on Electrical Engineering (ICEE), Publisher IEEE
Abstract:Hydraulic systems are widely utilized in industrial applications due to their high force generation, precise control, and ability to function in harsh environments. Hydraulic cylinders, as actuators in these systems, apply force and position through the displacement of hydraulic fluid, but their operation is significantly influenced by friction force. Achieving precision in hydraulic cylinders requires an accurate friction model under various operating conditions. Existing analytical models, often derived from experimental tests, necessitate the identification or estimation of influencing factors but are limited in adaptability and computational efficiency. This research introduces a data-driven, hybrid algorithm based on Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation. The algorithm effectively combines feature detection and estimation processes using training data acquired from an experimental hydraulic test setup. It achieves a consistent and stable model error of less than 10% across diverse operating conditions and external load variations, ensuring robust performance in complex situations. The computational cost of the algorithm is 1.51 milliseconds per estimation, making it suitable for real-time applications. The proposed method addresses the limitations of analytical models by delivering high precision and computational efficiency. The algorithm’s performance is validated through detailed analysis and experimental results, including direct comparisons with the LuGre model. The comparison highlights that while the LuGre model offers a theoretical foundation for friction modeling, its performance is limited by its inability to dynamically adjust to varying operational conditions of the hydraulic cylinder, further emphasizing the advantages of the proposed hybrid approach in real-time applications.
[LG-8] Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces
Link: https://arxiv.org/abs/2602.05961
Authors: Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Nikolay Malkin
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Code: this https URL
Abstract:Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.
[LG-9] Breaking Symmetry Bottlenecks in GNN Readouts
Link: https://arxiv.org/abs/2602.05950
Authors: Mouad Talhi, Arne Wolf, Anthea Monod
Subjects: Machine Learning (cs.LG)
Comments: 23 pages
Abstract:Graph neural networks (GNNs) are widely used for learning on structured data, yet their ability to distinguish non-isomorphic graphs is fundamentally limited. These limitations are usually attributed to message passing; in this work we show that an independent bottleneck arises at the readout stage. Using finite-dimensional representation theory, we prove that all linear permutation-invariant readouts, including sum and mean pooling, factor through the Reynolds (group-averaging) operator and therefore project node embeddings onto the fixed subspace of the permutation action, erasing all non-trivial symmetry-aware components regardless of encoder expressivity. This yields both a new expressivity barrier and an interpretable characterization of what global pooling preserves or destroys. To overcome this collapse, we introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics, preserving permutation invariance while retaining information provably invisible to averaging. Empirically, swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks, demonstrating that readout design is a decisive and under-appreciated factor in GNN expressivity.
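A toy illustration of the readout bottleneck described above, assuming scalar node embeddings for simplicity. It is not the paper's projector-based readout: it only shows that two node sets with the same average are indistinguishable to mean pooling, while a nonlinear permutation-invariant statistic (here the spread around the mean, chosen purely for illustration) separates them. All function and variable names are hypothetical.

```python
from statistics import fmean

def mean_readout(h):
    """Linear permutation-invariant readout: the average node embedding."""
    return fmean(h)

def spread_readout(h):
    """Nonlinear permutation-invariant statistic: mean squared deviation
    of the node embeddings from their average."""
    mu = fmean(h)
    return fmean([(x - mu) ** 2 for x in h])

# Two hypothetical graphs with four nodes each and scalar embeddings.
graph_a = [1.0, 1.0, 1.0, 1.0]
graph_b = [0.0, 2.0, 0.0, 2.0]

# Mean pooling collapses both graphs to the same value ...
same_mean = mean_readout(graph_a) == mean_readout(graph_b)

# ... while the nonlinear statistic tells them apart.
differs = spread_readout(graph_a) != spread_readout(graph_b)

# Both readouts ignore node ordering, as an invariant readout must.
shuffled_b = [2.0, 0.0, 2.0, 0.0]
invariant = spread_readout(graph_b) == spread_readout(shuffled_b)

print(same_mean, differs, invariant)
```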
[LG-10] f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Link: https://arxiv.org/abs/2602.05946
Authors: Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose f-Group Relative Policy Optimization (f-GRPO), a class of on-policy reinforcement learning objectives, and f-Hybrid Alignment Loss (f-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of f-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
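For reference, the variational representation of an f-divergence mentioned in the abstract above is usually written as follows. This is the standard textbook form, with $f$ convex and $f(1) = 0$; the abstract does not say which variant the paper builds on, so treat it as background rather than as the paper's objective.

```latex
D_f(P \,\|\, Q)
  = \mathbb{E}_{x \sim Q}\!\left[ f\!\left( \frac{dP}{dQ}(x) \right) \right]
  = \sup_{T} \Big\{ \mathbb{E}_{x \sim P}\big[ T(x) \big]
      - \mathbb{E}_{x \sim Q}\big[ f^{*}(T(x)) \big] \Big\},
\qquad
f^{*}(t) = \sup_{u \in \operatorname{dom} f} \{\, u t - f(u) \,\}.
```

As a worked instance, the KL divergence corresponds to $f(u) = u \log u$, whose convex conjugate is $f^{*}(t) = e^{t-1}$: maximizing $ut - u \log u$ over $u$ gives $u = e^{t-1}$, and substituting back yields $e^{t-1}$.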
[LG-11] Orthogonal Model Merging
Link: https://arxiv.org/abs/2602.05943
Authors: Sihan Yang, Kexuan Shi, Weiyang Liu
Subjects: Machine Learning (cs.LG)
Comments: Technical report (18 pages, 9 figures, project page: this https URL)
Abstract:Merging finetuned Large Language Models (LLMs) has become increasingly important for integrating diverse capabilities into a single unified model. However, prevailing model merging methods rely on linear arithmetic in Euclidean space, which often destroys the intrinsic geometric properties of pretrained weights, such as hyperspherical energy. To address this, we propose Orthogonal Model Merging (OrthoMerge), a method that performs merging operations on the Riemannian manifold formed by the orthogonal group to preserve the geometric structure of the model’s weights. By mapping task-specific orthogonal matrices learned by Orthogonal Finetuning (OFT) to the Lie algebra, OrthoMerge enables a principled yet efficient integration that takes into account both the direction and intensity of adaptations. In addition to directly leveraging orthogonal matrices obtained by OFT, we further extend this approach to general models finetuned with non-OFT methods (i.e., low-rank finetuning, full finetuning) via an Orthogonal-Residual Decoupling strategy. This technique extracts the orthogonal components of expert models by solving the orthogonal Procrustes problem, which are then merged on the manifold of the orthogonal group, while the remaining linear residuals are processed through standard additive merging. Extensive empirical results demonstrate the effectiveness of OrthoMerge in mitigating catastrophic forgetting and maintaining model performance across diverse tasks.
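The orthogonal Procrustes step named in the abstract above has a closed-form SVD solution, sketched below. The decomposition W_expert ≈ R · W_base + residual, the left-multiplication convention, and all names are assumptions made for illustration; the abstract does not specify these details, and the manifold merging of the extracted matrices is not shown.

```python
import numpy as np

def orthogonal_procrustes(W_base, W_expert):
    """Return the orthogonal R minimising ||R @ W_base - W_expert||_F.

    With M = W_expert @ W_base.T = U diag(s) V^T, the minimiser is R = U V^T.
    """
    U, _, Vt = np.linalg.svd(W_expert @ W_base.T)
    return U @ Vt

rng = np.random.default_rng(0)
d = 4  # illustrative weight size

W_base = rng.normal(size=(d, d))                    # stand-in for pretrained weights
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
E = 0.01 * rng.normal(size=(d, d))                  # small non-orthogonal change
W_expert = R_true @ W_base + E                      # stand-in for finetuned weights

R = orthogonal_procrustes(W_base, W_expert)         # orthogonal component
residual = W_expert - R @ W_base                    # remaining linear residual

print("orthogonality error:", np.abs(R @ R.T - np.eye(d)).max())
print("residual norm:", np.linalg.norm(residual))
```

Since R_true is itself a feasible candidate, the fitted residual can never be larger in Frobenius norm than the perturbation E that was added.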
[LG-12] Dimensionality Reduction on Riemannian Manifolds in Data Analysis
Link: https://arxiv.org/abs/2602.05936
Authors: Alaa El Ichi, Khalide Jbilou
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we investigate Riemannian-geometry-based dimensionality reduction methods that respect the underlying manifold structure of the data. In particular, we focus on Principal Geodesic Analysis (PGA) as a nonlinear generalization of PCA for manifold-valued data, and extend discriminant analysis through Riemannian adaptations of other known dimensionality reduction methods. These approaches exploit geodesic distances, tangent space representations, and intrinsic statistical measures to achieve more faithful low-dimensional embeddings. We also discuss related manifold learning techniques and highlight their theoretical foundations and practical advantages. Experimental results on representative datasets demonstrate that Riemannian methods provide improved representation quality and classification performance compared to their Euclidean counterparts, especially for data constrained to curved spaces such as hyperspheres and symmetric positive definite manifolds. This study underscores the importance of geometry-aware dimensionality reduction in modern machine learning and data science applications.
[LG-13] Tuning Out-of-Distribution (OOD) Detectors Without Given OOD Data
Link: https://arxiv.org/abs/2602.05935
Authors: Sudeepta Mondal, Xinyi Mary Xie, Ruxiao Duan, Alex Wong, Ganesh Sundaramoorthi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Existing out-of-distribution (OOD) detectors are often tuned by a separate dataset deemed OOD with respect to the training distribution of a neural network (NN). OOD detectors process the activations of NN layers and score the output, where parameters of the detectors are determined by fitting to an in-distribution (training) set and the aforementioned dataset chosen in an ad hoc manner. At detector training time, this ad hoc dataset may be unavailable or difficult to obtain, and even when it is available, it may not be representative of actual OOD data, which is often “unknown unknowns.” Current benchmarks may specify some left-out set from test OOD sets. We show that there can be significant variance in performance of detectors based on the ad hoc dataset chosen in current literature, and thus even if such a dataset can be collected, the performance of the detector may be highly dependent on the choice. In this paper, we introduce and formalize the often neglected problem of tuning OOD detectors without a given “OOD” dataset. To this end, we present strong baselines as an attempt to approach this problem. Furthermore, we propose a new generic approach to OOD detector tuning that does not require any extra data other than those used to train the NN. We show that our approach improves over baseline methods consistently across higher-parameter OOD detector families, while being comparable across lower-parameter families.
[LG-14] Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Link: https://arxiv.org/abs/2602.05933
Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL-$\chi^2$ regularizer. This additional $\chi^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at this https URL.
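To make the approximation in the abstract above concrete, the sketch below compares the exact log-partition term with the mean reward on a tiny discrete action space. It assumes the standard KL-regularized subproblem with temperature tau, whose closed-form solution is pi_new(a) ∝ pi_old(a) · exp(r(a) / tau) with Z = sum_a pi_old(a) · exp(r(a) / tau). The temperature parametrization, the toy numbers, and the function names are assumptions for illustration; the regression in log-policy space is not shown.

```python
import math

def log_partition(pi_old, rewards, tau):
    """Exact log Z = log sum_a pi_old(a) * exp(r(a) / tau)."""
    return math.log(sum(p * math.exp(r / tau) for p, r in zip(pi_old, rewards)))

def mean_reward(pi_old, rewards):
    """Mean reward under the sampling policy pi_old."""
    return sum(p * r for p, r in zip(pi_old, rewards))

# Hypothetical three-action problem with a binary (verifiable-style) reward.
pi_old = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 0.0]

gaps = {}
for tau in (0.5, 2.0, 10.0):
    exact = tau * log_partition(pi_old, rewards, tau)   # tau * log Z
    approx = mean_reward(pi_old, rewards)               # mean-reward surrogate
    gaps[tau] = exact - approx
    print(f"tau={tau:>4}: tau*logZ={exact:.4f}  mean reward={approx:.4f}")
```

By Jensen's inequality tau · log Z is never smaller than the mean reward, and in this toy setting the gap shrinks as tau grows, so the substitution is closest to exact when the regularization is strong.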
[LG-15] Chunky Post-Training: Data Driven Failures of Generalization
Link: https://arxiv.org/abs/2602.05910
Authors: Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.
[LG-16] ContextBench: A Benchmark for Context Retrieval in Coding Agents
Link: https://arxiv.org/abs/2602.05892
Authors: Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Sarro Federica, Zhaoyang Chu, He Ye
Subjects: Machine Learning (cs.LG)
Comments: 36 pages, 6 figures, 4 tables
Abstract:LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks. Data and code are available at: this https URL.
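The context recall and precision mentioned above can be pictured with plain set arithmetic. The sketch below uses the usual set-based definitions and hypothetical file-span identifiers; the benchmark's actual granularity, matching rules, and efficiency metric are not given in the abstract and may differ.

```python
def context_metrics(retrieved, gold):
    """Set-based recall and precision of retrieved context against gold context.

    recall    = |retrieved ∩ gold| / |gold|
    precision = |retrieved ∩ gold| / |retrieved|
    """
    retrieved, gold = set(retrieved), set(gold)
    hits = retrieved & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical identifiers of code spans; not taken from the benchmark.
gold = {"parser.py:10-40", "lexer.py:5-20"}
retrieved = {"parser.py:10-40", "lexer.py:5-20", "utils.py:1-90", "cli.py:1-30"}

recall, precision = context_metrics(retrieved, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

An agent that reads broadly, as in this made-up example, reaches full recall at the cost of precision, which is the pattern the abstract reports for LLMs.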
[LG-17] Escaping Local Minima Provably in Non-convex Matrix Sensing: A Deterministic Framework via Simulated Lifting
Link: https://arxiv.org/abs/2602.05887
Authors: Tianqi Shen, Jinji Yang, Junze He, Kunhan Gao, Ziye Ma
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 48 pages, 10 figures, 5 tables. Submitted to Mathematical Programming
Abstract:Low-rank matrix sensing is a fundamental yet challenging nonconvex problem whose optimization landscape typically contains numerous spurious local minima, making it difficult for gradient-based optimizers to converge to the global optimum. Recent work has shown that over-parameterization via tensor lifting can convert such local minima into strict saddle points, an insight that also partially explains why massive scaling can improve generalization and performance in modern machine learning. Motivated by this observation, we propose a Simulated Oracle Direction (SOD) escape mechanism that simulates the landscape and escape direction of the over-parametrized space, without resorting to actually lifting the problem, since that would be computationally intractable. In essence, we design a mathematical framework to project over-parametrized escape directions onto the original parameter space to guarantee a strict decrease of objective value from existing local minima. To the best of our knowledge, this represents the first deterministic framework that can escape spurious local minima with guarantees, especially without using random perturbations or heuristic estimates. Numerical experiments demonstrate that our framework reliably escapes local minima and facilitates convergence to global optima, while incurring minimal computational cost when compared to explicit tensor over-parameterization. We believe this framework has non-trivial implications for nonconvex optimization beyond matrix sensing, by showcasing how simulated over-parameterization can be leveraged to tame challenging optimization landscapes.
[LG-18] Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks
Link: https://arxiv.org/abs/2602.05873
Authors: Minyoung Kim
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Bayesian (deep) neural networks (BNNs) are often more attractive than the mainstream point-estimate vanilla deep learning in various aspects including uncertainty quantification, robustness to noise, resistance to overfitting, and more. Variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is a dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Although there have been quite a few score-based variational inference methods proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method where the learning objective combines the score matching loss and the proximal penalty term in iterations, which helps our method avoid reparametrized sampling, and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks including Vision Transformers, and allows for richer variational density families. On several benchmarks including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.
[LG-19] CFRecs: Counterfactual Recommendations on Real Estate User Listing Interaction Graphs
Link: https://arxiv.org/abs/2602.05861
Authors: Seyedmasoud Mousavi, Ruomeng Xu, Xiaojing Zhu
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Graph-structured data is ubiquitous and powerful in representing complex relationships in many online platforms. While graph neural networks (GNNs) are widely used to learn from such data, counterfactual graph learning has emerged as a promising approach to improve model interpretability. Counterfactual explanation research focuses on identifying a counterfactual graph that is similar to the original but leads to different predictions. These explanations optimize two objectives simultaneously: the sparsity of changes in the counterfactual graph and the validity of its predictions. Building on these qualitative optimization goals, this paper introduces CFRecs, a novel framework that transforms counterfactual explanations into actionable insights. CFRecs employs a two-stage architecture consisting of a graph neural network (GNN) and a graph variational auto-encoder (Graph-VAE) to strategically propose minimal yet high-impact changes in graph structure and node attributes to drive desirable outcomes in recommender systems. We apply CFRecs to Zillow’s graph-structured data to deliver actionable recommendations for both home buyers and sellers with the goal of helping them navigate the competitive housing market and achieve their homeownership goals. Experimental results on Zillow’s user-listing interaction data demonstrate the effectiveness of CFRecs, which also provides a fresh perspective on recommendations using counterfactual reasoning in graphs.
[LG-20] A Hybrid Autoencoder for Robust Heightmap Generation from Fused Lidar and Depth Data for Humanoid Robot Locomotion
Link: https://arxiv.org/abs/2602.05855
Authors: Dennis Bank, Joost Cordes, Thomas Seel, Simon F.G. Ehlers
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Reliable terrain perception is a critical prerequisite for the deployment of humanoid robots in unstructured, human-centric environments. While traditional systems often rely on manually engineered, single-sensor pipelines, this paper presents a learning-based framework that uses an intermediate, robot-centric heightmap representation. A hybrid Encoder-Decoder Structure (EDS) is introduced, utilizing a Convolutional Neural Network (CNN) for spatial feature extraction fused with a Gated Recurrent Unit (GRU) core for temporal consistency. The architecture integrates multimodal data from an Intel RealSense depth camera, a LIVOX MID-360 LiDAR processed via efficient spherical projection, and an onboard IMU. Quantitative results demonstrate that multimodal fusion improves reconstruction accuracy by 7.2% over depth-only and 9.9% over LiDAR-only configurations. Furthermore, the integration of a 3.2 s temporal context reduces mapping drift.
[LG-21] Exact Recovery in the Data Block Model
Link: https://arxiv.org/abs/2602.05852
Authors: Amir R. Asadi, Akbar Davoodi, Ramin Javadi, Farzad Parvaresh
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments: 35 pages
Abstract:Community detection in networks is a fundamental problem in machine learning and statistical inference, with applications in social networks, biological systems, and communication networks. The stochastic block model (SBM) serves as a canonical framework for studying community structure, and exact recovery, identifying the true communities with high probability, is a central theoretical question. While classical results characterize the phase transition for exact recovery based solely on graph connectivity, many real-world networks contain additional data, such as node attributes or labels. In this work, we study exact recovery in the Data Block Model (DBM), an SBM augmented with node-associated data, as formalized by Asadi, Abbe, and Verdú (2017). We introduce the Chernoff–TV divergence and use it to characterize a sharp exact recovery threshold for the DBM. We further provide an efficient algorithm that achieves this threshold, along with a matching converse result showing impossibility below the threshold. Finally, simulations validate our findings and demonstrate the benefits of incorporating vertex data as side information in community detection.
[LG-22] Visualizing the loss landscapes of physics-informed neural networks
Link: https://arxiv.org/abs/2602.05849
Authors: Conor Rowan, Finn Murphy-Blanchard
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Training a neural network requires navigating a high-dimensional, non-convex loss surface to find parameters that minimize this loss. In many ways, it is surprising that optimizers such as stochastic gradient descent and ADAM can reliably locate minima which perform well on both the training and test data. To understand the success of training, a “loss landscape” community has emerged to study the geometry of the loss function and the dynamics of optimization, often using visualization techniques. However, these loss landscape studies have mostly been limited to machine learning for image classification. In the newer field of physics-informed machine learning, little work has been conducted to visualize the landscapes of losses defined not by regression to large data sets, but by differential operators acting on state fields discretized by neural networks. In this work, we provide a comprehensive review of the loss landscape literature, as well as a discussion of the few existing physics-informed works which investigate the loss landscape. We then use a number of the techniques we survey to empirically investigate the landscapes defined by the Deep Ritz and squared residual forms of the physics loss function. We find that the loss landscapes of physics-informed neural networks have many of the same properties as the data-driven classification problems studied in the literature. Unexpectedly, we find that the two formulations of the physics loss often give rise to similar landscapes, which appear smooth, well-conditioned, and convex in the vicinity of the solution. The purpose of this work is to introduce the loss landscape perspective to the scientific machine learning community, compare the Deep Ritz and the strong form losses, and to challenge prevailing intuitions about the complexity of the loss landscapes of physics-informed networks.
[LG-23] Synthesizing Realistic Test Data without Breaking Privacy
Link: https://arxiv.org/abs/2602.05833
Authors: Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining “good samples” with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
[LG-24] Interpreting Manifolds and Graph Neural Embeddings from Internet of Things Traffic Flows
Link: https://arxiv.org/abs/2602.05817
Authors: Enrique Feito-Casares, Francisco M. Melgarejo-Meseguer, Elena Casiraghi, Giorgio Valentini, José-Luis Rojo-Álvarez
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments:
Abstract:The rapid expansion of Internet of Things (IoT) ecosystems has led to increasingly complex and heterogeneous network topologies. Traditional network monitoring and visualization tools rely on aggregated metrics or static representations, which fail to capture the evolving relationships and structural dependencies between devices. Although Graph Neural Networks (GNNs) offer a powerful way to learn from relational data, their internal representations often remain opaque and difficult to interpret for security-critical operations. Consequently, this work introduces an interpretable pipeline that generates directly visualizable low-dimensional representations by mapping high-dimensional embeddings onto a latent manifold. This projection enables the interpretable monitoring and interoperability of evolving network states, while integrated feature attribution techniques decode the specific characteristics shaping the manifold structure. The framework achieves a classification F1-score of 0.830 for intrusion detection while also highlighting phenomena such as concept drift. Ultimately, the presented approach bridges the gap between high-dimensional GNN embeddings and human-understandable network behavior, offering new insights for network administrators and security analysts.
[LG-25] Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
Link: https://arxiv.org/abs/2602.05813
Authors: Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 26 pages, 6 figures, 4 tables
Abstract:We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at this https URL.
[LG-26] Principled Confidence Estimation for Deep Computed Tomography
Link: https://arxiv.org/abs/2602.05812
Authors: Matteo Gätzner, Johannes Kirschner
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:We present a principled framework for confidence estimation in computed tomography (CT) reconstruction. Based on the sequential likelihood mixing framework (Kirschner et al., 2025), we establish confidence regions with theoretical coverage guarantees for deep-learning-based CT reconstructions. We consider a realistic forward model following the Beer-Lambert law, i.e., a log-linear forward model with Poisson noise, closely reflecting clinical and scientific imaging conditions. The framework is general and applies to both classical algorithms and deep learning reconstruction methods, including U-Nets, U-Net ensembles, and generative Diffusion models. Empirically, we demonstrate that deep reconstruction methods yield substantially tighter confidence regions than classical reconstructions, without sacrificing theoretical coverage guarantees. Our approach allows the detection of hallucinations in reconstructed images and provides interpretable visualizations of confidence regions. This establishes deep models not only as powerful estimators, but also as reliable tools for uncertainty-aware medical imaging.
[LG-27] Bifrost: Steering Strategic Trajectories to Bridge Contextual Gaps for Self-Improving Agents
Link: https://arxiv.org/abs/2602.05810
Authors: Quan M. Tran, Zhuo Huang, Wenbin Zhang, Bo Han, Koji Yatani, Masashi Sugiyama, Tongliang Liu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Autonomous agents excel in self-improvement through reflection and iterative refinement, which reuse successful task trajectories as in-context examples to assist subsequent reasoning. However, shifting across tasks often introduces a context mismatch. Hence, existing approaches either discard the trajectories or manipulate them using heuristics, leading to a non-negligible fine-tuning cost or unguaranteed performance. To bridge this gap, we reveal a context-trajectory correlation, where shifts of context are highly parallel with shifts of trajectory. Based on this finding, we propose BrIdge contextual gap FoR imprOvised trajectory STeering (Bifrost), a training-free method that leverages context differences to precisely guide the adaptation of previously solved trajectories towards the target task, mitigating the misalignment caused by context shifts. Our trajectory adaptation is conducted at the representation level using agent hidden states, ensuring trajectory transformation accurately aligns with the target context in a shared space. Across diverse benchmarks, Bifrost consistently outperforms existing trajectory reuse and finetuned self-improvement methods, demonstrating that agents can effectively leverage past experiences despite substantial context shifts.
[LG-28] Classification Under Local Differential Privacy with Model Reversal and Model Averaging
Link: https://arxiv.org/abs/2602.05797
Authors: Caihong Qin, Yang Bai
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:
Abstract:Local differential privacy (LDP) has become a central topic in data privacy research, offering strong privacy guarantees by perturbing user data at the source and removing the need for a trusted curator. However, the noise introduced by LDP often significantly reduces data utility. To address this issue, we reinterpret private learning under LDP as a transfer learning problem, where the noisy data serve as the source domain and the unobserved clean data as the target. We propose novel techniques specifically designed for LDP to improve classification performance without compromising privacy: (1) a noised binary feedback-based evaluation mechanism for estimating dataset utility; (2) model reversal, which salvages underperforming classifiers by inverting their decision boundaries; and (3) model averaging, which assigns weights to multiple reversed classifiers based on their estimated utility. We provide theoretical excess risk bounds under LDP and demonstrate how our methods reduce this risk. Empirical results on both simulated and real-world datasets show substantial improvements in classification accuracy.
[LG-29] Price of universality in vector quantization is at most 0.11 bit
Link: https://arxiv.org/abs/2602.05790
Authors: Alina Harbuzova, Or Ordentlich, Yury Polyanskiy
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 41 pages, 1 figure
Abstract:Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat{W}$ in place of the true $W$ ("weight-only quantization"). Information theory demonstrates that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires a careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as "waterfilling allocation"). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension. Such a universal codebook would be an ideal candidate for the low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive. Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of a sphere simultaneously with respect to all Hilbert norms.
[LG-30] Selecting Hyperparameters for Tree-Boosting
Link: https://arxiv.org/abs/2602.05786
Authors: Floris Jan Koster, Fabio Sigrist
Subjects: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy is critically dependent on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting, including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search, using 59 regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) a relatively large number of trials (more than 100) is required for accurate tuning, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than others, and (iv) for regression tasks, choosing the number of boosting iterations using early stopping yields more accurate results than including it in the search space.
[LG-31] Distributional Reinforcement Learning with Diffusion Bridge Critics
Link: https://arxiv.org/abs/2602.05783
Authors: Shutong Ding, Yimiao Zhou, Ke Hu, Mokai Pan, Shan Zhong, Yanwei Fu, Jingya Wang, Ye Shi
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results in a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving the diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, it has been confirmed that the critic is more appropriately depicted with a distributional model. Motivated by these points, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value. This allows us to accurately capture the value distribution and prevents it from collapsing into a trivial Gaussian distribution owing to the strong distribution-matching capability of the diffusion bridge. Moreover, we further derive an analytic integral formula to address discretization errors in DBC, which is essential in value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC compared with previous distributional critic models.
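To illustrate what modeling the inverse CDF provides, the sketch below recovers a scalar value estimate from a quantile function by integrating it over (0, 1), using the identity E[Z] = ∫₀¹ F⁻¹(τ) dτ. The Gaussian return distribution and the function name are hypothetical stand-ins; the learned diffusion-bridge critic and the paper's analytic integral formula are not reproduced here.

```python
from statistics import NormalDist


def expected_value_from_quantiles(inv_cdf, n_points=1000):
    """Approximate E[Z] = integral of F^{-1}(tau) over (0, 1) with the midpoint rule."""
    total = 0.0
    for i in range(n_points):
        tau = (i + 0.5) / n_points  # midpoints avoid tau = 0 and tau = 1
        total += inv_cdf(tau)
    return total / n_points


# Hypothetical return distribution for one state-action pair (an assumption).
z = NormalDist(mu=1.5, sigma=2.0)
q_estimate = expected_value_from_quantiles(z.inv_cdf)
print(round(q_estimate, 3))
```

A numerical rule like this one introduces discretization error for skewed or heavy-tailed quantile functions, which is the kind of error the abstract says its analytic formula is meant to address.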
[LG-32] How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs
Link: https://arxiv.org/abs/2602.05779
Authors: Emily Dent, Jared Tanner
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:The intermediate layers of deep networks can be characterised as a Gaussian process; in particular, the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of the Gaussian process. Here we show that the under-utilised chosen variance of the Gaussian process is important in the training of deep networks with sparsity-inducing activations, such as a shifted and clipped ReLU, $\text{CReLU}_{\tau,m}(x)=\min(\max(x-\tau,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism to reduce the energy consumption of machine learning models involving fully connected layers.
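The activation defined in this abstract is simple enough to write down directly. The sketch below implements CReLU with shift tau and clip m, and, assuming zero-mean Gaussian pre-activations (in line with the Gaussian process view, though the zero mean is an assumption here), computes the fraction of outputs that are exactly zero. The helper names are hypothetical.

```python
from statistics import NormalDist


def crelu(x, tau, m):
    """Shifted and clipped ReLU: min(max(x - tau, 0), m)."""
    return min(max(x - tau, 0.0), m)


def activation_sparsity(tau, variance):
    """P(CReLU output == 0) when the pre-activation is N(0, variance)."""
    return NormalDist(mu=0.0, sigma=variance ** 0.5).cdf(tau)


def shift_for_sparsity(target_sparsity, variance):
    """Shift tau that zeroes out a target fraction of N(0, variance) inputs."""
    return NormalDist(mu=0.0, sigma=variance ** 0.5).inv_cdf(target_sparsity)


tau_90 = shift_for_sparsity(0.90, variance=1.0)
print(crelu(0.5, tau=1.0, m=2.0), crelu(2.0, tau=1.0, m=2.0), crelu(10.0, tau=1.0, m=2.0))
print(round(tau_90, 3), round(activation_sparsity(tau_90, 1.0), 3))
```

Under this assumption the sparsity depends only on the ratio of tau to the standard deviation, so the variance and the shift have to be chosen together to hit a given sparsity level.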
[LG-33] Cross-Domain Offline Policy Adaptation via Selective Transition Correction
Link: https://arxiv.org/abs/2602.05776
Authors: Mengbei Yan, Jiafei Lyu, Shengjie Sun, Zhongjian Qiao, Jingwen Yang, Zichuan Lin, Deheng Ye, Xiu Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:It remains a critical challenge to adapt policies across domains with mismatched dynamics in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from another similar source domain can be accessed to enhance policy learning upon a target domain dataset. Directly merging the two datasets may lead to suboptimal performance due to potential dynamics mismatches. Existing approaches typically mitigate this issue through source domain transition filtering or reward modification, which, however, may lead to insufficient exploitation of the valuable source domain data. Instead, we propose to modify the source domain data into the target domain data. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly achieving alignment with the target dynamics. Since limited data may result in inaccurate model training, we further employ a forward dynamics model to retain corrected samples that better match the target dynamics than the original transitions. Consequently, we propose the Selective Transition Correction (STC) algorithm, which enables reliable usage of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC achieves superior performance against existing baselines.
[LG-34] How to Achieve the Intended Aim of Deep Clustering Now without Deep Learning
Link: https://arxiv.org/abs/2602.05749
Authors: Kai Ming Ting, Wei-Jie Xu, Hang Zhang
Subjects: Machine Learning (cs.LG)
Comments: Work in progress
Abstract:Deep clustering (DC) is often quoted to have a key advantage over k-means clustering. Yet, this advantage is often demonstrated using image datasets only, and it is unclear whether it addresses the fundamental limitations of k-means clustering. Deep Embedded Clustering (DEC) learns a latent representation via an autoencoder and performs clustering based on a k-means-like procedure, while the optimization is conducted in an end-to-end manner. This paper investigates whether the deep-learned representation has enabled DEC to overcome the known fundamental limitations of k-means clustering, i.e., its inability to discover clusters of arbitrary shapes, varied sizes and densities. Our investigations on DEC have a wider implication on deep clustering methods in general. Notably, none of these methods exploit the underlying data distribution. We uncover that a non-deep learning approach achieves the intended aim of deep clustering by making use of distributional information of clusters in a dataset to effectively address these fundamental limitations.
[LG-35] Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
Link: https://arxiv.org/abs/2602.05725
Authors: Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:
Abstract:Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon’s optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
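The "matrix sign of the gradient" mentioned in this abstract can be made concrete. As a sketch, assuming a gradient matrix $G_t$ with compact singular value decomposition $G_t = U \Sigma V^\top$ and a step size $\eta$ (notation introduced here for illustration, not taken from the paper), the idealized update reads:

```latex
\begin{align}
  \operatorname{msign}(G_t) &= U V^\top, \\
  W_{t+1} &= W_t - \eta\,\operatorname{msign}(G_t).
\end{align}
```

Every nonzero singular value of the gradient is replaced by 1, so all singular directions move at the same rate, which fits the abstract's description of more uniform progress across frequency components. Practical implementations of Muon typically add momentum and approximate the matrix sign iteratively; those details are outside this sketch.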
[LG-36] Projected Boosting with Fairness Constraints: Quantifying the Cost of Fair Training Distributions
Link: https://arxiv.org/abs/2602.05713
Authors: Amir Asiaee, Kaveh Aryan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Boosting algorithms enjoy strong theoretical guarantees: when weak learners maintain positive edge, AdaBoost achieves geometric decrease of exponential loss. We study how to incorporate group fairness constraints into boosting while preserving analyzable training dynamics. Our approach, FairBoost, projects the ensemble-induced exponential-weights distribution onto a convex set of distributions satisfying fairness constraints (as a reweighting surrogate), then trains weak learners on this fair distribution. The key theoretical insight is that projecting the training distribution reduces the effective edge of weak learners by a quantity controlled by the KL-divergence of the projection. We prove an exponential-loss bound where the convergence rate depends on weak learner edge minus a “fairness cost” term $\delta_t = \sqrt{\mathrm{KL}(w^t \,\|\, q^t)/2}$. This directly quantifies the accuracy-fairness tradeoff in boosting dynamics. Experiments on standard benchmarks validate the theoretical predictions and demonstrate competitive fairness-accuracy tradeoffs with stable training curves.
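The fairness cost term can be computed directly for discrete distributions. The sketch below evaluates the square root of KL(w || q)/2 and subtracts it from a weak-learner edge, following the wording of the abstract. The toy distributions and the edge value are hypothetical, and since the abstract does not say which of the two distributions is the projected one, the code stays agnostic on that point.

```python
import math


def kl_divergence(w, q):
    """KL(w || q) for discrete distributions given as equal-length lists."""
    return sum(wi * math.log(wi / qi) for wi, qi in zip(w, q) if wi > 0.0)


def fairness_cost(w, q):
    """delta_t = sqrt(KL(w^t || q^t) / 2), as written in the abstract."""
    return math.sqrt(max(kl_divergence(w, q), 0.0) / 2.0)


# Hypothetical toy distributions over four training examples (assumptions).
w_t = [0.3, 0.3, 0.2, 0.2]
q_t = [0.25, 0.25, 0.25, 0.25]
edge = 0.20  # hypothetical weak-learner edge

delta = fairness_cost(w_t, q_t)
effective_edge = edge - delta
print(round(delta, 4), round(effective_edge, 4))
```

When the two distributions coincide the cost is zero and the edge is unchanged; the further apart they are, the more of the edge is consumed.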
[LG-37] Fix Representation (Optimally) Before Fairness: Finite-Sample Shrinkage Population Correction and the True Price of Fairness Under Subpopulation Shift
Link: https://arxiv.org/abs/2602.05707
Authors: Amir Asiaee, Kaveh Aryan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Machine learning practitioners frequently observe tension between predictive accuracy and group fairness constraints – yet sometimes fairness interventions appear to improve accuracy. We show that both phenomena can be artifacts of training data that misrepresents subgroup proportions. Under subpopulation shift (stable within-group distributions, shifted group proportions), we establish: (i) full importance-weighted correction is asymptotically unbiased but finite-sample suboptimal; (ii) the optimal finite-sample correction is a shrinkage reweighting that interpolates between target and training mixtures; (iii) apparent “fairness helps accuracy” can arise from comparing fairness methods to an improperly-weighted baseline. We provide an actionable evaluation protocol: fix representation (optimally) before fairness – compare fairness interventions against a shrinkage-corrected baseline to isolate the true, irreducible price of fairness. Experiments on synthetic and real-world benchmarks (Adult, COMPAS) validate our theoretical predictions and demonstrate that this protocol eliminates spurious tradeoffs, revealing the genuine fairness-utility frontier.
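One way to read "a shrinkage reweighting that interpolates between target and training mixtures" is sketched below: group weights are computed against a convex combination of the training and target group proportions. The interpolation parameter `lam`, the function name, and the example proportions are all hypothetical; the paper derives an optimal finite-sample amount of shrinkage, which is not reproduced here.

```python
def shrinkage_weights(train_props, target_props, lam):
    """Per-group weights for a mixture between training and target proportions.

    lam = 0 keeps the training mixture (all weights 1);
    lam = 1 is full importance weighting (target / train).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    weights = {}
    for group, p_train in train_props.items():
        mixed = (1.0 - lam) * p_train + lam * target_props[group]
        weights[group] = mixed / p_train
    return weights


# Hypothetical group proportions (assumptions for illustration).
train = {"a": 0.8, "b": 0.2}
target = {"a": 0.5, "b": 0.5}
for lam in (0.0, 0.5, 1.0):
    print(lam, shrinkage_weights(train, target, lam))
```

Intermediate values of `lam` trade the bias of ignoring the shift against the variance of heavily upweighting a small group, which is the tension behind point (ii) of the abstract.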
[LG-38] Limitations of SGD for Multi-Index Models Beyond Statistical Queries
Link: https://arxiv.org/abs/2602.05704
Authors: Daniel Barzilai, Ohad Shamir
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: Existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of standard vanilla SGD, for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.
[LG-39] FedRandom: Sampling Consistent and Accurate Contribution Values in Federated Learning
Link: https://arxiv.org/abs/2602.05693
Authors: Arno Geimer, Beltran Fiz Pontiveros, Radu State
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Federated Learning is a privacy-preserving decentralized approach for Machine Learning tasks. In industry deployments characterized by a limited number of entities possessing abundant data, the significance of a participant’s role in shaping the global model becomes pivotal, given that participation in a federation incurs costs and participants may expect compensation for their involvement. Additionally, the contributions of participants serve as a crucial means to identify and address potential malicious actors and free-riders. However, fairly assessing individual contributions remains a significant hurdle. Recent works have demonstrated a considerable inherent instability in contribution estimations across aggregation strategies. While employing a different strategy may offer convergence benefits, this instability can have potentially harmful effects on the willingness of participants to engage in the federation. In this work, we introduce FedRandom, a novel mitigation technique for the contribution instability problem. Tackling the instability as a statistical estimation problem, FedRandom allows us to generate more samples than when using regular FL strategies. We show that these additional samples provide a more consistent and reliable evaluation of participant contributions. We demonstrate our approach using different data distributions across CIFAR-10, MNIST, CIFAR-100 and FMNIST and show that FedRandom reduces the overall distance to the ground truth by more than a third in half of all evaluated scenarios, and improves stability in more than 90% of cases.
[LG-40] Almost Asymptotically Optimal Active Clustering Through Pairwise Observations
Link: https://arxiv.org/abs/2602.05690
Authors: Rachel S. Y. Teo, P. N. Karthik, Ramya Korlakai Vinayak, Vincent Y. F. Tan
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 31 pages, 1 figure
Abstract:We propose a new analysis framework for clustering $M$ items into an unknown number of $K$ distinct groups using noisy and actively collected responses. At each time step, an agent is allowed to query pairs of items and observe bandit binary feedback. If the pair of items belongs to the same (resp. different) cluster, the observed feedback is 1 with probability $p > 1/2$ (resp. $q < 1/2$). Leveraging the ubiquitous change-of-measure technique, we establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the clustering accuracy, formulated as a sup-inf optimization problem. Building on this theoretical foundation, we design an asymptotically optimal algorithm in which the stopping criterion involves an empirical version of the inner infimum, the Generalized Likelihood Ratio (GLR) statistic, being compared to a threshold. We develop a computationally feasible variant of the GLR statistic and show that its performance gap to the lower bound can be accurately empirically estimated and remains within a constant multiple of the lower bound.
[LG-41] Perception-Based Beliefs for POMDPs with Visual Observations AAMAS2026
Link: https://arxiv.org/abs/2602.05679
Authors: Miriam Schäfers, Merlijn Krale, Thiago D. Simão, Nils Jansen, Maximilian Weininger
Subjects: Machine Learning (cs.LG)
Comments: Accepted at AAMAS 2026
Abstract:Partially observable Markov decision processes (POMDPs) are a principled planning model for sequential decision-making under uncertainty. Yet, real-world problems with high-dimensional observations, such as camera images, remain intractable for traditional belief- and filtering-based solvers. To tackle this problem, we introduce the Perception-based Beliefs for POMDPs framework (PBP), which complements such solvers with a perception model. This model takes the form of an image classifier which maps visual observations to probability distributions over states. PBP incorporates these distributions directly into belief updates, so the underlying solver does not need to reason explicitly over high-dimensional observation spaces. We show that the belief update of PBP coincides with the standard belief update if the image classifier is exact. Moreover, to handle classifier imprecision, we incorporate uncertainty quantification and introduce two methods to adjust the belief update accordingly. We implement PBP using two traditional POMDP solvers and empirically show that (1) it outperforms existing end-to-end deep RL methods and (2) uncertainty quantification improves robustness of PBP against visual corruption.
[LG-42] Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection ICLR
Link: https://arxiv.org/abs/2602.05667
Authors: Ling Zhan, Zhen Li, Junjie Huang, Tao Jia
Subjects: Machine Learning (cs.LG)
Comments: 33 pages, 8 figures, ICLR conference paper
Abstract:Benchmarking the hundreds of functional connectivity (FC) modeling methods on large-scale fMRI datasets is critical for reproducible neuroscience. However, the combinatorial explosion of model-data pairings makes exhaustive evaluation computationally prohibitive, preventing such assessments from becoming a routine pre-analysis step. To break this bottleneck, we reframe the challenge of FC benchmarking by selecting a small, representative core-set whose sole purpose is to preserve the relative performance ranking of FC operators. We formalize this as a ranking-preserving subset selection problem and propose Structure-aware Contrastive Learning for Core-set Selection (SCLCS), a self-supervised framework to select these core-sets. SCLCS first uses an adaptive Transformer to learn each sample’s unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, while SCLCS identifies stable samples via a top-k ranking, we further introduce a density-balanced sampling strategy as a necessary correction to promote diversity, ensuring the final core-set is both structurally robust and distributionally representative. On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) core-set selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core-set selection for FC operator benchmarking, thereby making large-scale operator comparisons a feasible and integral part of computational neuroscience. Code is publicly available at this https URL.
[LG-43] Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
Link: https://arxiv.org/abs/2602.05657
Authors: Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 32 pages
Abstract:The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{\beta_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
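To get a feel for the exponents in this abstract, the short sketch below evaluates beta_p = 4(p-1)/(3p-2) for a few moment orders p and compares the long-term exponent with the finite-time exponent beta_p/2 quoted for prior work. The function name and the chosen values of p are illustrative assumptions, and logarithmic factors are ignored.

```python
def beta_p(p):
    """Exponent beta_p = 4(p - 1) / (3p - 2), stated for p in (1, 2)."""
    if not 1.0 < p < 2.0:
        raise ValueError("beta_p as stated applies to p in (1, 2)")
    return 4.0 * (p - 1.0) / (3.0 * p - 2.0)


for p in (1.1, 1.5, 1.9):
    b = beta_p(p)
    # Long-term exponent from this work vs. the finite-time exponent beta_p / 2.
    print(f"p={p}: t^{b:.3f} (long-term) vs t^{b / 2:.3f} (finite-time)")
```

The exponent grows toward 1 as p approaches 2, so heavier-tailed noise (smaller p) gives slower tail decay under both kinds of bound.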
[LG-44] End-to-End Compression for Tabular Foundation Models
Link: https://arxiv.org/abs/2602.05649
Authors: Guri Zabërgja, Rafiq Kamel, Arlind Kadra, Christian M. M. Frey, Josif Grabocka
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:The long-standing dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates by leveraging the training data as context for predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their transformer architecture based on the attention mechanism has quadratic complexity regarding dataset size, which in turn increases the overhead on training and inference time, and limits the capacity of the models to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where our proposed method is up to 94x faster in inference time, while consuming up to 97% less memory compared to the state-of-the-art tabular transformer architecture, all while retaining performance without significant degradation. Lastly, our method not only scales better with increased dataset sizes, but it also achieves better performance compared to other baselines.
[LG-45] Empowering Time Series Analysis with Large-Scale Multimodal Pretraining
链接: https://arxiv.org/abs/2602.05646
作者: Peng Chen,Siyuan Wang,Shiyan Hu,Xingjian Wu,Yang Shu,Zhongwen Rao,Meng Wang,Yijie Li,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:While existing time series foundation models primarily rely on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces key challenges: 1) lack of a unified multimodal pretraining paradigm and large-scale multimodal corpora for time series analysis; 2) how to effectively integrate heterogeneous modalities and enhance model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that leverages time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset spanning six domains, with up to one billion points. Then we propose HORAI, a frequency-enhanced multimodal foundation model. It integrates two core components: the Frequency-enhanced Cross-Modality Encoder and the Time-Frequency Decoder, designed to effectively fuse multimodal features and enhance model generalization across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization.
[LG-46] UAV Trajectory Optimization via Improved Noisy Deep Q-Network
链接: https://arxiv.org/abs/2602.05644
作者: Zhang Hengyu,Maryam Cheraghy,Liu Wei,Armin Farhadi,Meysam Soltanpour,Zhong Zhuoqing
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes an Improved Noisy Deep Q-Network (Noisy DQN) to improve exploration and stability when applying deep reinforcement learning to Unmanned Aerial Vehicle (UAV) navigation in simulated environments. The method strengthens exploration by combining a residual NoisyLinear layer with an adaptive noise scheduling mechanism, and improves training stability through a smooth loss and soft target network updates. Experiments show that the proposed model converges faster, achieves rewards up to 40 higher than standard DQN, and quickly reaches the minimum number of steps required for the task (28) in the 15 × 15 grid navigation environment. The results show that the combined improvements to the NoisyNet network structure, exploration control, and training stability enhance the efficiency and reliability of deep Q-learning.
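The two stabilisers named in the abstract can be illustrated with a minimal NumPy sketch. It assumes that "smooth loss" means a Smooth L1 (Huber-style) loss and that "soft target network updates" means Polyak averaging; the function names, `beta`, and `tau` are hypothetical and are not taken from the paper.

```python
import numpy as np

def smooth_l1(td_error, beta=1.0):
    """Smooth L1 loss: quadratic for |error| < beta, linear beyond it."""
    abs_err = np.abs(td_error)
    return np.where(abs_err < beta, 0.5 * abs_err ** 2 / beta, abs_err - 0.5 * beta)

def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

# Toy parameters standing in for network weights (assumption: two tensors).
online = [np.ones((2, 2)), np.zeros(2)]
target = [np.zeros((2, 2)), np.ones(2)]
target = soft_update(target, online, tau=0.1)

# One small and one large TD error, to show both branches of the loss.
losses = smooth_l1(np.array([0.5, 2.0]))
```

With a small `tau` the target network trails the online network slowly, which is the usual motivation for soft updates over periodic hard copies.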
[LG-47] Joint Embedding Variational Bayes
链接: https://arxiv.org/abs/2602.05639
作者: Amin Oji,Paul Fieguth
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student-$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.
[LG-48] Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias
链接: https://arxiv.org/abs/2602.05635
作者: Ojasva Nema,Kaustubh Sharma,Aditya Chauhan,Parikshit Pareek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Selective unlearning and long-horizon extrapolation remain fragile in modern neural networks, even when tasks have underlying algebraic structure. In this work, we argue that these failures arise not solely from optimization or unlearning algorithms, but from how models structure their internal representations during training. We explore whether explicit multiplicative interactions, used as an architectural inductive bias, help structural disentanglement, through Bilinear MLPs. We show analytically that bilinear parameterizations possess a 'non-mixing' property under gradient flow conditions, where functional components separate into orthogonal subspace representations. This provides a mathematical foundation for surgical model modification. We validate this hypothesis through a series of controlled experiments spanning modular arithmetic, cyclic reasoning, Lie group dynamics, and targeted unlearning benchmarks. Unlike pointwise nonlinear networks, multiplicative architectures are able to recover true operators aligned with the underlying algebraic structure. Our results suggest that model editability and generalization are constrained by representational structure, and that architectural inductive bias plays a central role in enabling reliable unlearning.
[LG-49] On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature
链接: https://arxiv.org/abs/2602.05600
作者: Yikuan Zhang,Ning Yang,Yuhai Tu
类目: Machine Learning (cs.LG)
*备注: 8 pages, 15 figures
Abstract:Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity–Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
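A toy numerical illustration of how the exponent in the power-law relation can land at either end of the stated range is sketched below. It is not the paper's derivation: it assumes diagonal, non-negative per-sample curvatures, treats the squared per-sample Hessian as an elementwise square, and uses two synthetic extremes chosen for illustration. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_dims = 5000, 50

def fitted_exponent(h):
    """Slope of log E[h^2] against log E[h] across coordinates."""
    H = h.mean(axis=0)          # stands in for the diagonal of the mean Hessian
    C = (h ** 2).mean(axis=0)   # stands in for the diagonal of E[h_p^2]
    slope, _ = np.polyfit(np.log(H), np.log(C), 1)
    return slope

# Case A: every sample has the same curvature in a given coordinate,
# so E[h^2] = (E[h])^2 and the exponent is 2.
scales = np.logspace(-2, 1, n_dims)
h_same = np.tile(scales, (n_samples, 1))
gamma_same = fitted_exponent(h_same)

# Case B: curvature is 0 or 1 per sample, with a coordinate-dependent
# frequency, so h^2 = h, E[h^2] = E[h], and the exponent is 1.
freqs = np.logspace(-2, -0.1, n_dims)
h_sparse = (rng.random((n_samples, n_dims)) < freqs).astype(float)
gamma_sparse = fitted_exponent(h_sparse)
```

In this toy setting, homogeneous per-sample curvature gives an exponent of 2 and sparse, on/off curvature gives 1, which brackets the range quoted in the abstract.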
[LG-50] OpenMAG: A Comprehensive Benchmark for Multimodal-Attributed Graph
链接: https://arxiv.org/abs/2602.05576
作者: Chenxi Wan,Xunkai Li,Yilong Zuo,Haokun Deng,Sihan Li,Bowen Fan,Hongchao Qin,Ronghua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal-Attributed Graph (MAG) learning has achieved remarkable success in modeling complex real-world systems by integrating graph topology with rich attributes from multiple modalities. With the rapid proliferation of novel MAG models capable of handling intricate cross-modal semantics and structural dependencies, establishing a rigorous and unified evaluation standard has become imperative. Although existing benchmarks have facilitated initial progress, they exhibit critical limitations in domain coverage, encoder flexibility, model diversity, and task scope, presenting significant challenges to fair evaluation. To bridge this gap, we present OpenMAG, a comprehensive benchmark that integrates 19 datasets across 6 domains and incorporates 16 encoders to support both static and trainable feature encoding. OpenMAG further implements a standardized library of 24 state-of-the-art models and supports 8 downstream tasks, enabling fair comparisons within a unified framework. Through systematic assessment of necessity, data quality, effectiveness, robustness, and efficiency, we derive 14 fundamental insights into MAG learning to guide future advancements. Our code is available at this https URL.
[LG-51] EdgeMask-DG*: Learning Domain-Invariant Graph Structures via Adversarial Edge Masking
链接: https://arxiv.org/abs/2602.05571
作者: Rishabh Bhattacharya,Naresh Manwani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Structural shifts pose a significant challenge for graph neural networks, as graph topology acts as a covariate that can vary across domains. Existing domain generalization methods rely on fixed structural augmentations or training on globally perturbed graphs, mechanisms that do not pinpoint which specific edges encode domain-invariant information. We argue that domain-invariant structural information is not rigidly tied to a single topology but resides in the consensus across multiple graph structures derived from topology and feature similarity. To capture this, we first propose EdgeMask-DG, a novel min-max algorithm where an edge masker learns to find worst-case continuous masks subject to a sparsity constraint, compelling a task GNN to perform effectively under these adversarial structural perturbations. Building upon this, we introduce EdgeMask-DG*, an extension that applies this adversarial masking principle to an enriched graph. This enriched graph combines the original topology with feature-derived edges, allowing the model to discover invariances even when the original topology is noisy or domain-specific. EdgeMask-DG* is the first to systematically combine adaptive adversarial topology search with feature-enriched graphs. We provide a formal justification for our approach from a robust optimization perspective. We demonstrate that EdgeMask-DG* achieves new state-of-the-art performance on diverse graph domain generalization benchmarks, including citation networks, social networks, and temporal graphs. Notably, on the Cora OOD benchmark, EdgeMask-DG* lifts the worst-case domain accuracy to 78.0%, a +3.8 pp improvement over the prior state of the art (74.2%). The source code for our experiments can be found here: this https URL
[LG-52] MAGPrompt: Message-Adaptive Graph Prompt Tuning for Graph Neural Networks
链接: https://arxiv.org/abs/2602.05567
作者: Long D. Nguyen,Binh P. Nguyen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pre-trained graph neural networks (GNNs) transfer well, but adapting them to downstream tasks remains challenging due to mismatches between pre-training objectives and task requirements. Graph prompt tuning offers a parameter-efficient alternative to fine-tuning, yet most methods only modify inputs or representations and leave message passing unchanged, limiting their ability to adapt neighborhood interactions. We propose message-adaptive graph prompt tuning, which injects learnable prompts into the message passing step to reweight incoming neighbor messages and add task-specific prompt vectors during message aggregation, while keeping the backbone GNN frozen. The approach is compatible with common GNN backbones and pre-training strategies, and applicable across downstream settings. Experiments on diverse node- and graph-level datasets show consistent gains over prior graph prompting methods in few-shot settings, while achieving performance competitive with fine-tuning in full-shot regimes.
[LG-53] Logical Guidance for the Exact Composition of Diffusion Models
链接: https://arxiv.org/abs/2602.05549
作者: Francesco Alesiani,Jonathan Warrell,Tanja Bien,Henrik Christiansen,Matheus Ferraz,Mathias Niepert
类目: Machine Learning (cs.LG)
*备注:
Abstract:We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact score-based guidance for complex logical formulas can be obtained from guidance signals associated with atomic properties. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifier guidance and classifier-free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.
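The composition rules described in the abstract can be made concrete with elementary probability identities. The block below is a sketch in our own notation, not the paper's: $p_A = p(A \mid x_t)$ is the posterior probability of an atomic property and $s_A = \nabla_{x_t} \log p_A$ is its guidance score. Each line follows from differentiating the log of the stated probability.

```latex
% Notation (an assumption of this sketch): p_A = p(A | x_t), s_A = \nabla_{x_t} \log p_A.
\begin{align*}
\text{AND, conditionally independent:}\quad
  p_{A \wedge B} &= p_A\, p_B,
  & s_{A \wedge B} &= s_A + s_B \\
\text{OR, mutually exclusive:}\quad
  p_{A \vee B} &= p_A + p_B,
  & s_{A \vee B} &= \frac{p_A\, s_A + p_B\, s_B}{p_A + p_B} \\
\text{OR, conditionally independent:}\quad
  p_{A \vee B} &= 1 - (1 - p_A)(1 - p_B),
  & s_{A \vee B} &= \frac{p_A (1 - p_B)\, s_A + p_B (1 - p_A)\, s_B}{p_{A \vee B}} \\
\text{NOT:}\quad
  p_{\neg A} &= 1 - p_A,
  & s_{\neg A} &= -\frac{p_A}{1 - p_A}\, s_A
\end{align*}
```

Every right-hand side uses only atomic scores and posterior probabilities, which is why a recursive evaluation over a circuit of such gates is possible.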
[LG-54] Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification ICLR2026
链接: https://arxiv.org/abs/2602.05535
作者: Tao Huang,Rui Wang,Xiaofei Liu,Yi Qin,Li Duan,Liping Jing
类目: Machine Learning (cs.LG)
*备注: Accepted to ICLR 2026. Code is available at this https URL
Abstract:Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as misbehaviors of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at this https URL.
[LG-55] When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL
链接: https://arxiv.org/abs/2602.05459
作者: Jan Malte Töpperwien,Aditya Mohan,Marius Lindauer
类目: Machine Learning (cs.LG)
*备注: 27 pages, 19 figures
Abstract:Hyperparameter sensitivity in Deep Reinforcement Learning (RL) is often accepted as unavoidable. However, it remains unclear whether it is intrinsic to the RL problem or exacerbated by specific training mechanisms. We investigate this question in offline goal-conditioned RL, where data distributions are fixed, and non-stationarity can be explicitly controlled via scheduled shifts in data quality. Additionally, we study varying data qualities under both stationary and non-stationary regimes, and cover two representative algorithms: HIQL (bootstrapped TD-learning) and QRL (quasimetric representation learning). Overall, we observe substantially greater robustness to changes in hyperparameter configurations than commonly reported for online RL, even under controlled non-stationarity. Once modest expert data is present ($\approx$ 20%), QRL maintains broad, stable near-optimal regions, while HIQL exhibits sharp optima that drift significantly across training phases. To explain this divergence, we introduce an inter-goal gradient alignment diagnostic. We find that bootstrapped objectives exhibit stronger destructive gradient interference, which coincides directly with hyperparameter sensitivity. These results suggest that high sensitivity to changes in hyperparameter configurations during training is not inevitable in RL, but is amplified by the dynamics of bootstrapping, offering a pathway toward more robust algorithmic objective design.
[LG-56] BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs
链接: https://arxiv.org/abs/2602.05448
作者: Sheshansh Agrawal,Thien Hang Nguyen,Douwe Kiela
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models have emerged as powerful zero-shot rerankers for retrieval-augmented generation, offering strong generalization without task-specific training. However, existing LLM reranking methods either rely on heuristics that fail to fully exploit the information revealed by each ranking decision or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise reranking. Our key observation is that each $k$-document comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields many additional orderings without further model invocations. We formalize when a candidate’s rank is certifiably determined and design a query schedule that greedily maximizes information gain towards identifying the top-$m$ items. Our framework also gracefully handles non-transitive preferences (cycles induced by LLM judgments) by collapsing them into equivalence classes that yield principled tiered rankings. Empirically, across 14 benchmarks and 5 LLMs, our method achieves Pareto dominance over existing methods: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable approaches, and $7\times$ fewer than pairwise methods at near-identical quality.
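The three graph ideas in the abstract (a $k$-wise ranking implying $\binom{k}{2}$ pairwise edges, transitive closure, and collapsing cycles into tiers) can be sketched in a few lines of plain Python. This is an illustrative toy, not the BlitzRank implementation; the function names and the document labels are hypothetical, and the query scheduling step is omitted.

```python
from itertools import combinations

def add_ranking(edges, ranked_docs):
    """A k-wise ranking (best first) implies C(k, 2) pairwise preferences."""
    for winner, loser in combinations(ranked_docs, 2):
        edges.add((winner, loser))

def transitive_closure(nodes, edges):
    """Warshall-style closure of the 'is preferred to' relation."""
    reach = set(edges)
    for mid in nodes:
        for a in nodes:
            if (a, mid) in reach:
                for b in nodes:
                    if (mid, b) in reach:
                        reach.add((a, b))
    return reach

def tiers(nodes, reach):
    """Group documents that beat each other (a cycle) into one tier."""
    classes = []
    for n in nodes:
        for cls in classes:
            rep = cls[0]
            if (n, rep) in reach and (rep, n) in reach:
                cls.append(n)
                break
        else:
            classes.append([n])
    return classes

# Two 3-wise comparisons over five documents that share document "c".
nodes = ["a", "b", "c", "d", "e"]
edges = set()
add_ranking(edges, ["a", "b", "c"])
add_ranking(edges, ["c", "d", "e"])
closure = transitive_closure(nodes, edges)

# A non-transitive set of judgments: x > y, y > z, z > x.
cycle_nodes = ["x", "y", "z"]
cycle_closure = transitive_closure(cycle_nodes, {("x", "y"), ("y", "z"), ("z", "x")})
cycle_tiers = tiers(cycle_nodes, cycle_closure)
```

In the first example two model calls produce 6 direct preferences, and the closure adds the 4 remaining pairs, so all 10 pairwise orderings of the five documents are determined. In the second, the cycle collapses into a single tier.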
[LG-57] Robust Federated Learning via Byzantine Filtering over Encrypted Updates
链接: https://arxiv.org/abs/2602.05410
作者: Adda Akram Bendoukha,Aymen Boudguiga,Nesrine Kaaniche,Renaud Sirdey,Didem Demirag,Sébastien Gambs
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated Learning (FL) aims to train a collaborative model while preserving data privacy. However, the distributed nature of this approach still raises privacy and security issues, such as the exposure of sensitive data due to inference attacks and the influence of Byzantine behaviors on the trained model. In particular, achieving both secure aggregation and Byzantine resilience remains challenging, as existing solutions often address these aspects independently. In this work, we propose to address these challenges through a novel approach that combines homomorphic encryption for privacy-preserving aggregation with property-inference-inspired meta-classifiers for Byzantine filtering. First, following the property-inference attacks blueprint, we train a set of filtering meta-classifiers on labeled shadow updates, reproducing a diverse ensemble of Byzantine misbehaviors in FL, including backdoor, gradient-inversion, label-flipping and shuffling attacks. The outputs of these meta-classifiers are then used to cancel the Byzantine encrypted updates by reweighting. Second, we propose an automated method for selecting the optimal kernel and the dimensionality hyperparameters with respect to homomorphic inference, aggregation constraints and efficiency over the CKKS cryptosystem. Finally, we demonstrate through extensive experiments the effectiveness of our approach against Byzantine participants on the FEMNIST, CIFAR10, GTSRB, and acsincome benchmarks. More precisely, our SVM filtering achieves accuracies between 90% and 94% for identifying Byzantine updates at the cost of marginal losses in model utility and encrypted inference runtimes ranging from 6 to 24 seconds and from 9 to 26 seconds for an overall aggregation.
[LG-58] A Decomposition-based State Space Model for Multivariate Time-Series Forecasting ICASSP2026
链接: https://arxiv.org/abs/2602.05389
作者: Shunya Nagashima,Shuntaro Suzuki,Shuitsu Koyama,Shinnosuke Hirano
类目: Machine Learning (cs.LG)
*备注: ICASSP2026
Abstract:Multivariate time series (MTS) forecasting is crucial for decision-making in domains such as weather, energy, and finance. It remains challenging because real-world sequences intertwine slow trends, multi-rate seasonalities, and irregular residuals. Existing methods often rely on rigid, hand-crafted decompositions or generic end-to-end architectures that entangle components and underuse structure shared across variables. To address these limitations, we propose DecompSSM, an end-to-end decomposition framework using three parallel deep state space model branches to capture trend, seasonal, and residual components. The model features adaptive temporal scales via an input-dependent predictor, a refinement module for shared cross-variable context, and an auxiliary loss that enforces reconstruction and orthogonality. Across standard benchmarks (ECL, Weather, ETTm2, and PEMS04), DecompSSM outperformed strong baselines, indicating the effectiveness of combining component-wise deep state space models and global context refinement.
[LG-59] Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting
链接: https://arxiv.org/abs/2602.05371
作者: Hongyi Li,Han Lin,Jun Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT’s model class is a universal approximator with an explicit $O(\delta^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
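A minimal sketch of an alternating fit for a single hinge split is given below, assuming the split predicts with the upper envelope of two linear models. It is an illustration under our own assumptions, not the authors' algorithm: `fit_hinge`, the `damping` parameter, and the toy data are hypothetical, and the line-search and ridge options mentioned in the abstract are left out.

```python
import numpy as np

def fit_hinge(X, y, thetas, n_iters=10, damping=1.0):
    """Alternate between partitioning samples and refitting
    f(x) = max(X @ thetas[0], X @ thetas[1]) by least squares."""
    thetas = [np.asarray(t, dtype=float) for t in thetas]
    for _ in range(n_iters):
        # Partition: which linear predictor is currently on top.
        on_first = X @ thetas[0] >= X @ thetas[1]
        for idx, mask in enumerate((on_first, ~on_first)):
            if mask.sum() < X.shape[1]:
                continue  # too few points on this side to refit
            target, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
            # damping=1.0 takes the full least-squares step.
            thetas[idx] = (1.0 - damping) * thetas[idx] + damping * target
    return thetas

# Toy target y = |x| = max(x, -x), with a bias column appended to X.
x = np.linspace(-1.0, 1.0, 41)
X = np.column_stack([x, np.ones_like(x)])
y = np.abs(x)

thetas = fit_hinge(X, y, thetas=[[1.0, 0.2], [-0.5, 0.0]])
pred = np.maximum(X @ thetas[0], X @ thetas[1])
max_err = np.max(np.abs(pred - y))
```

On this toy problem the partition stabilises after two passes and the two fitted lines recover the branches of the absolute value. A poor initialisation can leave one side nearly empty, which is one reason damping or a line search may be preferable in practice.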
[LG-60] Bayesian Neighborhood Adaptation for Graph Neural Networks
链接: https://arxiv.org/abs/2602.05358
作者: Paribesh Regmi,Rui Li,Kishan K C
类目: Machine Learning (cs.LG)
*备注: Published in Transactions on Machine Learning Research (TMLR), 07/2025
Abstract:The neighborhood scope (i.e., number of hops) over which graph neural networks (GNNs) aggregate information to characterize a node’s statistical property is critical to GNNs’ performance. Two-stage approaches, which train and validate GNNs for every pre-specified neighborhood scope to search for the best setting, are time-consuming and tend to be biased by the design of the search space. How to adaptively determine proper neighborhood scopes for the aggregation process for both homophilic and heterophilic graphs remains largely unexplored. We thus propose to model the GNNs’ message-passing behavior on a graph as a stochastic process by treating the number of hops as a beta process. This Bayesian framework allows us to infer the most plausible neighborhood scope for message aggregation simultaneously with the optimization of GNN parameters. Our theoretical analysis shows that the scope inference improves the expressivity of a GNN. Experiments on benchmark homophilic and heterophilic datasets show that the proposed method is compatible with state-of-the-art GNN variants, achieving competitive or superior performance on the node classification task, and providing well-calibrated predictions.
[LG-61] Smoothness Errors in Dynamics Models and How to Avoid Them ICML2026
链接: https://arxiv.org/abs/2602.05352
作者: Edward Berman,Luisa Li,Jung Yeon Park,Robin Walters
类目: Machine Learning (cs.LG); Symplectic Geometry (math.SG)
*备注: Ecstatic to share relaxed unitary mesh convolutions with the community :D! Work is under review at ICML 2026. First two authors contributed equally
Abstract:Modern neural networks have shown promise for solving partial differential equations over surfaces, often by discretizing the surface as a mesh and learning with a mesh-aware graph neural network. However, graph neural networks suffer from oversmoothing, where a node’s features become increasingly similar to those of its neighbors. Unitary graph convolutions, which are mathematically constrained to preserve smoothness, have been proposed to address this issue. Despite this, in many physical systems, such as diffusion processes, smoothness naturally increases and unitarity may be overconstraining. In this paper, we systematically study the smoothing effects of different GNNs for dynamics modeling and prove that unitary convolutions hurt performance for such tasks. We propose relaxed unitary convolutions that balance smoothness preservation with the natural smoothing required for physical systems. We also generalize unitary and relaxed unitary convolutions from graphs to meshes. In experiments on PDEs such as the heat and wave equations over complex meshes and on weather forecasting, we find that our method outperforms several strong baselines, including mesh-aware transformers and equivariant neural networks.
[LG-62] Pool-based Active Learning as Noisy Lossy Compression: Characterizing Label Complexity via Finite Blocklength Analysis
链接: https://arxiv.org/abs/2602.05333
作者: Kosuke Sugiyama,Masato Uchida
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 21 pages, 1 figure
Abstract:This paper proposes an information-theoretic framework for analyzing the theoretical limits of pool-based active learning (AL), in which a subset of instances is selectively labeled. The proposed framework reformulates pool-based AL as a noisy lossy compression problem by mapping pool observations to noisy symbol observations, data selection to compression, and learning to decoding. This correspondence enables a unified information-theoretic analysis of data selection and learning in pool-based AL. Applying finite blocklength analysis of noisy lossy compression, we derive information-theoretic lower bounds on label complexity and generalization error that serve as theoretical limits for a given learning algorithm under its associated optimal data selection strategy. Specifically, our bounds include terms that reflect overfitting induced by the learning algorithm and the discrepancy between its inductive bias and the target task, and are closely related to established information-theoretic bounds and stability theory, which have not been previously applied to the analysis of pool-based AL. These properties yield a new theoretical perspective on pool-based AL.
[LG-63] Accelerated Sequential Flow Matching: A Bayesian Filtering Perspective
链接: https://arxiv.org/abs/2602.05319
作者: Yinan Huang,Hans Hao-Hsun Hsu,Junran Wang,Bo Dai,Pan Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Sequential prediction from streaming observations is a fundamental problem in stochastic dynamical systems, where inherent uncertainty often leads to multiple plausible futures. While diffusion and flow-matching models are capable of modeling complex, multi-modal trajectories, their deployment in real-time streaming environments typically relies on repeated sampling from a non-informative initial distribution, incurring substantial inference latency and potential system backlogs. In this work, we introduce Sequential Flow Matching, a principled framework grounded in Bayesian filtering. By treating streaming inference as learning a probability flow that transports the predictive distribution from one time step to the next, our approach naturally aligns with the recursive structure of Bayesian belief updates. We provide theoretical justification that initializing generation from the previous posterior offers a principled warm start that can accelerate sampling compared to naïve re-sampling. Across a wide range of forecasting, decision-making and state estimation tasks, our method achieves performance competitive with full-step diffusion while requiring only one or very few sampling steps, and therefore samples faster. This suggests that framing sequential inference via Bayesian filtering provides a new and principled perspective towards efficient real-time deployment of flow-based models.
[LG-64] A Short and Unified Convergence Analysis of the SAG, SAGA, and IAG Algorithms
链接: https://arxiv.org/abs/2602.05304
作者: Feng Zhu,Robert W. Heath Jr.,Aritra Mitra
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:
Abstract:Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
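For readers unfamiliar with the methods named in this abstract, the following is a minimal sketch of the textbook SAGA update on a least-squares finite sum. It is not taken from the paper: the function name `saga_least_squares`, the problem instance, and the step size `1/(3L)` are illustrative assumptions. The comments note how the SAG and IAG updates differ from SAGA.

```python
import numpy as np

def saga_least_squares(A, b, step, n_iters, seed=0):
    """Minimize (1/2n) * sum_i (a_i^T x - b_i)^2 with the SAGA update.

    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    # Table holding the most recently evaluated gradient of each component f_i.
    table = A * (A @ x - b)[:, None]          # shape (n, d)
    table_mean = table.mean(axis=0)
    for _ in range(n_iters):
        i = rng.integers(n)                    # uniform sampling (IAG would cycle instead)
        g_new = A[i] * (A[i] @ x - b[i])       # fresh gradient of component i
        # SAGA direction: unbiased estimate of the full gradient.
        # SAG would instead use (g_new - table[i]) / n + table_mean.
        direction = g_new - table[i] + table_mean
        x = x - step * direction
        # Refresh the stored gradient and the running mean of the table.
        table_mean = table_mean + (g_new - table[i]) / n
        table[i] = g_new
    return x
```

The stored-gradient table is the source of the "delays" mentioned in the abstract: each entry was computed at an older iterate, and the analysis has to control how stale those entries can become.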
[LG-65] Robust Inference-Time Steering of Protein Diffusion Models via Embedding Optimization
链接: https://arxiv.org/abs/2602.05285
作者: Minhuan Li,Jiequn Han,Pilar Cossio,Luhuan Wu
类目: Machine Learning (cs.LG)
*备注:
Abstract:In many biophysical inverse problems, the goal is to generate biomolecular conformations that are both physically plausible and consistent with experimental measurements. As recent sequence-to-structure diffusion models provide powerful data-driven priors, posterior sampling has emerged as a popular framework by guiding atomic coordinates to target conformations using experimental likelihoods. However, when the target lies in a low-density region of the prior, posterior sampling requires aggressive and brittle weighting of the likelihood guidance. Motivated by this limitation, we propose EmbedOpt, an alternative inference-time approach for steering diffusion models to optimize experimental likelihoods in the conditional embedding space. As this space encodes rich sequence and coevolutionary signals, optimizing over it effectively shifts the diffusion prior to align with experimental constraints. We validate EmbedOpt on two benchmarks simulating cryo-electron microscopy map fitting and experimental distance constraints. We show that EmbedOpt outperforms the coordinate-based posterior sampling method in map fitting tasks, matches performance on distance constraint tasks, and exhibits superior engineering robustness across hyperparameters spanning two orders of magnitude. Moreover, its smooth optimization behavior enables a significant reduction in the number of diffusion steps required for inference, leading to better efficiency.
[LG-66] TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training
链接: https://arxiv.org/abs/2602.05251
作者: Guanjie Cheng,Boyi Li,Lingyu Sun,Mengying Zhu,Yangyang Wu,Xinkui Zhao,Shuiguang Deng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based, suffering from bias and limited diversity, or data-driven but task-agnostic, failing to optimize for multi-task scenarios. To address these gaps, we introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training that integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. TADS employs a comprehensive quality assessment system with unimodal and cross-modal operators, quantifies task relevance via interpretable similarity vectors, and optimizes diversity through cluster-based weighting. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance across multiple downstream tasks. Experiments on CC12M demonstrate that TADS achieves superior zero-shot performance on benchmarks like ImageNet, CIFAR-100, MS-COCO, and Flickr30K, using only 36% of the data while outperforming baselines by an average of 1.0%. This highlights that TADS significantly enhances data efficiency by curating a high-utility subset that yields a much higher performance ceiling within the same computational constraints.
[LG-67] CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers
链接: https://arxiv.org/abs/2602.05243
作者: Boxiang Zhang,Baijian Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
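The idea of modelling removed activations as an affine function of retained ones and folding the ridge solution into the weights can be sketched for a single MLP output projection as follows. This is a hedged illustration under assumptions, not the authors' implementation: the function name `prune_with_ridge_compensation`, the shapes, and the regularizer `lam` are hypothetical, and the attention-logit case is omitted.

```python
import numpy as np

def prune_with_ridge_compensation(W2, b2, H, keep, lam=1e-3):
    """Prune hidden units of y = W2 @ h + b2 and compensate in closed form.

    W2: (out, hidden) projection, b2: (out,) bias,
    H: (N, hidden) hidden activations on an unlabeled calibration set,
    keep: indices of retained hidden units.
    Returns (W2_new, b2_new), which act on h[keep] only.
    """
    keep = np.asarray(keep)
    drop = np.setdiff1d(np.arange(H.shape[1]), keep)
    Hk, Hd = H[:, keep], H[:, drop]
    # Centre both blocks so the intercept can be recovered separately.
    mu_k, mu_d = Hk.mean(axis=0), Hd.mean(axis=0)
    Xc, Yc = Hk - mu_k, Hd - mu_d
    # Ridge regression Hd ~ Hk @ B + c, solved in closed form.
    B = np.linalg.solve(Xc.T @ Xc + lam * np.eye(len(keep)), Xc.T @ Yc)
    c = mu_d - mu_k @ B
    # Fold the affine predictor of the dropped units into the kept weights.
    W2_new = W2[:, keep] + W2[:, drop] @ B.T
    b2_new = b2 + W2[:, drop] @ c
    return W2_new, b2_new
```

When the dropped activations are well predicted by the kept ones on the calibration set, the pruned layer's output stays close to the original layer's output without any gradient step.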
[LG-68] Private Prediction via Shrinkage
链接: https://arxiv.org/abs/2602.05219
作者: Chao Yan
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study differentially private prediction introduced by Dwork and Feldman (COLT 2018): an algorithm receives one labeled sample set $S$ and then answers a stream of unlabeled queries while the output transcript remains $(\varepsilon,\delta)$-differentially private with respect to $S$. Standard composition yields a $\sqrt{T}$ dependence for $T$ queries. We show that this dependence can be reduced to polylogarithmic in $T$ in streaming settings. For an oblivious online adversary and any concept class $\mathcal{C}$, we give a private predictor that answers $T$ queries with $|S| = \tilde{O}(\mathrm{VC}(\mathcal{C})^{3.5}\log^{3.5} T)$ labeled examples. For an adaptive online adversary and halfspaces over $\mathbb{R}^d$, we obtain $|S| = \tilde{O}\left(d^{5.5}\log T\right)$.
[LG-69] Disentangled Representation Learning via Flow Matching
链接: https://arxiv.org/abs/2602.05214
作者: Jinjin Chi,Taoping Liu,Mengtao Yin,Ximing Li,Yongcheng Jing,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Disentangled representation learning aims to capture the underlying explanatory factors of observed data, enabling a principled understanding of the data-generating process. Recent advances in generative modeling have introduced new paradigms for learning such representations. However, existing diffusion-based methods encourage factor independence via inductive biases, yet frequently lack strong semantic alignment. In this work, we propose a flow matching-based framework for disentangled representation learning, which casts disentanglement as learning factor-conditioned flows in a compact latent space. To enforce explicit semantic alignment, we introduce a non-overlap (orthogonality) regularizer that suppresses cross-factor interference and reduces information leakage between factors. Extensive experiments across multiple datasets demonstrate consistent improvements over representative baselines, yielding higher disentanglement scores as well as improved controllability and sample fidelity.
[LG-70] SpectraKAN: Conditioning Spectral Operators
链接: https://arxiv.org/abs/2602.05187
作者: Chun-Wun Cheng,Carola-Bibiane Schönlieb,Angelica I. Aviles-Rivero
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Spectral neural operators, particularly Fourier Neural Operators (FNO), are a powerful framework for learning solution operators of partial differential equations (PDEs) due to their efficient global mixing in the frequency domain. However, existing spectral operators rely on static Fourier kernels applied uniformly across inputs, limiting their ability to capture multi-scale, regime-dependent, and anisotropic dynamics governed by the global state of the system. We introduce SpectraKAN, a neural operator that conditions the spectral operator on the input itself, turning static spectral convolution into an input-conditioned integral operator. This is achieved by extracting a compact global representation from spatio-temporal history and using it to modulate a multi-scale Fourier trunk via single-query cross-attention, enabling the operator to adapt its behaviour while retaining the efficiency of spectral mixing. We provide theoretical justification showing that this modulation converges to a resolution-independent continuous operator under mesh refinement and KAN gives smooth, Lipschitz-controlled global modulation. Across diverse PDE benchmarks, SpectraKAN achieves state-of-the-art performance, reducing RMSE by up to 49% over strong baselines, with particularly large gains on challenging spatio-temporal prediction tasks.
[LG-71] Fairness Under Group-Conditional Prior Probability Shift: Invariance, Drift, and Target-Aware Post-Processing
链接: https://arxiv.org/abs/2602.05144
作者: Amir Asiaee,Kaveh Aryan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Machine learning systems are often trained and evaluated for fairness on historical data, yet deployed in environments where conditions have shifted. A particularly common form of shift occurs when the prevalence of positive outcomes changes differently across demographic groups, for example, when disease rates rise faster in one population than another, or when economic conditions affect loan default rates unequally. We study group-conditional prior probability shift (GPPS), where the label prevalence $P(Y=1\mid A=a)$ may change between training and deployment while the feature-generation process $P(X\mid Y,A)$ remains stable. Our analysis yields three main contributions. First, we prove a fundamental dichotomy: fairness criteria based on error rates (equalized odds) are structurally invariant under GPPS, while acceptance-rate criteria (demographic parity) can drift, and we prove that this drift is unavoidable for non-trivial classifiers (shift-robust impossibility). Second, we show that target-domain risk and fairness metrics are identifiable without target labels: the invariance of ROC quantities under GPPS enables consistent estimation from source labels and unlabeled target data alone, with finite-sample guarantees. Third, we propose TAP-GPPS, a label-free post-processing algorithm that estimates prevalences from unlabeled data, corrects posteriors, and selects thresholds to satisfy demographic parity in the target domain. Experiments validate our theoretical predictions and demonstrate that TAP-GPPS achieves target fairness with minimal utility loss.
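The "corrects posteriors" step can be illustrated with the standard Bayes re-weighting that applies whenever the class-conditional feature distribution is unchanged and only the label prevalence moves. The sketch below is an assumption-laden illustration rather than the TAP-GPPS algorithm itself: the function names are hypothetical, and the group prevalences are taken as given instead of being estimated from unlabeled data.

```python
def correct_posterior(p_src, prev_src, prev_tgt):
    """Re-weight a source posterior P_src(Y=1 | x, a) to a new label prevalence.

    Valid when P(X | Y, A) is the same in source and target, so only the
    prior odds change. prev_src and prev_tgt are P(Y=1 | A=a) in each domain.
    """
    pos = p_src * prev_tgt / prev_src
    neg = (1.0 - p_src) * (1.0 - prev_tgt) / (1.0 - prev_src)
    return pos / (pos + neg)

def correct_by_group(scores, groups, prev_src, prev_tgt):
    """Apply the correction per example, using that example's group prevalences.

    prev_src and prev_tgt are dicts mapping a group label to its prevalence.
    """
    return [
        correct_posterior(s, prev_src[g], prev_tgt[g])
        for s, g in zip(scores, groups)
    ]
```

As a quick check of the formula: a likelihood ratio of 2 with a source prevalence of 0.5 gives a source posterior of 2/3, and moving to a target prevalence of 0.2 gives 0.4 / (0.4 + 0.8) = 1/3. Because each group gets its own correction, the acceptance rate of a fixed threshold can move differently across groups, which is the demographic-parity drift the abstract describes.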
[LG-72] Adaptive Exploration for Latent-State Bandits
链接: https://arxiv.org/abs/2602.05139
作者: Jikai Jin,Kenneth Hung,Sanath Kumar Krishnamurthy,Baoyi Shi,Congshan Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.
[LG-73] Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers
链接: https://arxiv.org/abs/2602.05136
作者: Hao Chen,Jinghui Yuan,Hanmin Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push–pull interaction induces radial oscillations, injecting noise into Adam’s second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam’s adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
[LG-74] SemPipes – Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines
链接: https://arxiv.org/abs/2602.05134
作者: Olga Ovcharenko,Matthias Boehm,Sebastian Schelter
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
Abstract:Real-world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM-powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM-based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at this https URL.
[LG-75] Certifiable Boolean Reasoning Is Universal COLT2026
链接: https://arxiv.org/abs/2602.05120
作者: Wenhao Li,Anastasis Kratsios,Hrad Ghoukasian,Dennis Zvigelsky
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: Submitted to COLT 2026
Abstract:The proliferation of agentic systems has thrust the reasoning capabilities of AI into the forefront of contemporary machine learning. While it is known that there exist neural networks which can reason through any Boolean task $f:\{0,1\}^B\to\{0,1\}$, in the sense that they emulate Boolean circuits with fan-in 2 and fan-out 1 gates, trained models have been repeatedly demonstrated to fall short of these theoretical ideals. This raises the question: Can one exhibit a deep learning model which certifiably always reasons and can universally reason through any Boolean task? Moreover, such a model should ideally require few parameters to solve simple Boolean tasks. We answer this question affirmatively by exhibiting a deep learning architecture which parameterizes distributions over Boolean circuits with the guarantee that, for every parameter configuration, a sample is almost surely a valid Boolean circuit (and hence admits an intrinsic circuit-level certificate). We then prove a universality theorem: for any Boolean $f:\{0,1\}^B\to\{0,1\}$, there exists a parameter configuration under which the sampled circuit computes $f$ with arbitrarily high probability. When $f$ is an $\mathcal{O}(\log B)$-junta, the required number of parameters scales linearly with the input dimension $B$. Empirically, on a controlled truth-table completion benchmark aligned with our setting, the proposed architecture trains reliably and achieves high exact-match accuracy while preserving the predicted structure: every internal unit is Boolean-valued on $\{0,1\}^B$. Matched MLP baselines reach comparable accuracy, but only about 10% of hidden units admit a Boolean representation; i.e., are two-valued over the Boolean cube.
[LG-76] Unbiased Single-Queried Gradient for Combinatorial Objective
链接: https://arxiv.org/abs/2602.05119
作者: Thanawat Sornwanee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In a probabilistic reformulation of a combinatorial problem, we often face an optimization over a hypercube, which corresponds to the Bernoulli probability parameter for each binary variable in the primal problem. The combinatorial nature suggests that an exact gradient computation requires multiple queries. We propose a stochastic gradient that is unbiased and requires only a single query of the combinatorial function. This method encompasses the well-established REINFORCE estimator (through importance sampling) as well as a class of new stochastic gradients.
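As a point of reference, the REINFORCE special case mentioned in the abstract can be written down directly for independent Bernoulli variables. The sketch below is not the paper's new estimator class. It shows only the classical score-function estimator, which queries the combinatorial function once per sample, and checks its unbiasedness by exhaustive enumeration on a tiny problem. All function names are hypothetical.

```python
import itertools

def prob(x, theta):
    """Probability of the binary vector x under independent Bernoulli(theta_i)."""
    p = 1.0
    for xi, ti in zip(x, theta):
        p *= ti if xi == 1 else 1.0 - ti
    return p

def reinforce_estimate(f, x, theta):
    """Single-query score-function estimate of the gradient of E[f(x)] w.r.t. theta."""
    fx = f(x)  # the only evaluation of the combinatorial function
    return [fx * (xi - ti) / (ti * (1.0 - ti)) for xi, ti in zip(x, theta)]

def expectation(f, theta):
    """Exact E[f(x)] by enumerating all 2^n binary vectors (tiny n only)."""
    return sum(prob(x, theta) * f(x)
               for x in itertools.product((0, 1), repeat=len(theta)))

def expected_estimate(f, theta):
    """Exact expectation of the single-query estimator, again by enumeration."""
    total = [0.0] * len(theta)
    for x in itertools.product((0, 1), repeat=len(theta)):
        w = prob(x, theta)
        g = reinforce_estimate(f, x, theta)
        total = [t + w * gi for t, gi in zip(total, g)]
    return total

def exact_gradient(f, theta):
    """d/d theta_i of E[f(x)] equals E[f | x_i = 1] - E[f | x_i = 0]."""
    grad = []
    for i in range(len(theta)):
        hi, lo = list(theta), list(theta)
        hi[i], lo[i] = 1.0, 0.0
        grad.append(expectation(f, hi) - expectation(f, lo))
    return grad
```

The exact gradient needs two conditional expectations per coordinate, which is the "multiple queries" cost the abstract refers to, whereas `reinforce_estimate` touches `f` once per sampled `x`.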
[LG-77] Beware Untrusted Simulators – Reward-Free Backdoor Attacks in Reinforcement Learning ICLR2026
链接: https://arxiv.org/abs/2602.05089
作者: Ethan Rathbun,Wo Wei Lin,Alina Oprea,Christopher Amato
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 10 pages main body, ICLR 2026
Abstract:Simulated environments are a key piece in the success of Reinforcement Learning (RL), allowing practitioners and researchers to train decision-making agents without running expensive experiments on real hardware. Simulators remain a security blind spot, however, enabling adversarial developers to alter the dynamics of their released simulators for malicious purposes. Therefore, in this work we highlight a novel threat, demonstrating how simulator dynamics can be exploited to stealthily implant action-level backdoors into RL agents. The backdoor then allows an adversary to reliably activate targeted actions in an agent upon observing a predefined "trigger", leading to potentially dangerous consequences. Traditional backdoor attacks are limited by their strong threat models, which assume the adversary has near full control over an agent's training pipeline, enabling them to both alter and observe the agent's rewards. As these assumptions are infeasible to implement within a simulator, we propose a new attack, "Daze", which is able to reliably and stealthily implant backdoors into RL agents trained for real-world tasks without altering or even observing their rewards. We provide formal proof of Daze's effectiveness in guaranteeing attack success across general RL tasks along with extensive empirical evaluations on both discrete and continuous action space domains. We additionally provide the first example of RL backdoor attacks transferring to real, robotic hardware. These developments motivate further research into securing all components of the RL training pipeline to prevent malicious attacks.
[LG-78] Reinforcement Learning Enhancement Using Vector Semantic Representation and Symbolic Reasoning for Human-Centered Autonomous Emergency Braking
链接: https://arxiv.org/abs/2602.05079
作者: Vinal Asodia,Iman Sharifi,Saber Fallah
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 12 pages, 7 figures, 5 tables
Abstract:The problem with existing camera-based Deep Reinforcement Learning approaches is twofold: they rarely integrate high-level scene context into the feature representation, and they rely on rigid, fixed reward functions. To address these challenges, this paper proposes a novel pipeline that produces a neuro-symbolic feature representation that encompasses semantic, spatial, and shape information, as well as spatially boosted features of dynamic entities in the scene, with an emphasis on safety-critical road users. It also proposes a Soft First-Order Logic (SFOL) reward function that balances human values via a symbolic reasoning module. Here, semantic and spatial predicates are extracted from segmentation maps and applied to linguistic rules to obtain reward weights. Quantitative experiments in the CARLA simulation environment show that the proposed neuro-symbolic representation and SFOL reward function improved policy robustness and safety-related performance metrics compared to baseline representations and reward formulations across varying traffic densities and occlusion levels. The findings demonstrate that integrating holistic representations and soft reasoning into Reinforcement Learning can support more context-aware and value-aligned decision-making for autonomous driving.
[LG-79] Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
链接: https://arxiv.org/abs/2602.05065
作者: Yizhou Xu,Pierfrancesco Beneventano,Isaac Chuang,Liu Ziyin
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, SGD training has no a priori preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the label-noise spectrum. We reproduce this key insight in controlled settings with different model architectures such as MLP, RNN, and transformers.
[LG-80] Scaling Laws for Embedding Dimension in Information Retrieval
链接: https://arxiv.org/abs/2602.05062
作者: Julian Killingback,Mahta Rafiee,Madine Manas,Hamed Zamani
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 9 Pages, 7 figures
Abstract:Dense retrieval, which encodes queries and documents into a single dense vector, has become the dominant neural retrieval approach due to its simplicity and compatibility with fast approximate nearest neighbor algorithms. As the tasks dense retrieval performs grow in complexity, the fundamental limitations of the underlying data structure and similarity metric, namely vectors and inner products, become more apparent. Recent work has shown theoretical limitations inherent to single vectors and inner products that are generally tied to the embedding dimension. Given the importance of embedding dimension for retrieval capacity, understanding how dense retrieval performance changes as embedding dimension is scaled is fundamental to building next-generation retrieval models that balance effectiveness and efficiency. In this work, we conduct a comprehensive analysis of the relationship between embedding dimension and retrieval performance. Our experiments include two model families and a range of model sizes from each to construct a detailed picture of embedding scaling behavior. We find that the scaling behavior fits a power law, allowing us to derive scaling laws for performance given only embedding dimension, as well as a joint law accounting for embedding dimension and model size. Our analysis shows that for evaluation tasks aligned with the training task, performance continues to improve as embedding size increases, though with diminishing returns. For evaluation data that is less aligned with the training task, we find that performance is less predictable, with performance degrading with larger embedding dimensions for certain tasks. We hope our work provides additional insight into the limitations of embeddings and their behavior, and offers a practical guide for selecting model and embedding dimension to achieve optimal performance with reduced storage and compute costs.
[LG-81] Quantile-Physics Hybrid Framework for Safe-Speed Recommendation under Diverse Weather Conditions Leveraging Connected Vehicle and Road Weather Information Systems Data
链接: https://arxiv.org/abs/2602.05053
作者: Wen Zhang,Adel W. Sadek,Chunming Qiao
类目: Machine Learning (cs.LG)
*备注: This work was presented as a poster at the 2026 Transportation Research Board (TRB) Annual Meeting
Abstract:Inclement weather conditions can significantly impact driver visibility and tire-road surface friction, requiring adjusted safe driving speeds to reduce crash risk. This study proposes a hybrid predictive framework that recommends real-time safe speed intervals for freeway travel under diverse weather conditions. Leveraging high-resolution Connected Vehicle (CV) data and Road Weather Information System (RWIS) data collected in Buffalo, NY, from 2022 to 2023, we construct a spatiotemporally aligned dataset containing over 6.6 million records across 73 days. The core model employs Quantile Regression Forests (QRF) to estimate vehicle speed distributions in 10-minute windows, using 26 input features that capture meteorological, pavement, and temporal conditions. To enforce safety constraints, a physics-based upper speed limit is computed for each interval based on real-time road grip and visibility, ensuring that vehicles can safely stop within their sight distance. The final recommended interval fuses QRF-predicted quantiles with both posted speed limits and the physics-derived upper bound. Experimental results demonstrate strong predictive performance: the QRF model achieves a mean absolute error of 1.55 mph, with 96.43% of median speed predictions within 5 mph, a PICP (50%) of 48.55%, and robust generalization across weather types. The model’s ability to respond to changing weather conditions and generalize across road segments shows promise for real-world deployment, thereby improving traffic safety and reducing weather-related crashes.
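The physics-based cap described in this abstract (a vehicle must be able to stop within its sight distance) can be sketched with the usual stopping-distance relation d = v * t_r + v^2 / (2 * mu * g). The abstract does not give the paper's exact formula or fusion rule, so everything below is an assumption made for illustration: the reaction time, the friction coefficient, the function names, and the simple clipping used to fuse the bounds.

```python
import math

G = 9.81              # gravitational acceleration, m/s^2
MPS_TO_MPH = 2.23694  # metres per second to miles per hour

def stopping_distance_m(speed_mph, friction, reaction_time_s=1.5):
    """Reaction distance plus braking distance on a level road."""
    v = speed_mph / MPS_TO_MPH
    return v * reaction_time_s + v * v / (2.0 * friction * G)

def physics_speed_limit_mph(sight_distance_m, friction, reaction_time_s=1.5):
    """Largest speed whose stopping distance equals the sight distance.

    Solves v * t_r + v^2 / (2 * mu * g) = d for the positive root v.
    """
    mu_g = friction * G
    v = mu_g * (-reaction_time_s
                + math.sqrt(reaction_time_s ** 2 + 2.0 * sight_distance_m / mu_g))
    return v * MPS_TO_MPH

def fuse_interval(q_low, q_high, posted_mph, physics_mph):
    """Clip a predicted speed interval by the posted and physics-derived caps.

    Clipping is an assumed fusion rule, chosen here for simplicity.
    """
    upper = min(q_high, posted_mph, physics_mph)
    lower = min(q_low, upper)
    return lower, upper
```

For example, `fuse_interval(50, 70, 65, 58)` returns `(50, 58)`: the physics-derived cap is tighter than both the predicted upper quantile and the posted limit, so it determines the upper end of the recommendation.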
[LG-82] Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm
链接: https://arxiv.org/abs/2602.05052
作者: Shizheng Wen,Mingyuan Chi,Tianwei Yu,Ben Moseley,Mike Yan Michelis,Pu Ren,Hao Sun,Siddhartha Mishra
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from a novel highly-optimized and GPU-compliant TensorGalerkin framework for linear system assembly (stiffness matrices and load vectors). TensorGalerkin operates by tensorizing element-wise operations within a Python-level Map stage and then performs global reduction with a sparse matrix multiplication that performs message passing on the mesh-induced sparsity graph. It can be seamlessly employed downstream as i) a highly-efficient numerical PDEs solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. With multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant computational efficiency and accuracy gains over a variety of baselines in all the targeted downstream applications.
[LG-83] Feedback Control for Multi-Objective Graph Self-Supervision
链接: https://arxiv.org/abs/2602.05036
作者: Karish Grover,Theodore Vasiloudis,Han Xie,Sixing Lu,Xiang Song,Christos Faloutsos
类目: Machine Learning (cs.LG)
*备注:
Abstract:Can multi-task self-supervised learning on graphs be coordinated without the usual tug-of-war between objectives? Graph self-supervised learning (SSL) offers a growing toolbox of pretext objectives: mutual information, reconstruction, contrastive learning; yet combining them reliably remains a challenge due to objective interference and training instability. Most multi-pretext pipelines use per-update mixing, forcing every parameter update to be a compromise, leading to three failure modes: Disagreement (conflict-induced negative transfer), Drift (nonstationary objective utility), and Drought (hidden starvation of underserved objectives). We argue that coordination is fundamentally a temporal allocation problem: deciding when each objective receives optimization budget, not merely how to weigh them. We introduce ControlG, a control-theoretic framework that recasts multi-objective graph SSL as feedback-controlled temporal allocation by estimating per-objective difficulty and pairwise antagonism, planning target budgets via a Pareto-aware log-hypervolume planner, and scheduling with a Proportional-Integral-Derivative (PID) controller. Across 9 datasets, ControlG consistently outperforms state-of-the-art baselines, while producing an auditable schedule that reveals which objectives drove learning.
[LG-84] Causal Representation Meets Stochastic Modeling under Generic Geometry
链接: https://arxiv.org/abs/2602.05033
作者: Jiaxu Ren,Yixin Wang,Biwei Huang
类目: Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注: Source codes and data will be available upon the publication of this work
Abstract:Learning meaningful causal representations from observations has emerged as a crucial task for facilitating machine learning applications and driving scientific discoveries in fields such as climate science, biology, and physics. This process involves disentangling high-level latent variables and their causal relationships from low-level observations. Previous work in this area that achieves identifiability typically focuses on cases where the observations are either i.i.d. or follow a latent discrete-time process. Nevertheless, many real-world settings require identifying latent variables that are continuous-time stochastic processes (e.g., multivariate point processes). To this end, we develop identifiable causal representation learning for continuous-time latent stochastic point processes. We study its identifiability by analyzing the geometry of the parameter space. Furthermore, we develop MUTATE, an identifiable variational autoencoder framework with a time-adaptive transition module to infer stochastic dynamics. Across simulated and empirical studies, we find that MUTATE can effectively answer scientific questions, such as the accumulation of mutations in genomics and the mechanisms driving neuron spike triggers in response to time-varying dynamics.
[LG-85] Laplacian Representations for Decision-Time Planning
Link: https://arxiv.org/abs/2602.05031
Authors: Dikshant Shehmar,Matthew Schlegel,Matthew E. Taylor,Marlos C. Machado
Categories: Machine Learning (cs.LG)
Abstract:Planning with a learned model remains a key challenge in model-based reinforcement learning (RL). In decision-time planning, state representations are critical as they must support local cost computation while preserving long-horizon structure. In this paper, we show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales. This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons. Building on these properties, we introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.
[LG-86] A Simple Reduction Scheme for Constrained Contextual Bandits with Adversarial Contexts via Regression
Link: https://arxiv.org/abs/2602.05019
Authors: Dhruv Sarkar,Abhishek Sinha
Categories: Machine Learning (cs.LG)
Abstract:We study constrained contextual bandits (CCB) with adversarially chosen contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We consider the continuing setting, in which the algorithm operates over the entire horizon even after the budget is exhausted. In this setting, the objective is to simultaneously control regret and cumulative constraint violation. Building on the seminal SquareCB framework of Foster et al. (2018), we propose a simple and modular algorithmic scheme that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to most prior work on CCB, which focuses on stochastic contexts, our reduction yields improved guarantees for the more general adversarial context setting, together with a compact and transparent analysis.
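The SquareCB framework mentioned above selects actions by inverse gap weighting over the reward predictions of a regression oracle. The sketch below shows only that sampling rule; the function name and the exploration parameter `gamma` are illustrative, and the paper's surrogate reward construction is not reproduced (the input list simply stands in for whatever predicted values the oracle returns).

```python
def inverse_gap_weighting(predicted_rewards, gamma):
    """Return an action distribution via inverse gap weighting.

    Each non-greedy action a gets probability 1 / (K + gamma * gap_a),
    where gap_a is its predicted shortfall relative to the greedy action;
    the greedy action receives the remaining probability mass.
    """
    k = len(predicted_rewards)
    best = max(range(k), key=lambda a: predicted_rewards[a])
    probs = [0.0] * k
    for a in range(k):
        if a != best:
            gap = predicted_rewards[best] - predicted_rewards[a]
            probs[a] = 1.0 / (k + gamma * gap)
    probs[best] = 1.0 - sum(probs)
    return probs


probs = inverse_gap_weighting([0.9, 0.5, 0.1], gamma=10.0)
```

A larger `gamma` concentrates probability on the greedy action, while `gamma = 0` gives the uniform distribution over the K actions.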
[LG-87] Private PoEtry: Private In-Context Learning via Product of Experts
Link: https://arxiv.org/abs/2602.05012
Authors: Rob Romijnders,Mohammad Mahdi Derakhshani,Jonathan Petit,Max Welling,Christos Louizos,Yuki M. Asano
Categories: Machine Learning (cs.LG)
Comments: 8 pages
Abstract:In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks with only a small set of examples at inference time, thereby avoiding task-specific fine-tuning. However, in-context examples may contain privacy-sensitive information that should not be revealed through model outputs. Existing differential privacy (DP) approaches to ICL are either computationally expensive or rely on heuristics with limited effectiveness, including context oversampling, synthetic data generation, or unnecessary thresholding. We reformulate private ICL through the lens of a Product-of-Experts model. This gives a theoretically grounded framework, and the algorithm can be trivially parallelized. We evaluate our method across five datasets in text classification, math, and vision-language. We find that our method improves accuracy by more than 30 percentage points on average compared to prior DP-ICL methods, while maintaining strong privacy guarantees.
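For readers unfamiliar with the term, a Product-of-Experts combines several predictive distributions by multiplying them and renormalizing. The sketch below shows only that generic combination rule, assuming each "expert" is a class distribution obtained from one in-context example; it contains no privacy mechanism and is not the paper's algorithm. The function name is hypothetical.

```python
import math


def product_of_experts(expert_probs):
    """Multiply per-expert class distributions and renormalize.

    expert_probs: list of distributions, each a list of strictly positive
    class probabilities of the same length.
    """
    num_classes = len(expert_probs[0])
    # Work in log space so many small probabilities do not underflow.
    log_scores = [
        sum(math.log(p[c]) for p in expert_probs) for c in range(num_classes)
    ]
    shift = max(log_scores)
    weights = [math.exp(s - shift) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


combined = product_of_experts([[0.6, 0.4], [0.6, 0.4]])
```

Since each expert contributes an independent term to the sum of log-probabilities, the per-expert computations can run in parallel, which is consistent with the parallelization remark in the abstract.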
[LG-88] Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives
Link: https://arxiv.org/abs/2602.04990
Authors: Ioannis Anagnostides,Itai Zilberstein,Zachary W. Sollie,Arman Kilic,Tuomas Sandholm
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Abstract:The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely a static optimization problem, but rather a complex game involving transplant centers, clinicians, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, and fairness in the face of strategic behavior from the various constituent groups.
[LG-89] Improving Set Function Approximation with Quasi-Arithmetic Neural Networks ICLR2026
Link: https://arxiv.org/abs/2602.04941
Authors: Tomas Tokar,Scott Sanner
Categories: Machine Learning (cs.LG)
Comments: To appear at ICLR 2026
Abstract: Sets represent a fundamental abstraction across many types of data. To handle the unordered nature of set-structured data, models such as DeepSets and PointNet rely on fixed, non-learnable pooling operations (e.g., sum or max), a design choice that can hinder the transferability of learned embeddings and limit model expressivity. More recently, learnable aggregation functions have been proposed as more expressive alternatives. In this work, we advance this line of research by introducing the Neuralized Kolmogorov Mean (NKM), a novel, trainable framework for learning a generalized measure of central tendency through an invertible neural function. We further propose quasi-arithmetic neural networks (QUANNs), which incorporate the NKM as a learnable aggregation function. We provide a theoretical analysis showing that QUANNs are universal approximators for a broad class of common set-function decompositions and, thanks to their invertible neural components, learn more structured latent representations. Empirically, QUANNs outperform state-of-the-art baselines across diverse benchmarks, while learning embeddings that transfer effectively even to tasks that do not involve sets.
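The quasi-arithmetic (Kolmogorov) mean underlying this entry has the closed form f^{-1}((1/n) * sum_i f(x_i)) for an invertible function f. The sketch below uses fixed, hand-picked choices of f to show how familiar means arise as special cases; in the paper f is a learned invertible network, which is not modeled here. The function name is illustrative.

```python
import math


def quasi_arithmetic_mean(values, f, f_inv):
    """Compute f_inv(mean(f(x))) for an invertible function f."""
    return f_inv(sum(f(v) for v in values) / len(values))


data = [1.0, 4.0]
arithmetic = quasi_arithmetic_mean(data, lambda x: x, lambda y: y)
geometric = quasi_arithmetic_mean(data, math.log, math.exp)
harmonic = quasi_arithmetic_mean(data, lambda x: 1.0 / x, lambda y: 1.0 / y)
```

Because the inner operation is a plain average, the result does not depend on the order of the inputs, which is the property a set aggregation function needs.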
[LG-90] Transolver-3: Scaling Up Transformer Solvers to Industrial-Scale Geometries
Link: https://arxiv.org/abs/2602.04940
Authors: Hang Zhou,Haixu Wu,Haonan Shangguan,Yuezhou Ma,Huikun Weng,Jianmin Wang,Mingsheng Long
Categories: Machine Learning (cs.LG)
Abstract: Deep learning has emerged as a transformative tool for the neural surrogate modeling of partial differential equations (PDEs), known as neural PDE solvers. However, scaling these solvers to industrial-scale geometries with over 10^8 cells remains a fundamental challenge due to the prohibitive memory complexity of processing high-resolution meshes. We present Transolver-3, a new member of the Transolver family and a highly scalable framework designed for high-fidelity physics simulations. To bridge the gap between limited GPU capacity and the resolution requirements of complex engineering tasks, we introduce two key architectural optimizations: faster slice and deslice operations that exploit the associative property of matrix multiplication, and geometry slice tiling to partition the computation of physical states. Combined with an amortized training strategy that learns on random subsets of the original high-resolution meshes and a physical state caching technique during inference, Transolver-3 enables high-fidelity field prediction on industrial-scale meshes. Extensive experiments demonstrate that Transolver-3 is capable of handling meshes with over 160 million cells, achieving impressive performance across three challenging simulation benchmarks, including aircraft and automotive design tasks.
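The associativity argument can be made concrete with a generic three-matrix chain. The sketch below is not Transolver-3's slice/deslice code; the shapes (a large mesh dimension `n`, a small slice dimension `m`, a channel dimension `c`) and all function names are assumptions chosen to show why the bracketing matters.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_cost(n, k, m):
    """Multiply-add count for an (n x k) @ (k x m) product."""
    return n * k * m


def chain_costs(n, m, c):
    """Costs of A @ B @ C with A: n x m, B: m x n, C: n x c."""
    left_first = matmul_cost(n, m, n) + matmul_cost(n, n, c)   # (A @ B) @ C
    right_first = matmul_cost(m, n, c) + matmul_cost(n, m, c)  # A @ (B @ C)
    return left_first, right_first


A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0, 2], [0, 1, 3]]
C = [[1, 2], [3, 4], [5, 6]]
same_result = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
left_cost, right_cost = chain_costs(n=1000, m=32, c=64)
```

Both bracketings give the same product, but the first materializes an n x n intermediate while the second only ever forms an m x c one, so for large `n` the second ordering is far cheaper in both arithmetic and memory.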
[LG-91] Comparing Euclidean and Hyperbolic K-Means for Generalized Category Discovery
Link: https://arxiv.org/abs/2602.04932
Authors: Mohamad Dalal,Thomas B. Moeslund,Joakim Bruslund Haurum
Categories: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 11 pages, 4 figures. To be published at VISAPP
Abstract:Hyperbolic representation learning has been widely used to extract implicit hierarchies within data, and recently it has found its way to the open-world classification task of Generalized Category Discovery (GCD). However, prior hyperbolic GCD methods only use hyperbolic geometry for representation learning and transform back to Euclidean geometry when clustering. We hypothesize this is suboptimal. Therefore, we present Hyperbolic Clustered GCD (HC-GCD), which learns embeddings in the Lorentz Hyperboloid model of hyperbolic geometry, and clusters these embeddings directly in hyperbolic space using a hyperbolic K-Means algorithm. We test our model on the Semantic Shift Benchmark datasets, and demonstrate that HC-GCD is on par with the previous state-of-the-art hyperbolic GCD method. Furthermore, we show that using hyperbolic K-Means leads to better accuracy than Euclidean K-Means. We carry out ablation studies showing that clipping the norm of the Euclidean embeddings leads to decreased accuracy in clustering unseen classes, and increased accuracy for seen classes, while the overall accuracy is dataset dependent. We also show that using hyperbolic K-Means leads to more consistent clusters when varying the label granularity.
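To make "clustering directly in hyperbolic space" concrete, the sketch below implements the geodesic distance on the Lorentz hyperboloid and the assignment step of K-Means under that distance. It assumes curvature -1 and omits the centroid update, so it is a partial illustration rather than the HC-GCD algorithm; all function names are hypothetical.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to (sqrt(1 + |v|^2), v) on the unit hyperboloid."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); the clamp guards against rounding."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


def assign(points, centroids):
    """K-Means assignment step: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: lorentz_distance(p, centroids[j]))
        for p in points
    ]


origin = lift_to_hyperboloid([0.0, 0.0])
unit_away = lift_to_hyperboloid([math.sinh(1.0), 0.0])
points = [lift_to_hyperboloid([0.1, 0.0]), lift_to_hyperboloid([2.0, 0.0])]
centroids = [origin, lift_to_hyperboloid([2.1, 0.0])]
labels = assign(points, centroids)
```

Swapping `lorentz_distance` for the Euclidean distance on the same embeddings gives the Euclidean K-Means baseline that the abstract compares against.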
[LG-92] TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation ICLR2026
Link: https://arxiv.org/abs/2602.04929
Authors: Junhan Kim,Yeo Jeong Park,Seungwoo Son,Chungman Lee,Ho-young Kim,Joonyoung Kim,Yongkweon Jeon
Categories: Machine Learning (cs.LG)
Comments: ICLR 2026
Abstract:The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ’s assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at this https URL.
[LG-93] Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics
Link: https://arxiv.org/abs/2602.04928
Authors: Ruizhe Zhong,Jiesong Lian,Xiaoyue Mi,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Junchi Yan
Categories: Machine Learning (cs.LG)
Abstract:While online Reinforcement Learning has emerged as a crucial technique for aligning flow matching models with human preferences, current approaches are hindered by inefficient exploration during training rollouts. Relying on undirected stochasticity and sparse outcome rewards, these methods struggle to discover high-reward samples, resulting in data-inefficient and slow optimization. To address these limitations, we propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled Stochastic Differential Equation that explicitly incorporates the gradient of a Process Reward Model into the flow drift. This design enables dense, step-by-step steering toward high-reward regions, advancing beyond the unguided exploration in prior works, and theoretically encompasses existing sampling methods (e.g., Flow-GRPO, DanceGRPO) as special cases. We further derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model. We instantiate this framework with a Dual-Reward Group Relative Policy Optimization algorithm, combining latent process rewards for efficient credit assignment with pixel-level outcome rewards for final visual fidelity. Experiments on text-to-video generation show that Euphonium achieves better alignment compared to existing methods while accelerating training convergence by 1.66x.
[LG-94] Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering
Link: https://arxiv.org/abs/2602.04924
Authors: Dinh Phu Tran,Jihoon Jeong,Saad Wazir,Seongah Kim,Thao Do,Cem Subakan,Daeyoung Kim
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments: Technical Report
Abstract: We present a formal problem formulation for Reliable Audio-Visual Question Answering ($\mathcal{R}$-AVQA), where we prefer abstention over answering incorrectly. While recent AVQA models have high accuracy, their ability to identify when they are likely wrong and to abstain from answering in those cases remains underexplored. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance the performance of $\mathcal{R}$-AVQA. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes-optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, our ACR maintains it as a primary confidence signal and applies input-adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low-magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head to determine MSP trustworthiness. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods in in-distribution, out-of-distribution, and data-bias settings across three different AVQA architectures, establishing a solid foundation for the $\mathcal{R}$-AVQA task. The code and checkpoints will be available upon acceptance at this https URL.
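The Maximum Softmax Probability baseline that ACR builds on can be sketched in a few lines: answer with the top class when its softmax probability clears a threshold, and abstain otherwise. This shows the baseline only; the two learned heads of ACR are not reproduced, and the function names and threshold value are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_or_abstain(logits, threshold):
    """Return (answer_index or None, confidence) using MSP as the confidence."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None, confidence  # abstain instead of risking a wrong answer
    return probs.index(confidence), confidence


confident = answer_or_abstain([4.0, 0.0, 0.0], threshold=0.8)
uncertain = answer_or_abstain([0.1, 0.0, 0.0], threshold=0.8)
```

Raising the threshold trades coverage for reliability: fewer questions are answered, but a well-calibrated confidence signal should make the answered ones more likely to be correct.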
[LG-95] Imposing Boundary Conditions on Neural Operators via Learned Function Extensions
Link: https://arxiv.org/abs/2602.04923
Authors: Sepehr Mousavi,Siddhartha Mishra,Laura De Lorenzis
Categories: Machine Learning (cs.LG)
Abstract:Neural operators have emerged as powerful surrogates for the solution of partial differential equations (PDEs), yet their ability to handle general, highly variable boundary conditions (BCs) remains limited. Existing approaches often fail when the solution operator exhibits strong sensitivity to boundary forcings. We propose a general framework for conditioning neural operators on complex non-homogeneous BCs through function extensions. Our key idea is to map boundary data to latent pseudo-extensions defined over the entire spatial domain, enabling any standard operator learning architecture to consume boundary information. The resulting operator, coupled with an arbitrary domain-to-domain neural operator, can learn rich dependencies on complex BCs and input domain functions at the same time. To benchmark this setting, we construct 18 challenging datasets spanning Poisson, linear elasticity, and hyperelasticity problems, with highly variable, mixed-type, component-wise, and multi-segment BCs on diverse geometries. Our approach achieves state-of-the-art accuracy, outperforming baselines by large margins, while requiring no hyperparameter tuning across datasets. Overall, our results demonstrate that learning boundary-to-domain extensions is an effective and practical strategy for imposing complex BCs in existing neural operator frameworks, enabling accurate and robust scientific machine learning models for a broader range of PDE-governed problems.
[LG-96] CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning NEURIPS2025
Link: https://arxiv.org/abs/2602.04920
Authors: Ronghao Lin,Qiaolin He,Sijie Mai,Ying Zeng,Aolin Xiong,Li Huang,Yap-Peng Tan,Haifeng Hu
Categories: Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted by NeurIPS 2025
Abstract: Multimodal machine learning, which mimics the human brain’s ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, the presence of modalities is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and fail to remain robust when modalities are dynamically missing. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by applying token- and label-level Information Bottleneck (IB) cyclically among various modalities. By capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the information missing from incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through forward and reverse propagation processes. With the help of the extracted and reconstructed informative latents, CyIN succeeds in jointly optimizing complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
[LG-97] Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
Link: https://arxiv.org/abs/2602.04919
Authors: Yiran Zhao,Shengyang Zhou,Zijian Wu,Tongyan Hu,Yuhui Xu,Rengan Dou,Kenji Kawaguchi,Shafiq Joty,Junnan Li,Michael Qizhe Shieh
Categories: Machine Learning (cs.LG)
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in model performance on reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the compression process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach, reminiscent of the “boiling frog” effect, enables the model to be progressively compressed without abrupt performance loss. Experimental results show that PTL can compress LLMs to nearly half their original size with only lightweight post-training, while maintaining performance comparable to the original model on reasoning tasks. Moreover, PTL is flexible and can be applied to various pruning strategies, such as neuron pruning and layer pruning, as well as different post-training methods, including continual pre-training and reinforcement learning. Additionally, experimental results confirm the effectiveness of PTL on a variety of tasks beyond mathematical reasoning, such as code generation, demonstrating its broad applicability.
[LG-98] Multi-Aspect Mining and Anomaly Detection for Heterogeneous Tensor Streams WWW’26
Link: https://arxiv.org/abs/2602.04917
Authors: Soshi Kakio,Yasuko Matsubara,Ren Fujiwara,Yasushi Sakurai
Categories: Machine Learning (cs.LG)
Comments: Proceedings of the ACM Web Conference 2026 (WWW '26), April 13–17, 2026, Dubai, United Arab Emirates, 12 pages
Abstract: Analysis and anomaly detection in event tensor streams consisting of timestamps and multiple attributes, such as communication logs (time, IP address, packet length), are essential tasks in data mining. While existing tensor decomposition and anomaly detection methods provide useful insights, they face the following two limitations. (i) They cannot handle heterogeneous tensor streams, which comprise both categorical attributes (e.g., IP address) and continuous attributes (e.g., packet length). They typically require either discretizing continuous attributes or treating categorical attributes as continuous, both of which distort the underlying statistical properties of the data. In addition, incorrect assumptions about the distribution family of continuous attributes often degrade the model’s performance. (ii) They discretize timestamps, failing to track the temporal dynamics of streams (e.g., trends, abnormal events), which makes them ineffective for detecting anomalies at the group level, referred to as ‘group anomalies’ (e.g., DoS attacks). To address these challenges, we propose HeteroComp, a method for continuously summarizing heterogeneous tensor streams into ‘components’ representing latent groups in each attribute and their temporal dynamics, and detecting group anomalies. Our method employs Gaussian process priors to model the unknown distributions of continuous attributes and the temporal dynamics, directly estimating probability densities from data. Extracted components give concise but effective summarization, enabling accurate group anomaly detection. Extensive experiments on real datasets demonstrate that HeteroComp outperforms state-of-the-art algorithms in group anomaly detection accuracy, and its computational time does not depend on the data stream length.
[LG-99] Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment
Link: https://arxiv.org/abs/2602.04909
Authors: Youngjae Cho,Jongsuk Kim,Ji-Hoon Kim
Categories: Machine Learning (cs.LG)
Comments: Under Review
Abstract:Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
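For context, the fixed-reference objective that GAPO modifies is the standard DPO loss, which compares the policy's log-probability margin on a preference pair against the same margin under a frozen reference policy. The sketch below computes that baseline loss for a single pair from scalar log-probabilities; it does not implement GAPO's adversarial anchor or Anchor Gap weighting, and the argument names and default `beta` are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    return softplus(-margin)


at_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response relative to the rejected one more than the reference does.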
[LG-100] LISA: Laplacian In-context Spectral Analysis
Link: https://arxiv.org/abs/2602.04906
Authors: Julio Candanedo
Categories: Machine Learning (cs.LG)
Abstract:We propose Laplacian In-context Spectral Analysis (LISA), a method for inference-time adaptation of Laplacian-based time-series models using only an observed prefix. LISA combines delay-coordinate embeddings and Laplacian spectral learning to produce diffusion-coordinate state representations, together with a frozen nonlinear decoder for one-step prediction. We introduce lightweight latent-space residual adapters based on either Gaussian-process regression or an attention-like Markov operator over context windows. Across forecasting and autoregressive rollout experiments, LISA improves over the frozen baseline and is often most beneficial under changing dynamics. This work links in-context adaptation to nonparametric spectral methods for dynamical systems.
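The delay-coordinate embedding step mentioned above turns a scalar time series into vectors of lagged values. The sketch below shows only that preprocessing step, with hypothetical parameter names `dim` and `lag`; the Laplacian spectral learning, decoder, and adapters from the abstract are not included.

```python
def delay_embed(series, dim, lag=1):
    """Return delay vectors [x_t, x_{t-lag}, ..., x_{t-(dim-1)*lag}] for each valid t."""
    span = (dim - 1) * lag
    return [
        [series[t - k * lag] for k in range(dim)]
        for t in range(span, len(series))
    ]


embedded = delay_embed([0, 1, 2, 3, 4], dim=3, lag=1)
sparse = delay_embed([0, 1, 2, 3, 4], dim=3, lag=2)
```

Each row is one reconstructed state; a series of length T yields T - (dim - 1) * lag rows, so longer lags or higher dimensions need a longer observed prefix.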
[LG-101] Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
Link: https://arxiv.org/abs/2602.04903
Authors: Eitan Sprejer,Oscar Agustín Stanchi,María Victoria Carro,Denise Alejandra Mester,Iván Arcuschin
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 5 figures
Abstract:Feature steering has emerged as a promising approach for controlling LLM behavior through direct manipulation of internal representations, offering advantages over prompt engineering. However, its practical effectiveness in real-world applications remains poorly understood, particularly regarding potential trade-offs with output quality. We show that feature steering methods substantially degrade model performance even when successfully controlling target behaviors, a critical trade-off. Specifically, we evaluate Goodfire’s Auto Steer against prompt engineering baselines across 14 steering queries (covering innocuous and safety-relevant behaviors) on 171 Massive Multitask Language Understanding (MMLU) questions using Llama-8B and Llama-70B, measuring accuracy, coherence, and behavioral control. Our findings show that Auto Steer successfully modifies target behaviors (achieving scores of 3.33 vs. 2.98 for prompting on Llama-8B and 3.57 vs. 3.10 on Llama-70B), but causes dramatic performance degradation: accuracy on the MMLU questions drops from 66% to 46% on Llama-8B and 87% to 73% on Llama-70B, with coherence falling from 4.62 to 2.24 and 4.94 to 3.89 respectively. Simple prompting achieves the best overall balance. These findings highlight limitations of current feature steering methods for practical deployment where task performance cannot be sacrificed. More broadly, our work demonstrates that mechanistic control methods face fundamental capability-behavior trade-offs that must be empirically characterized before deployment.
[LG-102] Privacy Amplification Persists under Unlimited Synthetic Data Release
Link: https://arxiv.org/abs/2602.04895
Authors: Clément Pierquin,Aurélien Bellet,Marc Tommasi,Matthieu Boussard
Categories: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We study privacy amplification by synthetic data release, a phenomenon in which differential privacy guarantees are improved by releasing only synthetic data rather than the private generative model itself. Recent work by Pierquin et al. (2025) established the first formal amplification guarantees for a linear generator, but they apply only in asymptotic regimes where the model dimension far exceeds the number of released synthetic records, limiting their practical relevance. In this work, we show a surprising result: under a bounded-parameter assumption, privacy amplification persists even when releasing an unbounded number of synthetic records, thereby improving upon the bounds of Pierquin et al. (2025). Our analysis provides structural insights that may guide the development of tighter privacy guarantees for more complex release mechanisms.
[LG-103] Diffusion Models Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold
Link: https://arxiv.org/abs/2602.06021
Authors: Ye He,Yitong Qiu,Molei Tao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR)
Abstract: When a diffusion model is not memorizing the training data set, how exactly does it generalize? A quantitative understanding of the distribution it generates would be beneficial to, for example, an assessment of the model’s performance for downstream applications. We thus explicitly characterize what a diffusion model generates by proposing a log-density ridge manifold and quantifying how the generated data relate to this manifold as the inference dynamics progress. More precisely, inference undergoes a reach-align-slide process centered around the ridge manifold: trajectories first reach a neighborhood of the manifold, then align as they are pushed toward or away from the manifold in normal directions, and finally slide along the manifold in tangent directions. Within the scope of this general behavior, different training errors lead to different normal and tangent motions, which can be quantified, and these detailed motions characterize when inter-mode generations emerge. A more detailed understanding of training dynamics leads to a more accurate quantification of the generation inductive bias; we consider the example of a random feature model, for which we explicitly illustrate how a diffusion model’s inductive biases originate as a composition of architectural bias and training accuracy, and how they evolve with the inference dynamics. Experiments on synthetic multimodal distributions and MNIST latent diffusion support the predicted directional effects, in both low and high dimensions.
[LG-104] Causal Inference on Stopped Random Walks in Online Advertising
Link: https://arxiv.org/abs/2602.05997
Authors: Jia Yuan Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Abstract: We consider a causal inference problem frequently encountered in online advertising systems, where a publisher (e.g., Instagram, TikTok) interacts repeatedly with human users and advertisers by sporadically displaying to each user an advertisement selected through an auction. Each treatment corresponds to a parameter value of the advertising mechanism (e.g., auction reserve-price), and we want to estimate through experiments the corresponding long-term treatment effect (e.g., annual advertising revenue). In our setting, the treatment affects not only the instantaneous revenue from showing an ad, but also changes each user’s interaction-trajectory and each advertiser’s bidding policy, as the latter is constrained by a finite budget. In particular, a treatment may even affect the size of the population, since users interact longer with a tolerable advertising mechanism. We drop the classical i.i.d. assumption and model the experiment measurements (e.g., advertising revenue) as a stopped random walk, and use a budget-splitting experimental design, the Anscombe Theorem, a Wald-like equation, and a Central Limit Theorem to construct confidence intervals for the long-term treatment effect.
[LG-105] Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences
Link: https://arxiv.org/abs/2602.05927
Authors: Siquan Li,Yao Tong,Haonan Wang,Tianyang Hu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism’s intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control. 
[LG-106] Universal approximation with signatures of non-geometric rough paths
Link: https://arxiv.org/abs/2602.05898
Authors: Mihriban Ceylan, Anna P. Kwossek, David J. Prömel
Categories: Probability (math.PR); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
Abstract:We establish a universal approximation theorem for signatures of rough paths that are not necessarily weakly geometric. By extending the path with time and its rough path bracket terms, we prove that linear functionals of the signature of the resulting rough paths approximate continuous functionals on rough path spaces uniformly on compact sets. Moreover, we construct the signature of a path extended by its pathwise quadratic variation terms based on general pathwise stochastic integration à la Föllmer, in particular, allowing for pathwise Itô, Stratonovich, and backward Itô integration. In a probabilistic setting, we obtain a universal approximation result for linear functionals of the signature of continuous semimartingales extended by the quadratic variation terms, defined via stochastic Itô integration. Numerical examples illustrate the use of signatures when the path is extended by time and quadratic variation in the context of model calibration and option pricing in mathematical finance.
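As a rough illustration of the objects involved, the sketch below computes the first two signature levels of a piecewise-linear path after extending it with a time coordinate. It uses Chen's identity for the ordinary (geometric) signature, so it does not reproduce the non-geometric, bracket-extended construction described in the abstract. The function name `signature_level2` and the sample path are hypothetical.

```python
def signature_level2(points):
    """Levels 1 and 2 of the signature of the piecewise-linear path through `points`."""
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(points, points[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        for i in range(d):
            for j in range(d):
                # Chen's identity for appending one linear segment to the path
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


# Hypothetical 1D path observed at three times, with time as the first coordinate.
times = [0.0, 0.5, 1.0]
values = [0.0, 1.0, 0.5]
path = [[t, x] for t, x in zip(times, values)]
level1, level2 = signature_level2(path)
```

For a geometric signature the symmetric part of level 2 is determined by level 1 (`level2[i][j] + level2[j][i] == level1[i] * level1[j]`); extending the path by bracket or quadratic-variation terms is what lets the non-geometric case carry extra information.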
[LG-107] Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity
Link: https://arxiv.org/abs/2602.05869
Authors: Hengrui Luo, Anna Ma, Ludovic Stephan, Yizhe Zhu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Statistics Theory (math.ST)
Comments: 58 pages, 3 figures
Abstract:We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion. We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries. Unlike the standard uniform entry model (i.e., i.i.d. samples from $[n]^k$), wedge sampling allocates observations to structured length-two patterns (wedges) in an associated bipartite sampling graph. By directly promoting these length-two connections, the sampling design strengthens the spectral signal that underlies efficient initialization, in regimes where uniform sampling is too sparse to generate enough informative correlations. Our main result shows that this change in sampling paradigm enables polynomial-time algorithms to achieve both weak and exact recovery with nearly linear sample complexity in $n$. The approach is also plug-and-play: wedge-sampling-based spectral initialization can be combined with existing refinement procedures (e.g., spectral or gradient-based methods) using only an additional $\tilde{O}(n)$ uniformly sampled entries, substantially improving over the $\tilde{O}(n^{k/2})$ sample complexity typically required under uniform entry sampling for efficient methods. Overall, our results suggest that the statistical-to-computational gap highlighted in Barak and Moitra (2022) is, to a large extent, a consequence of the uniform entry sampling model for tensor completion, and that alternative non-adaptive measurement designs that guarantee a strong initialization can overcome this barrier.
[LG-108] Distribution-free two-sample testing with blurred total variation distance
Link: https://arxiv.org/abs/2602.05862
Authors: Rohan Hore, Rina Foygel Barber
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 47 pages, 4 figures
Abstract:Two-sample testing, where we aim to determine whether two distributions are equal or not equal based on samples from each one, is challenging if we cannot place assumptions on the properties of the two distributions. In particular, certifying equality of distributions, or even providing a tight upper bound on the total variation (TV) distance between the distributions, is impossible to achieve in a distribution-free regime. In this work, we examine the blurred TV distance, a relaxation of TV distance that enables us to perform inference without assumptions on the distributions. We provide theoretical guarantees for distribution-free upper and lower bounds on the blurred TV distance, and examine its properties in high dimensions.
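For reference, the plain (unblurred) TV distance between two distributions on a finite alphabet is half the L1 distance between their probability vectors. The sketch below computes a naive plug-in estimate from two samples. It does not implement the blurred TV distance, whose definition is not given in the abstract, and the helper names and sample data are hypothetical.

```python
def total_variation(p, q):
    """TV distance between two discrete distributions given as dicts outcome -> probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def empirical_distribution(sample):
    """Relative frequency of each outcome in the sample."""
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    return {x: c / len(sample) for x, c in counts.items()}


# Hypothetical samples from two unknown distributions.
sample_a = ["a", "a", "b", "c"]
sample_b = ["a", "b", "b", "b"]
tv_hat = total_variation(empirical_distribution(sample_a),
                         empirical_distribution(sample_b))
```

A plug-in estimate like `tv_hat` carries no distribution-free guarantee, which is the gap the relaxation in the abstract is meant to address.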
[LG-109] Optimal scaling laws in learning hierarchical multi-index models
Link: https://arxiv.org/abs/2602.05846
Authors: Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:In this work, we provide a sharp theory of scaling laws for two-layer neural networks trained on a class of hierarchical multi-index targets, in a genuinely representation-limited regime. We derive exact information-theoretic scaling laws for subspace recovery and prediction error, revealing how the hierarchical features of the target are sequentially learned through a cascade of phase transitions. We further show that these optimal rates are achieved by a simple, target-agnostic spectral estimator, which can be interpreted as the small learning-rate limit of gradient descent on the first-layer weights. Once an adapted representation is identified, the readout can be learned statistically optimally, using an efficient procedure. As a consequence, we provide a unified and rigorous explanation of scaling laws, plateau phenomena, and spectral structure in shallow neural networks trained on such hierarchical targets.
[LG-110] Non-Stationary Inventory Control with Lead Times
Link: https://arxiv.org/abs/2602.05799
Authors: Nele H. Amiri, Sean R. Sinclair, Maximiliano Udenio
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We study non-stationary single-item, periodic-review inventory control problems in which the demand distribution is unknown and may change over time. We analyze how demand non-stationarity affects learning performance across inventory models, including systems with demand backlogging or lost-sales, both with and without lead times. For each setting, we propose an adaptive online algorithm that optimizes over the class of base-stock policies and establish performance guarantees in terms of dynamic regret relative to the optimal base-stock policy at each time step. Our results reveal a sharp separation across inventory models. In backlogging systems and lost-sales models with zero lead time, we show that it is possible to adapt to demand changes without incurring additional performance loss in stationary environments, even without prior knowledge of the demand distributions or the number of demand shifts. In contrast, for lost-sales systems with positive lead times, we establish weaker guarantees that reflect fundamental limitations imposed by delayed replenishment in combination with censored feedback. Our algorithms leverage the convexity and one-sided feedback structure of inventory costs to enable counterfactual policy evaluation despite demand censoring. We complement the theoretical analysis with simulation results showing that our methods significantly outperform existing benchmarks.
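To make the policy class concrete, the sketch below evaluates base-stock policies in the simplest setting mentioned in the abstract: backlogging with zero lead time, where ordering up to a level each period gives a per-period cost that depends only on that period's demand. It is a stationary, full-information illustration rather than the adaptive online algorithm from the paper. The cost rates, demand history, and function names are hypothetical.

```python
def base_stock_cost(level, demands, holding=1.0, backlog=3.0):
    """Average per-period cost of ordering up to `level` each period
    (zero lead time, unmet demand is backlogged)."""
    total = 0.0
    for d in demands:
        total += holding * max(level - d, 0.0) + backlog * max(d - level, 0.0)
    return total / len(demands)


def best_base_stock(demands, candidates, **costs):
    """Candidate level with the lowest average cost on the observed demands."""
    return min(candidates, key=lambda s: base_stock_cost(s, demands, **costs))


# Hypothetical demand history and integer candidate levels.
demands = [4, 7, 5, 9, 6, 5, 8, 6]
best_level = best_base_stock(demands, range(0, 13))
```

The average cost is convex in the level, so a higher backlog rate pushes the minimizer toward an upper quantile of demand. With lost sales, demand above the stock level is censored, so this direct evaluation is no longer available from observed sales alone.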
[LG-111] Learning False Discovery Rate Control via Model-Based Neural Networks ICASSP
Link: https://arxiv.org/abs/2602.05798
Authors: Arnau Vilella, Jasin Machkour, Michael Muma, Daniel P. Palomar
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments: Accepted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Abstract:Controlling the false discovery rate (FDR) in high-dimensional variable selection requires balancing rigorous error control with statistical power. Existing methods with provable guarantees are often overly conservative, creating a persistent gap between the realized false discovery proportion (FDP) and the target FDR level. We introduce a learning-augmented enhancement of the T-Rex Selector framework that narrows this gap. Our approach replaces the analytical FDP estimator with a neural network trained solely on diverse synthetic datasets, enabling a substantially tighter and more accurate approximation of the FDP. This refinement allows the procedure to operate much closer to the desired FDR level, thereby increasing discovery power while maintaining effective approximate control. Through extensive simulations and a challenging synthetic genome-wide association study (GWAS), we demonstrate that our method achieves superior detection of true variables compared to existing approaches.
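The two quantities contrasted in the abstract are easy to state on synthetic data, where the true active set is known: the FDP is the fraction of selected variables that are false discoveries in one experiment, and the FDR is its expectation over repeated experiments. The sketch below computes the FDP and the true positive proportion for one hypothetical selection. It does not implement the T-Rex Selector or the learned FDP estimator, and the function names are assumptions.

```python
def false_discovery_proportion(selected, true_support):
    """Fraction of selected variables that are not truly active (0 if nothing is selected)."""
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(true_support)) / len(selected)


def true_positive_proportion(selected, true_support):
    """Fraction of truly active variables that were selected (0 if there are none)."""
    true_support = set(true_support)
    if not true_support:
        return 0.0
    return len(set(selected) & true_support) / len(true_support)


# Hypothetical synthetic experiment: variable indices 1..5 are truly active.
true_support = [1, 2, 3, 4, 5]
selected = [1, 2, 3, 7, 9]
fdp = false_discovery_proportion(selected, true_support)
tpp = true_positive_proportion(selected, true_support)
```

Averaging `fdp` over many synthetic replications gives a Monte Carlo estimate of the FDR, which can then be compared with the target level.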
[LG-112] PMT Waveform Simulation and Reconstruction with Conditional Diffusion Network
Link: https://arxiv.org/abs/2602.05767
Authors: Kainan Liu, Jingyu Huang, Guihong Huang, Jianyi Luo
Categories: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Abstract:Photomultiplier tubes (PMTs) are widely employed in particle and nuclear physics experiments. The accuracy of PMT waveform reconstruction directly impacts the detector’s spatial and energy resolution. A key challenge arises when multiple photons arrive within a few nanoseconds, making it difficult to resolve individual photoelectrons (PEs). Although supervised deep learning methods have surpassed traditional methods in performance, their practical applicability is limited by the lack of ground-truth PE labels in real data. To address this issue, we propose an innovative weakly supervised waveform simulation and reconstruction approach based on a bidirectional conditional diffusion network framework. The method is fully data-driven and requires only raw waveforms and coarse estimates of PE information as input. It first employs a PE-conditioned diffusion model to simulate realistic waveforms from PE sequences, thereby learning the features of overlapping waveforms. Subsequently, these simulated waveforms are used to train a waveform-conditioned diffusion model to reconstruct the PE sequences from waveforms, reinforcing the learning of features of overlapping waveforms. Through iterative refinement between the two conditional diffusion processes, the model progressively improves reconstruction accuracy. Experimental results demonstrate that the proposed method achieves 99% of the normalized PE-number resolution averaged over 1-5 p.e. and 80% of the timing resolution attained by fully supervised learning.
[LG-113] Fast Rates for Nonstationary Weighted Risk Minimization
Link: https://arxiv.org/abs/2602.05742
Authors: Tobias Brock, Thomas Nagler
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract:Weighted empirical risk minimization is a common approach to prediction under distribution drift. This article studies its out-of-sample prediction error under nonstationarity. We provide a general decomposition of the excess risk into a learning term and an error term associated with distribution drift, and prove oracle inequalities for the learning error under mixing conditions. The learning bound holds uniformly over arbitrary weight classes and accounts for the effective sample size induced by the weight vector, the complexity of the weight and hypothesis classes, and potential data dependence. We illustrate the applicability and sharpness of our results in (auto-) regression problems with linear models, basis approximations, and neural networks, recovering minimax-optimal rates (up to logarithmic factors) when specialized to unweighted and stationary settings.
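One common way to quantify the effective sample size of a weight vector is Kish's formula, (sum of weights) squared divided by the sum of squared weights. The sketch below applies it to uniform weights and to exponentially decaying weights that favor recent observations. Whether the paper uses exactly this definition is an assumption, and the function names and decay rate are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def exponential_weights(n, decay):
    """Weights decay**(n-1-t) for t = 0..n-1, so the most recent observation has weight 1."""
    return [decay ** (n - 1 - t) for t in range(n)]


uniform_ess = effective_sample_size([1.0] * 100)
decayed_ess = effective_sample_size(exponential_weights(100, 0.9))
```

Concentrating weight on recent data lowers the effective sample size, which inflates the learning term, while typically shrinking the error attributable to drift. That is the trade-off the decomposition in the abstract separates.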
[LG-114] Broken neural scaling laws in materials science
Link: https://arxiv.org/abs/2602.05702
Authors: Max Großmann, Malte Grunert, Erich Runge
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Abstract:In materials science, data are scarce and expensive to generate, whether computationally or experimentally. Therefore, it is crucial to identify how model performance scales with dataset size and model capacity to distinguish between data- and model-limited regimes. Neural scaling laws provide a framework for quantifying this behavior and guide the design of materials datasets and machine learning architectures. Here, we investigate neural scaling laws for a paradigmatic materials science task: predicting the dielectric function of metals, a high-dimensional response that governs how solids interact with light. Using over 200,000 dielectric functions from high-throughput ab initio calculations, we study two multi-objective graph neural networks trained to predict the frequency-dependent complex interband dielectric function and the Drude frequency. We observe broken neural scaling laws with respect to dataset size, whereas scaling with the number of model parameters follows a simple power law that rapidly saturates.
[LG-115] Efficient Algorithms for Robust Markov Decision Processes with s-Rectangular Ambiguity Sets
Link: https://arxiv.org/abs/2602.05591
Authors: Chin Pang Ho, Marek Petrik, Wolfram Wiesemann
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Abstract:Robust Markov decision processes (MDPs) have attracted significant interest due to their ability to protect MDPs from poor out-of-sample performance in the presence of ambiguity. In contrast to classical MDPs, which account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, a robust MDP additionally accounts for ambiguity by optimizing against the most adverse transition kernel from an ambiguity set constructed via historical data. In this paper, we develop a unified solution framework for a broad class of robust MDPs with $s$-rectangular ambiguity sets, where the most adverse transition probabilities are considered independently for each state. Using our algorithms, we show that $s$-rectangular robust MDPs with $1$- and $2$-norm as well as $\phi$-divergence ambiguity sets can be solved several orders of magnitude faster than with state-of-the-art commercial solvers, and often only a logarithmic factor slower than classical MDPs. We demonstrate the favorable scaling properties of our algorithms on a range of synthetically generated as well as standard benchmark instances.
[LG-116] Reducing the Complexity of Matrix Multiplication to $O(N^2 \log_2 N)$ by an Asymptotically Optimal Quantum Algorithm
Link: https://arxiv.org/abs/2602.05541
Authors: Jiaqi Yao, Ding Liu
Categories: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Abstract:Matrix multiplication is a fundamental classical computing operation whose efficiency becomes a major challenge at scale, especially for machine learning applications. Quantum computing, with its inherent parallelism and exponential storage capacity, offers a potential solution to these limitations. This work presents a quantum kernel-based matrix multiplication algorithm (QKMM) that achieves an asymptotically optimal computational complexity of $O(N^2 \log_2 N)$, outperforming the classical optimal complexity of $O(N^{2.371552})$, where $N$ denotes the matrix dimension. Through noiseless and noisy quantum simulation experiments, we demonstrate that the proposed algorithm not only exhibits superior theoretical efficiency but also shows practical advantages in runtime performance and stability.
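The two growth rates quoted in the abstract can be compared with a few lines of arithmetic. The sketch below evaluates the ratio of $N^{2.371552}$ to $N^2 \log_2 N$ at several sizes. It ignores every constant factor, along with data loading and readout costs, so it says nothing about actual runtimes or about the claims made in the paper. The function name is hypothetical.

```python
import math

OMEGA = 2.371552  # classical exponent quoted in the abstract


def growth_ratio(n):
    """Ratio n**OMEGA / (n**2 * log2(n)), ignoring all constant factors."""
    return n ** (OMEGA - 2) / math.log2(n)


ratios = {n: growth_ratio(n) for n in (16, 1024, 2 ** 20)}
```

The ratio is below 1 for small sizes and grows slowly afterwards, so the asymptotic ordering becomes visible only for fairly large matrices even before constants are considered.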
[LG-117] Solving Stochastic Variational Inequalities without the Bounded Variance Assumption
Link: https://arxiv.org/abs/2602.05531
Authors: Ahmet Alacaoglu, Jun-Hyun Kim
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We analyze algorithms for solving stochastic variational inequalities (VI) without the bounded variance or bounded domain assumptions, where our main focus is min-max optimization with possibly unbounded constraint sets. We focus on two classes of problems: monotone VIs; and structured nonmonotone VIs that admit a solution to the weak Minty VI. The latter assumption allows us to solve structured nonconvex-nonconcave min-max problems. For both classes of VIs, to make the expected residual norm less than $\varepsilon$, we show an oracle complexity of $\widetilde{O}(\varepsilon^{-4})$, which is the best-known for constrained VIs. In our setting, this complexity had been obtained with the bounded variance assumption in the literature, which is not even satisfied for bilinear min-max problems with an unbounded domain. We obtain this complexity for stochastic oracles whose variance can grow as fast as the squared norm of the optimization variable.
[LG-118] Convergence Rate of the Last Iterate of Stochastic Proximal Algorithms
Link: https://arxiv.org/abs/2602.05489
Authors: Kevin Kurian Thomas Vaidyan, Michael P. Friedlander, Ahmet Alacaoglu
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We analyze two classical algorithms for solving additively composite convex optimization problems where the objective is the sum of a smooth term and a nonsmooth regularizer: proximal stochastic gradient method for a single regularizer; and the randomized incremental proximal method, which uses the proximal operator of a randomly selected function when the regularizer is given as the sum of many nonsmooth functions. We focus on relaxing the bounded variance assumption that is common, yet stringent, for getting last iterate convergence rates. We prove the $\widetilde{O}(1/\sqrt{T})$ rate of convergence for the last iterate of both algorithms under componentwise convexity and smoothness, which is optimal up to log terms. Our results apply directly to graph-guided regularizers that arise in multi-task and federated learning, where the regularizer decomposes as a sum over edges of a collaboration graph.
[LG-119] Variance Reduction Based Experience Replay for Policy Optimization
Link: https://arxiv.org/abs/2602.05379
Authors: Hua Zheng, Wei Xie, M. Ben Feng, Keilung Choy
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages, 4 figures. arXiv admin note: text overlap with arXiv:2208.12341
Abstract:Effective reinforcement learning (RL) for complex stochastic systems requires leveraging historical data collected in previous iterations to accelerate policy optimization. Classical experience replay treats all past observations uniformly and fails to account for their varying contributions to learning. To overcome this limitation, we propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation. VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods, forming the basis of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Motivated by the lack of rigorous theoretical analysis of experience replay, we develop a novel framework that explicitly captures dependencies introduced by Markovian dynamics and behavior-policy interactions. Using this framework, we establish finite-time convergence guarantees for PG-VRER and reveal a fundamental bias-variance trade-off: reusing older experience increases bias but simultaneously reduces gradient variance. Extensive empirical experiments demonstrate that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.
[LG-120] Decision-Focused Sequential Experimental Design: A Directional Uncertainty-Guided Approach
Link: https://arxiv.org/abs/2602.05340
Authors: Beichen Wan, Mo Liu, Paul Grigas, Zuo-Jun Max Shen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Abstract:We consider the sequential experimental design problem in the predict-then-optimize paradigm. In this paradigm, the outputs of the prediction model are used as coefficient vectors in a downstream linear optimization problem. Traditional sequential experimental design aims to control the input variables (features) so that the improvement in prediction accuracy from each experimental outcome (label) is maximized. However, in the predict-then-optimize setting, performance is ultimately evaluated based on the decision loss induced by the downstream optimization, rather than by prediction error. This mismatch between prediction accuracy and decision loss renders traditional decision-blind designs inefficient. To address this issue, we propose a directional-based metric to quantify predictive uncertainty. This metric does not require solving an optimization oracle and is therefore computationally tractable. We show that the resulting sequential design criterion enjoys strong consistency and convergence guarantees. Under a broad class of distributions, we demonstrate that our directional uncertainty-based design attains an earlier stopping time than decision-blind designs. This advantage is further supported by real-world experiments on an LLM job allocation problem.
[LG-121] Logarithmic-time Schedules for Scaling Language Models with Momentum
Link: https://arxiv.org/abs/2602.05298
Authors: Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract:In practice, the hyperparameters $(\beta_1, \beta_2)$ and weight-decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(\beta_1, \beta_2, \lambda)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer’s gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to 40% compute efficiency relative to a tuned AdamW, with gains that persist, and even improve, as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight-decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.
[LG-122] Radon–Wasserstein Gradient Flows for Interacting-Particle Sampling in High Dimensions
Link: https://arxiv.org/abs/2602.05227
Authors: Elias Hess-Childs, Dejan Slepčev, Lantian Xu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Methodology (stat.ME)
Comments: 49 pages, 7 figures
Abstract:Gradient flows of the Kullback–Leibler (KL) divergence, such as the Fokker–Planck equation and Stein Variational Gradient Descent, evolve a distribution toward a target density known only up to a normalizing constant. We introduce new gradient flows of the KL divergence with a remarkable combination of properties: they admit accurate interacting-particle approximations in high dimensions, and the per-step cost scales linearly in both the number of particles and the dimension. These gradient flows are based on new transportation-based Riemannian geometries on the space of probability measures: the Radon–Wasserstein geometry and the related Regularized Radon–Wasserstein (RRW) geometry. We define these geometries using the Radon transform so that the gradient-flow velocities depend only on one-dimensional projections. This yields interacting-particle-based algorithms whose per-step cost follows from efficient Fast Fourier Transform-based evaluation of the required 1D convolutions. We additionally provide numerical experiments that study the performance of the proposed algorithms and compare convergence behavior and quantization. Finally, we prove some theoretical results including well-posedness of the flows and long-time convergence guarantees for the RRW flow.
[LG-123] Finite-Particle Rates for Regularized Stein Variational Gradient Descent
Link: https://arxiv.org/abs/2602.05172
Authors: Ye He, Krishnakumar Balasubramanian, Sayan Banerjee, Promit Ghosal
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Abstract:We derive finite-particle rates for the regularized Stein variational gradient descent (R-SVGD) algorithm introduced by He et al. (2024) that corrects the constant-order bias of the SVGD by applying a resolvent-type preconditioner to the kernelized Wasserstein gradient. For the resulting interacting $N$-particle system, we establish explicit non-asymptotic bounds for time-averaged (annealed) empirical measures, illustrating convergence in the true (non-kernelized) Fisher information and, under a $\mathrm{W}_1\mathrm{I}$ condition on the target, corresponding $\mathrm{W}_1$ convergence for a large class of smooth kernels. Our analysis covers both continuous- and discrete-time dynamics and yields principled tuning rules for the regularization parameter, step size, and averaging horizon that quantify the trade-off between approximating the Wasserstein gradient flow and controlling finite-particle estimation error.
[LG-124] Path Sampling for Rare Events Boosted by Machine Learning
Link: https://arxiv.org/abs/2602.05167
Authors: Porhouy Minh, Sapna Sarupria
Categories: Computational Physics (physics.comp-ph); Soft Condensed Matter (cond-mat.soft); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: 7 pages, 1 figure
Abstract:The study by Jung et al. (Jung H, Covino R, Arjun A, et al., Nat Comput Sci. 3:334-345 (2023)) introduced Artificial Intelligence for Molecular Mechanism Discovery (AIMMD), a novel sampling algorithm that integrates machine learning to enhance the efficiency of transition path sampling (TPS). By enabling on-the-fly estimation of the committor probability and simultaneously deriving a human-interpretable reaction coordinate, AIMMD offers a robust framework for elucidating the mechanistic pathways of complex molecular processes. This commentary provides a discussion and critical analysis of the core AIMMD framework, explores its recent extensions, and offers an assessment of the method’s potential impact and limitations.
[LG-125] Learning fermionic linear optics with Heisenberg scaling and physical operations
Link: https://arxiv.org/abs/2602.05058
Authors: Aria Christensen, Andrew Zhao
Categories: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 56 pages
Abstract:We revisit the problem of learning fermionic linear optics (FLO), also known as fermionic Gaussian unitaries. Given black-box query access to an unknown FLO, previous proposals required $\widetilde{\mathcal{O}}(n^5 / \varepsilon^2)$ queries, where $n$ is the system size and $\varepsilon$ is the error in diamond distance. These algorithms also use unphysical operations (i.e., violating fermionic superselection rules) and/or $n$ auxiliary modes to prepare Choi states of the FLO. In this work, we establish efficient and experimentally friendly protocols that obey superselection, use minimal ancilla (at most 1 extra mode), and exhibit improved dependence on both parameters $n$ and $\varepsilon$. For arbitrary (active) FLOs this algorithm makes at most $\widetilde{\mathcal{O}}(n^4 / \varepsilon)$ queries, while for number-conserving (passive) FLOs we show that $\mathcal{O}(n^3 / \varepsilon)$ queries suffice. The complexity of the active case can be further reduced to $\widetilde{\mathcal{O}}(n^3 / \varepsilon)$ at the cost of using $n$ ancilla. This marks the first FLO learning algorithm that attains Heisenberg scaling in precision. As a side result, we also demonstrate an improved copy complexity of $\widetilde{\mathcal{O}}(n \eta^2 / \varepsilon^2)$ for time-efficient state tomography of $\eta$-particle Slater determinants in $\varepsilon$ trace distance, which may be of independent interest.
[LG-126] Instance-optimal high-precision shadow tomography with few-copy measurements: A metrological approach
Link: https://arxiv.org/abs/2602.04952
Authors: Senrui Chen, Weiyuan Gong, Sisi Zhou
Categories: Quantum Physics (quant-ph); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 67 pages
Abstract:We study the sample complexity of shadow tomography in the high-precision regime under realistic measurement constraints. Given an unknown $d$-dimensional quantum state $\rho$ and a known set of observables $\{O_i\}_{i=1}^m$, the goal is to estimate expectation values $\{\mathrm{tr}(O_i\rho)\}_{i=1}^m$ to accuracy $\epsilon$ in $L_p$-norm, using possibly adaptive measurements that act on $O(\mathrm{polylog}(d))$ number of copies of $\rho$ at a time. We focus on the regime where $\epsilon$ is below an instance-dependent threshold. Our main contribution is an instance-optimal characterization of the sample complexity as $\tilde{\Theta}(\Gamma_p/\epsilon^2)$, where $\Gamma_p$ is a function of $\{O_i\}_{i=1}^m$ defined via an optimization formula involving the inverse Fisher information matrix. Previously, tight bounds were known only in special cases, e.g. Pauli shadow tomography with $L_\infty$-norm error. Concretely, we first analyze a simpler oblivious variant where the goal is to estimate an observable of the form $\sum_{i=1}^m \alpha_i O_i$ with $\|\alpha\|_q = 1$ (where $q$ is dual to $p$) revealed after the measurement. For single-copy measurements, we obtain a sample complexity of $\Theta(\Gamma^{\mathrm{ob}}_p/\epsilon^2)$. We then show $\tilde{\Theta}(\Gamma_p/\epsilon^2)$ is necessary and sufficient for the original problem, with the lower bound applying to unbiased, bounded estimators. Our upper bounds rely on a two-step algorithm combining coarse tomography with local estimation. Notably, $\Gamma^{\mathrm{ob}}_\infty = \Gamma_\infty$. In both cases, allowing $c$-copy measurements improves the sample complexity by at most $\Omega(1/c)$. Our results establish a quantitative correspondence between quantum learning and metrology, unifying asymptotic metrological limits with finite-sample learning guarantees.
[LG-127] Beyond Independent Genes: Learning Module-Inductive Representations for Gene Perturbation Prediction
Link: https://arxiv.org/abs/2602.04901
Authors: Jiafa Ruan, Ruijie Quan, Zongxin Yang, Liyang Xu, Yi Yang
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
Abstract:Predicting transcriptional responses to genetic perturbations is a central problem in functional genomics. In practice, perturbation responses are rarely gene-independent but instead manifest as coordinated, program-level transcriptional changes among functionally related genes. However, most existing methods do not explicitly model such coordination, due to gene-wise modeling paradigms and reliance on static biological priors that cannot capture dynamic program reorganization. To address these limitations, we propose scBIG, a module-inductive perturbation prediction framework that explicitly models coordinated gene programs. scBIG induces coherent gene programs from data via Gene-Relation Clustering, captures inter-program interactions through a Gene-Cluster-Aware Encoder, and preserves modular coordination using structure-aware alignment objectives. These structured representations are then modeled using conditional flow matching to enable flexible and generalizable perturbation prediction. Extensive experiments on multiple single-cell perturbation benchmarks show that scBIG consistently outperforms state-of-the-art methods, particularly on unseen and combinatorial perturbation settings, achieving an average improvement of 6.7% over the strongest baselines.
Information Retrieval
[IR-0] AgenticTagger: Structured Item Representation for Recommendation with LLM Agents
Link: https://arxiv.org/abs/2602.05945
Authors: Zhouhang Xie, Bo Peng, Zhankui He, Ziqi Chen, Alice Han, Isabella Ye, Benjamin Coleman, Noveen Sachdeva, Fernando Pereira, Julian McAuley, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang, Randolph Brown
Subjects: Information Retrieval (cs.IR)
Abstract:High-quality representations are a core requirement for effective recommendation. In this work, we study the problem of LLM-based descriptor generation, i.e., keyphrase-like natural language item representation generation frameworks with minimal constraints on downstream applications. We propose AgenticTagger, a framework that queries LLMs to represent items as sequences of text descriptors. However, open-ended generation provides little control over the generation space, leading to high-cardinality, low-performance descriptors that make downstream modeling challenging. To this end, AgenticTagger features two core stages: (1) a vocabulary building stage where a set of hierarchical, low-cardinality, and high-quality descriptors is identified, and (2) a vocabulary assignment stage where LLMs assign in-vocabulary descriptors to items. To effectively and efficiently ground the vocabulary in the item corpus of interest, we design a multi-agent reflection mechanism where an architect LLM iteratively refines the vocabulary, guided by parallelized feedback from annotator LLMs that validate the vocabulary against item data. Experiments on public and private data show AgenticTagger brings consistent improvements across diverse recommendation scenarios, including generative and term-based retrieval, ranking, and controllability-oriented, critique-based recommendation.
[IR-1] GLASS: A Generative Recommender for Long-sequence Modeling via SID-Tier and Semantic Search
Link: https://arxiv.org/abs/2602.05663
Authors: Shiteng Cao, Junda She, Ji Liu, Bin Zeng, Chengcheng Guo, Kuo Cai, Qiang Luo, Ruiming Tang, Han Li, Kun Gai, Zhiheng Li, Cheng Yang
Subjects: Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures
Abstract:Leveraging long-term user behavioral patterns is a key trajectory for enhancing the accuracy of modern recommender systems. While generative recommender systems have emerged as a transformative paradigm, they face hurdles in effectively modeling extensive historical sequences. To address this challenge, we propose GLASS, a novel framework that integrates long-term user interests into the generative process via SID-Tier and Semantic Search. We first introduce SID-Tier, a module that maps long-term interactions into a unified interest vector to enhance the prediction of the initial SID token. Unlike traditional retrieval models that struggle with massive item spaces, SID-Tier leverages the compact nature of the semantic codebook to incorporate cross features between the user’s long-term history and candidate semantic codes. Furthermore, we present semantic hard search, which uses the generated coarse-grained semantic IDs as dynamic keys to extract relevant historical behaviors; these are then fused via an adaptive gated fusion module to recalibrate the trajectory of subsequent fine-grained tokens. To address the inherent data sparsity in semantic hard search, we propose two strategies: semantic neighbor augmentation and codebook resizing. Extensive experiments on two large-scale real-world datasets, TAOBAO-MM and KuaiRec, demonstrate that GLASS outperforms state-of-the-art baselines, achieving significant gains in recommendation quality. Our code is made publicly available to facilitate further research in generative recommendation.
[IR-2] Forward Index Compression for Learned Sparse Retrieval
Link: https://arxiv.org/abs/2602.05445
Authors: Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Subjects: Information Retrieval (cs.IR)
Abstract:Text retrieval using learned sparse representations of queries and documents has, over the years, evolved into a highly effective approach to search. It is thanks to recent advances in approximate nearest neighbor search, with the emergence of highly efficient algorithms such as the inverted index-based Seismic and the graph-based HNSW, that retrieval with sparse representations became viable in practice. In this work, we scrutinize the efficiency of sparse retrieval algorithms and focus particularly on the size of a data structure that is common to all algorithmic flavors and that constitutes a substantial fraction of the overall index size: the forward index. In particular, we seek compression techniques to reduce the storage footprint of the forward index without compromising search quality or inner product computation latency. In our examination of various integer compression techniques, we report that StreamVByte achieves the best trade-off between memory footprint, retrieval accuracy, and latency. We then improve StreamVByte by introducing DotVByte, a new algorithm tailored to inner product computation. Experiments on MS MARCO show that our improvements lead to significant space savings while maintaining retrieval efficiency.
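To make the idea of integer compression for a forward index concrete, here is a minimal sketch, assuming a forward-index entry is a sorted list of component IDs plus one weight per component. It uses classic variable-byte (VByte) coding on the gaps between IDs; it is not StreamVByte and not the paper's DotVByte, and all function names are hypothetical. Weights are kept as an uncompressed list purely for illustration.

```python
from typing import Dict, Iterable, List


def vbyte_encode(values: Iterable[int]) -> bytes:
    """Encode non-negative integers with classic variable-byte coding."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("VByte encodes non-negative integers only")
        # 7 payload bits per byte; a set high bit means "more bytes follow".
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        out.append(v)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode."""
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values


def compress_components(component_ids: List[int]) -> bytes:
    """Delta-encode sorted component IDs, then VByte-encode the gaps."""
    gaps, prev = [], 0
    for cid in component_ids:
        gaps.append(cid - prev)  # negative gap (unsorted input) raises later
        prev = cid
    return vbyte_encode(gaps)


def decompress_components(data: bytes) -> List[int]:
    """Recover the original component IDs by prefix-summing the gaps."""
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids


def inner_product(query: Dict[int, float], data: bytes,
                  weights: List[float]) -> float:
    """Score a compressed document vector against a sparse query."""
    ids = decompress_components(data)
    return sum(query.get(c, 0.0) * w for c, w in zip(ids, weights))


ids = [3, 130, 131, 70000]
packed = compress_components(ids)
score = inner_product({3: 2.0, 131: 0.5}, packed, [1.0, 4.0, 2.0, 1.0])
```

In this example the four IDs occupy 6 bytes instead of 16 with fixed 32-bit integers, because small gaps fit in a single byte. The decode step sits on the scoring path, which is why the abstract treats inner product latency as a constraint alongside storage size.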
[IR-3] Rich-Media Re-Ranker: A User Satisfaction-Driven LLM Re-ranking Framework for Rich-Media Search
Link: https://arxiv.org/abs/2602.05408
Authors: Zihao Guo, Ligang Zhou, Zeyang Tang, Feicheng Li, Ying Nie, Zhiming Peng, Qingyun Sun, Jianxin Li
Subjects: Information Retrieval (cs.IR)
Abstract:Re-ranking plays a crucial role in modern information search systems by refining the ranking of initial search results to better satisfy user information needs. However, existing methods show two notable limitations in improving user search satisfaction: inadequate modeling of multifaceted user intents and neglect of rich side information such as visual perception signals. To address these challenges, we propose the Rich-Media Re-Ranker framework, which aims to enhance user search satisfaction through multi-dimensional and fine-grained modeling. Our approach begins with a Query Planner that analyzes the sequence of query refinements within a session to capture genuine search intents, decomposing the query into clear and complementary sub-queries to enable broader coverage of users’ potential intents. Subsequently, moving beyond primary text content, we integrate richer side information about candidate results, including signals that model visual content, generated by a VLM-based evaluator. These comprehensive signals are then processed alongside carefully designed re-ranking principles that consider multiple facets, including content relevance and quality, information gain, information novelty, and the visual presentation of cover images. The LLM-based re-ranker then performs a holistic evaluation based on these principles and integrated signals. To improve the scenario adaptability of the VLM-based evaluator and the LLM-based re-ranker, we further strengthen both through multi-task reinforcement learning. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines. Notably, the proposed framework has been deployed in a large-scale industrial search system, yielding substantial improvements in online user engagement rates and satisfaction metrics.
[IR-4] NeuCLIRTech: Chinese Monolingual and Cross-Language Information Retrieval Evaluation in a Challenging Domain
Link: https://arxiv.org/abs/2602.05334
Authors: Dawn Lawrie, James Mayfield, Eugene Yang, Andrew Yates, Sean MacAvaney, Ronak Pradeep, Scott Miller, Paul McNamee, Luca Soldaini
Subjects: Information Retrieval (cs.IR)
Comments: 14 pages, 6 figures
Abstract:Measuring advances in retrieval requires test collections with relevance judgments that can faithfully distinguish systems. This paper presents NeuCLIRTech, an evaluation collection for cross-language retrieval over technical information. The collection consists of technical documents written natively in Chinese and those same documents machine translated into English. It includes 110 queries with relevance judgments. The collection supports two retrieval scenarios: monolingual retrieval in Chinese, and cross-language retrieval with English as the query language. NeuCLIRTech combines the TREC NeuCLIR track topics of 2023 and 2024. The 110 queries with 35,962 document judgments provide strong statistical discriminatory power when trying to distinguish retrieval approaches. A fusion baseline of strong neural retrieval systems is included so that developers of reranking algorithms are not reliant on BM25 as their first-stage retriever. The dataset and artifacts are released on Hugging Face Datasets.
[IR-5] RAG without Forgetting: Continual Query-Infused Key Memory
Link: https://arxiv.org/abs/2602.05152
Authors: Yuntong Hu, Sha Li, Naren Ramakrishnan, Liang Zhao
Subjects: Information Retrieval (cs.IR)
Comments: 24 pages, 12 figures
Abstract:Retrieval-augmented generation (RAG) systems commonly improve robustness via query-time adaptations such as query expansion and iterative retrieval. While effective, these approaches are inherently stateless: adaptations are recomputed for each query and discarded thereafter, precluding cumulative learning and repeatedly incurring inference-time cost. Index-side approaches like key expansion introduce persistence but rely on offline preprocessing or heuristic updates that are weakly aligned with downstream task utility, leading to semantic drift and noise accumulation. We propose Evolving Retrieval Memory (ERM), a training-free framework that transforms transient query-time gains into persistent retrieval improvements. ERM updates the retrieval index through correctness-gated feedback, selectively attributes atomic expansion signals to the document keys they benefit, and progressively evolves keys via stable, norm-bounded updates. We show that query and key expansion are theoretically equivalent under standard similarity functions and prove convergence of ERM’s selective updates, amortizing optimal query expansion into a stable index with zero inference-time overhead. Experiments on BEIR and BRIGHT across 13 domains demonstrate consistent gains in retrieval and generation, particularly on reasoning-intensive tasks, at native retrieval speed.
[IR-6] Deterministic Retrieval at Scale: Optimal-Space LCP Indexing and 308x Energy Reduction on Modern GPUs
Link: https://arxiv.org/abs/2602.04936
Authors: Stanislav Byriukov
Subjects: Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
Abstract:We study deterministic top-k retrieval under Longest Common Prefix (LCP) similarity for N sequences of length L. We prove a tight Omega(N) space lower bound (cell-probe model) and present a trie-based index using O(N*L) space with O(L+k) query time. We contrast this with pairwise materialization (Theta(N^2)), which hits a practical OOM wall at scale, while our indexed approach remains O(N) in memory. We then introduce Thermal-Aware Logic (TAL), which turns prefix structure into range-bounded scans. In hardware measurements, TAL reduces energy per query by 308x (0.0145 J vs 4.46 J) and cuts p95 latency by 329x (0.114 ms vs 37.5 ms) on a 20M-item range-scan benchmark, while sustaining near-peak utilization (~99%) under long runs. The result is a deterministic retrieval primitive with receipts in regimes where approximate methods are unacceptable.
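As a rough illustration of deterministic top-k retrieval under LCP similarity with a trie, the sketch below walks the query down the trie as far as it matches, then unwinds from the deepest matched node so that sequences sharing longer prefixes are returned first. This is an assumption-laden toy, not the paper's index: the class and method names are hypothetical, it does not implement Thermal-Aware Logic, and its subtree traversal makes no claim to the O(L+k) query bound stated in the abstract.

```python
from typing import Dict, Hashable, Iterator, List, Optional, Sequence, Tuple


class TrieNode:
    __slots__ = ("children", "ids")

    def __init__(self) -> None:
        self.children: Dict[Hashable, "TrieNode"] = {}
        self.ids: List[int] = []  # ids of sequences ending at this node


class LCPIndex:
    """Toy trie index for top-k retrieval by longest common prefix."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, seq_id: int, seq: Sequence[Hashable]) -> None:
        node = self.root
        for sym in seq:
            node = node.children.setdefault(sym, TrieNode())
        node.ids.append(seq_id)

    def top_k(self, query: Sequence[Hashable], k: int) -> List[Tuple[int, int]]:
        """Return up to k (seq_id, lcp_length) pairs, longest LCP first."""
        if k <= 0:
            return []
        # Follow the query until the trie has no matching child.
        path = [self.root]
        for sym in query:
            nxt = path[-1].children.get(sym)
            if nxt is None:
                break
            path.append(nxt)
        results: List[Tuple[int, int]] = []
        skip: Optional[TrieNode] = None
        # Unwind: everything under path[d], except the already visited
        # child subtree, shares exactly d leading symbols with the query.
        for depth in range(len(path) - 1, -1, -1):
            for seq_id in self._collect(path[depth], skip):
                results.append((seq_id, depth))
                if len(results) == k:
                    return results
            skip = path[depth]
        return results

    @staticmethod
    def _collect(node: TrieNode, skip: Optional[TrieNode]) -> Iterator[int]:
        stack = [node]
        while stack:
            cur = stack.pop()
            if cur is skip:
                continue  # subtree already reported at a greater depth
            yield from cur.ids
            stack.extend(cur.children.values())


index = LCPIndex()
for i, s in enumerate(["abcd", "abce", "abxx", "zzzz"]):
    index.add(i, s)
hits = index.top_k("abcf", 3)
```

Here `hits` contains sequences 0 and 1 with LCP length 3, followed by sequence 2 with LCP length 2. Results are exact rather than approximate, and for a fixed insertion order the tie-breaking among equal LCP lengths is reproducible.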

