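Today's arXiv Papers | 2026-06-03

This post lists the latest papers retrieved from arXiv.org on 2026-06-03. It is updated automatically and grouped into six main areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is fetched from arXiv.org and refreshed automatically on a schedule, at around 12:30 each day.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day where possible.

Natural Language Processing: 31 papers (Computation and Language (cs.CL))
Artificial Intelligence: 58 papers (Artificial Intelligence (cs.AI))
Computer Vision: 49 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 48 papers (Machine Learning (cs.LG))
Multiagent Systems: 5 papers (Multiagent Systems (cs.MA))
Information Retrieval: 3 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 9 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems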

[MA-0] From What to How and Why: Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members

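[Quick read]: This paper addresses the difficulty of turning multi-modal tracking data into understandable narrative summaries that are meaningful to remote family members (RFMs). Although multi-modal tracking systems can continuously collect older adults' activity data, combining these heterogeneous streams into contextual, retrospective narratives remains challenging. Prior work has shown the promise of large language models (LLMs) for interpreting multi-modal data, but has paid little attention to narrative reports for RFMs, who have rich personal knowledge and strong emotional responsibility yet limited visibility into daily life. The key solution is a multi-layer, multi-agent, insight-driven summarization approach that builds from objective statistics and descriptions toward enriched, context-aware narratives. The authors customized the existing system Vital Insight to generate initial summaries, interviewed 11 RFMs, redesigned the system, and then surveyed the same 11 RFMs. The redesigned summaries showed significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive them. The study argues that AI-generated summaries should support RFMs' sensemaking shift from "What" data were collected to "How" their loved one is doing and "Why".

Link: https://arxiv.org/abs/2606.03876
Authors: Jiachen Li, Reina Szeyi Chan, Akshat Choube, Xiang Zhi Tan, Elizabeth Mynatt, Varun Mishra
Affiliations: Northeastern University
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: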

Abstract:With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs’ sensemaking shift from simply presenting “What” data were collected, to explaining “How” is my loved one doing and “Why”.

[MA-1] D2MDT: Department-aware Multidisciplinary Team Consultation with Deliberation for Efficient Clinical Prediction

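[Quick read]: This paper addresses the limited support for multidisciplinary clinical reasoning in clinical prediction from electronic health records (EHRs). Existing methods either rely on correlation-driven deep models or use a single large language model (LLM), and current EHR-grounded multi-agent systems suffer from weak evidence differentiation across agents and redundant multi-round interaction. The key solution is D2MDT (Department-aware MultiDisciplinary Team Consultation with Deliberation), a department-aware multi-agent consultation system. It first constructs structured EHR evidence and consultation-ready semantic evidence, then assigns patient-specific department perspectives to doctor agents and retrieves complementary evidence for them. To improve efficiency, it introduces residual deliberation, which updates only unresolved consensus instead of replaying the full discussion history. Finally, it fuses the refined consensus report with structured EHR representations for prediction. Experiments on mortality prediction show that D2MDT improves both predictive performance and consultation efficiency.

Link: https://arxiv.org/abs/2606.03543
Authors: Yongqi Liang, Qidong Liu, Chunze Yang, Lei Wu, Jiusong Ge, Ni Zhang, Chen Li
Affiliations: Xi’an Jiaotong University
Categories: Multiagent Systems (cs.MA)
Comments: Preprint. 17 pages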

Abstract:Electronic health records (EHRs) are central to clinical prediction, but existing methods either rely on correlation-driven deep models or use single large language models (LLMs), making it difficult to support multidisciplinary clinical reasoning. Recent multi-agent systems (MAS) provide a promising alternative, yet current EHR-grounded MAS methods still suffer from weak evidence differentiation across agents and redundant multi-round interaction. We propose D2MDT, a Department-aware MultiDisciplinary Team Consultation with Deliberation for Efficient clinical prediction. D2MDT first constructs structured EHR evidence and consultation-ready semantic evidence for multi-agent consultation. It then assigns patient-specific department perspectives to doctor agents and retrieves complementary evidence for collaborative consultation. To improve efficiency, D2MDT further introduces residual deliberation, which updates only unresolved consensus rather than replaying the full discussion history. Finally, D2MDT fuses the refined consensus report with structured EHR representations for prediction. Experiments on mortality prediction show that D2MDT improves both predictive performance and consultation efficiency. We release the code online to ease the reproducibility of this paper.

[MA-2] A formal definition and meta-model for a machine theory of mind

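[Quick read]: This paper addresses whether machines can possess a Theory of Mind (ToM), that is, whether they can understand their own and others' mental states (such as beliefs, intentions, and desires) and use them to predict and explain behavior. Its main contribution is what the author describes as the first rigorous formal definition of Machine Theory of Mind, grounded in evidence from cognitive psychology, neuroscience, and artificial intelligence, together with a general holistic meta-model. The paper uses this definition as a lens to examine the state of the art and current efforts in the field, outlines a potential agenda for further research, and examines how such models are empirically benchmarked.

Link: https://arxiv.org/abs/2606.03471
Authors: Fabio Cuzzolin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neurons and Cognition (q-bio.NC)
Comments: 48 pages, 2 figures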

Abstract:This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence, and uses the above as a lens to examine state-of-the-art and current efforts in the field, driving a potential agenda for further research there able to “crack” the problem. It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art when it comes to empirically benchmarking such models.

[MA-3] OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

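[Quick read]: This paper addresses trusted interconnection among open Agents across heterogeneous frameworks and multiple protocols. The core question is how to establish a unified, verifiable trust layer that does not depend on any particular interaction protocol. The key solution is OpenAgenet / OAN, a protocol-neutral trust layer whose architecture specifies the role architecture, identity objects, registration workflow, Root-governed lifecycle, Root-verified package model, authorization-aware Discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. OAN does not define the entire business conversation among Agents. It defines how Agent identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins, and it is intended to support heterogeneous frameworks and protocols, including MCP, A2A, ANP-like systems, and domain-specific Agent protocols.

Link: https://arxiv.org/abs/2606.03163
Authors: Jinliang Xu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: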

Abstract:This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection. It specifies the role architecture, identity objects, registration workflow, Root-governed lifecycle, Root-verified package model, authorization-aware Discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, and domain-specific Agent protocols. OAN does not define the entire business conversation among Agents; it defines how Agent identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.

[MA-4] OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

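[Quick read]: This paper addresses trust and identity verification for Agents that interconnect in open, multi-operator networks. When Agents move from isolated applications into such networks, an Agent must be able to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence before it discovers, selects, and invokes another Agent. The paper presents OpenAgenet (OAN), a protocol-neutral trust layer that provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation. It does not replace existing Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. The paper also covers the governance model, the relationship with MCP, A2A, and ANP, deployment patterns, the cooperation model, a blockchain-backed authorization bulletin, prototype status, performance profile, and roadmap.

Link: https://arxiv.org/abs/2606.03161
Authors: Jinliang Xu
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: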

Abstract:OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation. This paper presents the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, blockchain-backed authorization bulletin, prototype status, performance profile, and roadmap of OAN.

Natural Language Processing

[NLP-0] Neuron Populations Exhibit Divergent Selectivity with Scale

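[Quick read]: This paper asks whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. It studies Rosetta Neurons, a class of neurons whose activation patterns are similar across independently trained models. The number of Rosetta Neurons follows a sublinear power law in model size: it grows in absolute terms while making up a shrinking fraction of all neurons. The authors also report a Neuron Polarization Effect: with scale, Rosetta Neurons become more selective and increasingly monosemantic, separating from a growing non-Rosetta population that remains less selective. An analytical model that balances feature utility against limited neuron capacity explains both the sublinear scaling and the polarization. Rosetta Neurons also become more domain-specialized with scale, and a targeted data-filtering case study for continued pretraining illustrates their selectivity. The results point to a scaling law for interpretable, shared neuron-level structure that links model size to systematic changes in neuron universality, selectivity, and specialization.

Link: https://arxiv.org/abs/2606.03990
Authors: Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman
Affiliations: UC Berkeley; TTIC
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page and code: this https URL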

Abstract:We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
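
The scaling claim in this abstract can be made concrete with a short derivation. This is a sketch under stated assumptions: the symbols $N$ (total neuron count, used here as a stand-in for model size), $R(N)$ (number of Rosetta Neurons), $c$, and $\alpha$ are notation introduced for illustration, and the abstract does not report the fitted exponent.

```latex
\begin{align}
R(N) &= c\,N^{\alpha}, \qquad c > 0,\ 0 < \alpha < 1, \\
\frac{\mathrm{d}R}{\mathrm{d}N} &= c\,\alpha\,N^{\alpha-1} > 0, \\
\frac{R(N)}{N} &= c\,N^{\alpha-1} \to 0 \quad \text{as } N \to \infty .
\end{align}
```

Under these assumptions, a sublinear exponent is enough to give both observations at once: the absolute count keeps growing, while the fraction of Rosetta Neurons shrinks.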

[NLP-1] Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

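[Quick read]: This paper studies how accurately language models (LMs) compare quantities with measurement units, especially across different units (such as centimeters and meters). Accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The key finding is that LMs do not appear to convert both quantities to an exact shared scale (for example, meters) before comparing them. Instead, they rely on a set of heuristics over numerals and units. Linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. This suggests that LMs compare quantities through a bag of heuristics rather than an exact conversion mechanism.

Link: https://arxiv.org/abs/2606.03982
Authors: Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: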

Abstract:Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift model’s output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
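
The contrast between exact shared-scale conversion and cue-based heuristics can be illustrated with a small Python sketch. It is not the paper's surrogate model: the unit table, the function names, and the equal weights `w_num` and `w_unit` are assumptions chosen only to show how a linear combination of a numeral-difference cue and a unit-scale-difference cue can disagree with exact conversion.

```python
import math

# Hypothetical unit table (metres per unit); illustrative only.
UNIT_SCALE_M = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0}


def exact_compare(a, b):
    """Return True if quantity a is larger than b after converting both to metres."""
    return a[0] * UNIT_SCALE_M[a[1]] > b[0] * UNIT_SCALE_M[b[1]]


def heuristic_compare(a, b, w_num=1.0, w_unit=1.0):
    """Toy linear surrogate over two cues, with no exact conversion step.

    num_cue:  raw difference between the numerals.
    unit_cue: difference between the log10 scales of the units.
    The weights are arbitrary assumptions, not fitted values.
    """
    num_cue = a[0] - b[0]
    unit_cue = math.log10(UNIT_SCALE_M[a[1]]) - math.log10(UNIT_SCALE_M[b[1]])
    return w_num * num_cue + w_unit * unit_cue > 0


# The pair from the abstract: 110 cm versus 1.2 m.
pair = ((110, "cm"), (1.2, "m"))
print(exact_compare(*pair), heuristic_compare(*pair))
```

With these arbitrary weights, the large numeral difference outweighs the unit cue, so the toy heuristic ranks 110 cm above 1.2 m while exact conversion does not. The two agree on pairs such as 5 km versus 3 m, where both cues point the same way.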

[NLP-2] Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

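[Quick read]: This paper addresses the reliance of reward models (RMs) in LLM post-training on heterogeneous evaluation criteria, including rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, with no unified mechanism for integrating these kinds of evidence. The key solution is the Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM offers a consistent interface that dynamically selects and aggregates evidence according to the requirements of each input, aiming for consistent and transparent evaluation across tasks. Experiments on reward benchmarks and downstream applications (best-of-N selection and reinforcement learning) show that Skill-RM consistently outperforms traditional judge baselines.

Link: https://arxiv.org/abs/2606.03980
Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang
Affiliations: Alibaba; Sun Yat-sen University; The Chinese University of Hong Kong; Peking University; ETH Zürich; University of Zurich
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: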

Abstract:Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at this https URL.

[NLP-3] Quantifying Faithful Confidence Expression in Large Reasoning Models

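[Quick read]: This paper addresses faithful calibration (FC), the alignment between a model's intrinsic confidence and the confidence it expresses in language, for large reasoning models (LRMs) that produce long chain-of-thought (CoT) traces. The core difficulty is that the prevailing way of measuring FC does not generalize well to long traces, which often lack clear step boundaries, have inconsistent step structure, and encode complex conditional dependencies, all of which complicate the estimation of intrinsic confidence. The paper introduces a framework that analyzes linguistic decisiveness relative to three sources of internal uncertainty (token probabilities, hidden states, and sampled response consistency) and uses prefix-conditioned sampling to control for conditional and structural variation across traces. The authors find that reasoning behavior does not automatically improve FC, and that prompt interventions designed for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators also give divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. The work positions FC as a distinct reliability and alignment target for LRMs, particularly in high-stakes contexts.

Link: https://arxiv.org/abs/2606.03969
Authors: Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan
Affiliations: Yale University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code: this https URL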

Abstract:Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)–the alignment between models’ intrinsic and (linguistically) expressed confidence–is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace–complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.
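
One of the three uncertainty sources named in this abstract, sampled response consistency, can be sketched in a few lines of Python. This is a minimal illustration and not the paper's estimator: the majority-vote agreement rate, the function names, and the idea of a decisiveness score in [0, 1] supplied by some external scorer are all assumptions.

```python
from collections import Counter


def consistency_confidence(sampled_answers):
    """Return the majority answer and the share of samples that agree with it."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def faithfulness_gap(decisiveness, intrinsic_confidence):
    """Absolute gap between expressed decisiveness and intrinsic confidence.

    `decisiveness` is assumed to be a score in [0, 1] from a hypothetical
    scorer of how assertive the model's wording is.
    """
    return abs(decisiveness - intrinsic_confidence)


# Four sampled final answers to the same question (made-up values).
answer, confidence = consistency_confidence(["A", "A", "B", "A"])
print(answer, confidence, faithfulness_gap(0.95, confidence))
```

A large gap here would correspond to wording that sounds more (or less) certain than the sampled answers justify, which is the kind of mismatch faithful calibration is concerned with.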

[NLP-4] QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

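[Quick read]: This paper addresses a structural bottleneck in rubric-based RL for tasks without verifiable rewards. Existing methods optimize rubrics while keeping the query distribution fixed, so rubric quality is constrained by query structure: open-ended queries yield vague rubrics, while naively narrowing them introduces fabricated references that no model can verify, so every response fails and training receives no reward signal. The paper proposes QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation turns the gap between the teacher and the policy into query-level criteria. Learnability filtering keeps only informative query-rubric pairs for GRPO (Group Relative Policy Optimization) training. QUBRIC gains 5.5 points on ArenaHard over the SFT baseline and, trained only on instruction-following data, an average of 6.3 points on three held-out benchmarks spanning legal, moral, and narrative reasoning, with improvements concentrated in reasoning-related dimensions. The results suggest that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.

Link: https://arxiv.org/abs/2606.03968
Authors: Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang
Affiliations: Amazon; Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: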

Abstract:Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
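
The abstract does not say how learnability filtering is computed, so the following Python sketch shows only one plausible reading. It assumes that a query-rubric pair is informative when sampled responses receive different rubric scores, because group-relative methods such as GRPO derive their training signal from reward differences within a group. The function names, the dictionary layout, and the `min_std` threshold are hypothetical.

```python
from statistics import pstdev


def is_learnable(rewards, min_std=1e-6):
    """Keep a pair only if its sampled rewards are not all (nearly) identical."""
    return pstdev(rewards) > min_std


def filter_pairs(pairs):
    """Return the query-rubric pairs whose reward group carries a signal."""
    return [pair for pair in pairs if is_learnable(pair["rewards"])]


pairs = [
    {"query": "q1", "rewards": [0.0, 0.0, 0.0, 0.0]},  # every response fails
    {"query": "q2", "rewards": [1.0, 1.0, 1.0, 1.0]},  # every response passes
    {"query": "q3", "rewards": [0.2, 0.8, 0.5, 0.9]},  # responses are separated
]
print([pair["query"] for pair in filter_pairs(pairs)])
```

Under this assumption, the "all responses fail" case described in the abstract is exactly the kind of pair the filter drops, since a group with identical rewards gives no preference between responses.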

[NLP-5] AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

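[Quick read]: This paper addresses the trade-off between latency and translation quality in simultaneous speech translation (SST) from English into German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. The key contribution is AlignAtt4LLM, which adapts the AlignAtt policy to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. It relies on four techniques: (1) an explicit source span in the prompt; (2) offline selection of translation-specific alignment heads; (3) selective qk-fast replay of the draft-to-source attention block; and (4) runtime query/key capture that keeps model outputs bit-identical. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for English to German and English to Italian, both in the low-latency regime around 2 seconds and in the high-latency regime below 4 seconds. Results for English to Chinese are more mixed. Because the method is not tied to Gemma-4, the same policy can be reapplied to stronger translation-focused decoder-only backbones for non-European target languages.

Link: https://arxiv.org/abs/2606.03967
Authors: Quentin Fuxa, Dominik Macháček
Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh; Google DeepMind
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IWSLT 2026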

Abstract:We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
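
The general AlignAtt stopping rule that this system builds on can be sketched in Python. This is an illustration of the rule as it is usually described (stop emitting once a draft token attends most strongly to the newest part of the source), not the authors' implementation. The function name `alignatt_emit`, the plain-list attention matrix, and the threshold `f` are assumptions, and the head selection and qk-fast replay steps from the abstract are not modeled.

```python
def alignatt_emit(draft_tokens, attention, source_len, f):
    """Return the prefix of draft tokens that is safe to emit.

    attention[i][j] is the (head-averaged) weight that draft token i places
    on source position j. Emission stops at the first token whose most
    attended source position lies within the last `f` source positions,
    since that part of the source may still change as more input arrives.
    """
    emitted = []
    for token, row in zip(draft_tokens, attention):
        most_attended = max(range(source_len), key=lambda j: row[j])
        if most_attended >= source_len - f:
            break  # token depends on unstable source context; wait for more input
        emitted.append(token)
    return emitted


draft = ["Das", "ist", "ein", "Test"]
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],  # attends mostly to the newest source position
    [0.6, 0.2, 0.1, 0.1],
]
print(alignatt_emit(draft, attention, source_len=4, f=1))
```

A larger `f` makes the policy more conservative (higher latency, more stable output). In a decoder-only model, the attention rows would have to come from the draft-to-source block of self-attention, which is the part the paper's four techniques address.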

[NLP-6] Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

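[Quick read]: This paper addresses the inefficient token use and limited inference-time control of chain-of-thought (CoT) reasoning in large language models. Existing efficient-reasoning methods control thinking length by shortening, early-stopping, or compressing traces, which leaves how the model thinks implicit. The paper proposes Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process in which a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and the remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This gives explicit, budget-aware strategy control while preserving the continuity of the reasoner's generation. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized with reinforcement learning using budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings and supports controllable accuracy-efficiency trade-offs across reasoners and tasks.

Link: https://arxiv.org/abs/2606.03965
Authors: Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: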

Abstract:Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner’s generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at this https URL.
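
The controller-reasoner loop described in this abstract can be sketched as follows. This is a structural illustration only: the `SteeringAction` fields mirror the abstract (a strategy plus a steering phrase), but the callable signatures, the `"conclude"` strategy name, and the toy controller and reasoner are assumptions standing in for the trained controller and the frozen LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SteeringAction:
    strategy: str  # e.g. "decompose", "verify", "conclude" (hypothetical labels)
    phrase: str    # text that opens the next reasoner step


Controller = Callable[[List[str], int], SteeringAction]
Reasoner = Callable[[str, List[str], str, int], Tuple[str, int]]


def run_steered_reasoning(question: str, controller: Controller,
                          reasoner: Reasoner, budget: int):
    """Alternate controller actions and reasoner steps until the budget is spent."""
    trace: List[str] = []
    remaining = budget
    while remaining > 0:
        # The controller sees the trace so far and the remaining token budget.
        action = controller(trace, remaining)
        # The frozen reasoner continues from the steering phrase.
        text, used = reasoner(question, trace, action.phrase, remaining)
        trace.append(action.phrase + text)
        remaining -= max(1, used)  # guard against zero-cost steps
        if action.strategy == "conclude":
            break
    return trace, max(0, remaining)


def toy_controller(trace, remaining):
    if remaining <= 10:
        return SteeringAction("conclude", "Therefore,")
    return SteeringAction("decompose", "First,")


def toy_reasoner(question, trace, phrase, max_tokens):
    return " ...", min(10, max_tokens)


trace, left = run_steered_reasoning("2+2?", toy_controller, toy_reasoner, budget=30)
print(len(trace), left)
```

In the paper's framing, the controller would be the learned policy of the Markov decision process, with the trace and remaining budget as its observation. Here it is a fixed rule so that the loop can run on its own.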

[NLP-7] Efficient ASR Training with Conversations that Never Happened

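[Quick read]: This paper addresses the scarcity of domain-matched, multi-speaker training data that limits conversational automatic speech recognition (ASR) for lower-resource languages and niche domains. It proposes an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to text-to-speech (TTS) voice profiles, and assembles the synthesized utterances into speaker-aware simulated conversations. Five LLM families are evaluated under single-generator, fixed-budget mixture, and scale-up settings, each with the same FastConformer-Large training recipe. On the Hungarian BEA-Dialogue benchmark, synthetic conversations consistently improve recognition performance, but generator choice and data composition strongly affect the gains. The largest configuration, using only 67 hours of real conversations and 636 hours of simulated data, outperforms a zero-shot model trained on 2700 hours of Hungarian speech. The findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora.

Link: https://arxiv.org/abs/2606.03957
Authors: Máté Gedeon, Péter Mihajlik
Affiliations: Budapest University of Technology and Economics; SpeechTex Ltd.; ELTE Research Centre for Linguistics
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: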

Abstract:Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

[NLP-8] A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

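[Quick read]: This paper addresses high-quality simultaneous speech translation under low computational requirements. The solution adds simultaneous translation capability to Canary, an offline direct speech-to-text translation model, using the state-of-the-art AlignAtt policy, and submits it to the IWSLT 2026 Simultaneous Speech Translation shared task for Czech to English and English to German and Italian. The model has only 1B parameters and supports 25 source and 25 target languages. In computationally unaware simulations, it outperforms similarly sized baselines in both low- and high-latency regimes.

Link: https://arxiv.org/abs/2606.03948
Authors: Aziz Sharipov Ortega, Dominik Macháček
Affiliations: Charles University, MFF, ÚFAL; University of Edinburgh
Categories: Computation and Language (cs.CL)
Comments: IWSLT 2026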

Abstract:We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines both in low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality – support of 25 source and 25 target languages.

[NLP-9] Value-Aware Stochastic KV Cache Eviction for Reasoning Models

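[Quick read]: This paper addresses the memory and compute bottleneck caused by the long outputs of reasoning models. KV cache eviction methods reduce this cost but are often less accurate than selection-based sparse attention, which keeps the full KV cache. The paper identifies two factors that are crucial to eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failures in which the model enters repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, the authors propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression reach higher average accuracy than the state-of-the-art selection method at the same sparsity and outperform the strongest eviction method by more than 4%. VaSE supports FlashAttention2 and enables a static memory footprint for reasoning models.

Link: https://arxiv.org/abs/2606.03928
Authors: Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia
Affiliations: University of Southern California; University of Chicago
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Codes: this https URL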

Abstract:Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
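
The two ingredients named in this abstract, protecting large-magnitude value states and making eviction stochastic, can be combined in a short Python sketch. This is not the VaSE algorithm itself, whose details the abstract does not give. The top-k protection rule, the exponential weighting of importance scores, the `temperature` parameter, and all names are assumptions for illustration.

```python
import math
import random


def select_evictions(importance, value_norms, n_evict, n_protect=1,
                     temperature=1.0, rng=None):
    """Pick cache positions to evict.

    importance[i]:  hypothetical importance score of position i (higher = keep).
    value_norms[i]: magnitude of the value state at position i.
    The `n_protect` positions with the largest value norms are never evicted.
    Remaining positions are sampled without replacement, with low-importance
    positions more likely to be chosen.
    """
    rng = rng or random.Random()
    n = len(importance)
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    protected = set(by_norm[:n_protect])
    candidates = [i for i in range(n) if i not in protected]
    evicted = []
    for _ in range(min(n_evict, len(candidates))):
        weights = [math.exp(-importance[i] / temperature) for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        evicted.append(pick)
        candidates.remove(pick)
    return evicted


importance = [0.9, 0.1, 0.2, 0.05, 0.8]
value_norms = [1.0, 1.2, 0.9, 25.0, 1.1]  # position 3 has an outlier value state
print(select_evictions(importance, value_norms, n_evict=2, rng=random.Random(0)))
```

Position 3 has the lowest importance score, so a purely deterministic, importance-only policy would evict it first. The protection step keeps it in the cache because of its large value norm, and the sampling step means repeated runs do not always evict the same positions.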

[NLP-10] Knowledge Editing in Masked Diffusion Language Models

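[Quick read]: This paper asks whether knowledge editing methods that follow the locate-then-edit approach, developed so far only on autoregressive models (ARMs), also work for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction. The authors transfer locate-then-edit to MDMs and compare two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. The location of an effective edit transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. The outcome does not transfer: single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs and not in the ARMs. The cause is that producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, the authors introduce a simple correction that optimizes the edit for these intermediate states, substantially restoring multi-token performance.

Link: https://arxiv.org/abs/2606.03924
Authors: Haewon Park, Yohan Jo
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: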

Abstract:Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

[NLP-11] Synthesize and Reward – Reinforcement Learning for Multi-Step Tool Use in Live Environments

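[Quick read]: This paper targets three coupled obstacles in training LLMs for multi-step tool calling: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state so the generated tool calls fail, and recall-based reinforcement learning (RL) rewards encourage verbose tool-calling patterns. The proposed framework, PROVE (Programmatic Rewards On Verified Environments), has three parts: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, which supports live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that uses dependency-graph-guided conversation simulation grounded in live-sampled server state to produce validated multi-turn tool-call trajectories, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward that combines graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus, with no external judge model. Four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) are trained with GRPO on about 13K examples using identical reward hyperparameters, with only the learning rate tuned per model family. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE improves results by up to +10.2, +6.8, and +6.5 points respectively, showing that a compact programmatic reward gives consistent gains across two model families.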

Link: https://arxiv.org/abs/2606.03892
Authors: Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi
Affiliations: IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server’s actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates validated multi-turn tool-call trajectories against these servers via dependency-graph-guided conversation simulation grounded in live-sampled server state, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward - graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus - requiring no external judge model. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using identical reward hyperparameters and ~13K training examples; only learning rate is tuned per model family from a three-point sweep. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that a compact programmatic reward yields consistent gains on multi-step tool orchestration across two model families.
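One of the reward components named above, the efficiency penalty with a complexity-scaled call budget, can be sketched roughly as follows. The abstract gives no formula, so everything here is an illustrative assumption rather than PROVE's actual reward: the function names, the `slack` multiplier, and the penalty constants are all hypothetical.

```python
import math


def call_budget(num_required_calls: int, slack: float = 1.5, min_budget: int = 1) -> int:
    """Hypothetical budget: tasks that need more calls get a larger allowance."""
    return max(min_budget, math.ceil(slack * num_required_calls))


def efficiency_penalty(
    num_calls_made: int,
    num_required_calls: int,
    per_call_penalty: float = 0.1,
    max_penalty: float = 0.5,
) -> float:
    """Penalize only the calls made beyond the budget, capped at max_penalty."""
    over_budget = max(0, num_calls_made - call_budget(num_required_calls))
    return min(max_penalty, per_call_penalty * over_budget)


# A task whose reference trajectory needs 4 calls gets a budget of 6 calls.
print(call_budget(4))
print(efficiency_penalty(num_calls_made=8, num_required_calls=4))
```

The point of scaling the budget with task complexity is that a long but necessary trajectory is not punished, while redundant calls on a simple task are.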

[NLP-12] RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

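[Quick read]: This paper addresses the gap between existing agent benchmarks and real developer-agent sessions: current benchmarks do not capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are hard to benchmark because they often depend on local execution environments, carry implicit or underspecified intent, and require nontrivial verification. RealClawBench is a live benchmark framework built from real OpenClaw sessions and rests on two mechanisms: reconstructed execution environments, which restore the original runtime context so a task can be rerun under controlled conditions, and deterministic verifiable scorers, which give automatic and repeatable scoring. The release contains 281 executable tasks sampled from a much larger pool of real sessions while preserving the source distribution (maximum final-vs-source Jensen-Shannon divergence of 0.0448). Across 14 contemporary models, the best system solves only 65.8% of tasks, leaving substantial headroom on realistic developer-agent workloads.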

Link: https://arxiv.org/abs/2606.03889
Authors: Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao
Affiliations: Peking University; Qiyuan Tech
Categories: Computation and Language (cs.CL)
Comments: 19 pages, 5 figures, 8 tables

Abstract:Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: this https URL.
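The Jensen-Shannon divergence used here to compare the sampled task distribution with its source can be illustrated with a minimal sketch. The task categories and counts below are hypothetical and are not taken from RealClawBench, and the logarithm base (2 here, which bounds the value to [0, 1]) is an assumption because the abstract does not state it.

```python
import math


def js_divergence(p_counts: dict, q_counts: dict, base: float = 2.0) -> float:
    """Jensen-Shannon divergence between two categorical distributions given as counts."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts.get(key, 0.0) / p_total
        q = q_counts.get(key, 0.0) / q_total
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log(p / m, base)
        if q > 0:
            divergence += 0.5 * q * math.log(q / m, base)
    return divergence


# Hypothetical category counts: a large source pool and a 281-task sample.
source_pool = {"debugging": 500, "refactoring": 300, "docs": 200}
sampled_tasks = {"debugging": 140, "refactoring": 85, "docs": 56}
print(js_divergence(source_pool, sampled_tasks))
```

A value near 0 means the sample keeps the category mix of the source; identical distributions give exactly 0 and fully disjoint ones give 1 in base 2.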

[NLP-13] Visual Instruction Tuning Aligns Modalities through Abstraction

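[Quick read]: This paper studies how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone in vision-language models. Its main finding is that visual instruction tuning acts as a bridge: it embeds visual features directly into the intermediate semantic layers of the LLM and bypasses the early layers devoted to unimodal processing. Probing analyses and causal interventions show that these intermediate layers are the semantic core of vision-language processing and are critical to performance on a broad set of multimodal benchmarks. Comparing the geometry of semantically equivalent visual and textual representations shows that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Restricting fine-tuning to the intermediate layers alone preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time, which supports the view that multimodal integration is a localized phenomenon driven by repurposing the LLM's internal abstraction engine.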

Link: https://arxiv.org/abs/2606.03871
Authors: Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga
Affiliations: Area Science Park, Trieste, Italy
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

[NLP-14] A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

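[Quick read]: This paper addresses three common weaknesses of multi-document summarization (MDS): difficulty capturing complex inter-document relationships, heavy reliance on large amounts of labeled data, and limited generalization across domains and languages. It proposes a training-free mixture-of-agents framework that combines the complementary strengths of large language models (LLMs) and knowledge graphs. Summarization is decomposed into three specialized agent tasks, none of which needs task-specific fine-tuning: extractive selection, knowledge-aware abstraction, and iterative refinement. Their outputs are unified by an LLM-guided multi-perspective consistency mechanism. Experiments on four datasets in English and Vietnamese show state-of-the-art or competitive performance.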

Link: https://arxiv.org/abs/2606.03867
Authors: Cuong Vuong Tuan, Trang Mai Xuan, Tien-Cuong Nguyen, Vu-Duc Ngo, Thien Van Luong
Affiliations: Phenikaa University; VNPT Group; MobiFone Corporation; National Economics University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by Neural Computing and Applications

Abstract:Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.

[NLP-15] Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models ACL2026

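[Quick read]: This paper addresses the problem that large language models (LLMs) often produce plausible but factually incorrect answers without explicit uncertainty estimates, which makes it hard for users to judge reliability. Existing uncertainty quantification methods mostly rely on indirect signals such as entropy across sampled generations; these are hard to interpret and do not fully use the model's ability to assess its own uncertainty. The proposed self-assessment method groups sampled generations into semantically distinct clusters, turns the clusters into answer options of a structured multiple-choice question, and uses the probability the LLM assigns to each option as the confidence estimate. Across multiple models and datasets the method consistently outperforms baselines, and it reaches competitive performance with as few as two additional samples.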

Link: https://arxiv.org/abs/2606.03846
Authors: Qi Cao, Takeshi Kojima, Andrew Gambardella, Helinyi Peng, Yutaka Matsuo, Yusuke Iwasawa
Affiliations: The University of Tokyo, Japan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Findings of ACL 2026

Abstract:Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model’s ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.
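The three steps described above (cluster the samples, pose them as a multiple-choice question, read off per-option probabilities) can be sketched as below. This is only an outline under stated assumptions: string normalization stands in for semantic clustering, the prompt wording is invented, and `option_logits` is a hypothetical list of scores that a real LLM would assign to the option letters. None of these details come from the paper.

```python
import math
from collections import OrderedDict


def cluster_answers(samples, normalize=lambda s: s.strip().lower().rstrip(".")):
    """Group sampled answers; exact match after normalization stands in for semantic clustering."""
    clusters = OrderedDict()
    for sample in samples:
        clusters.setdefault(normalize(sample), []).append(sample)
    return list(clusters.values())


def build_question(question, clusters):
    """Turn one representative per cluster into a lettered multiple-choice prompt."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Question: {question}", "Which answer is correct?"]
    for letter, members in zip(letters, clusters):
        lines.append(f"{letter}. {members[0]}")
    lines.append("Answer:")
    return "\n".join(lines)


def option_confidences(option_logits):
    """Softmax over the scores the model assigns to each option letter."""
    peak = max(option_logits)
    exps = [math.exp(x - peak) for x in option_logits]
    total = sum(exps)
    return [e / total for e in exps]


samples = ["Paris", "paris.", "Lyon"]
clusters = cluster_answers(samples)
print(build_question("What is the capital of France?", clusters))
# Hypothetical logits for options A and B; a real system would query the LLM here.
print(option_confidences([2.0, 0.5]))
```

The confidence attached to the final answer would then be the probability of the option that contains it.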

[NLP-16] SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

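[Quick read]: This paper addresses the fact that self-improving language agents are usually evaluated in isolation, where each agent refines its behavior from its own history only, even though agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. The question is when shared experience produces improvements that self-improvement alone cannot. SAGE (Social Agent Group Evolution) is an evaluation framework that compares two compute-matched conditions: SocialEvo, in which agents from five distinct model families co-evolve with access to all peers' histories, and SelfEvo, the conventional setting in which each agent gets the same number of task attempts but sees only its own past. SAGE is instantiated in three arenas (open-ended ML research, long-horizon economic planning, and strategic multiplayer play) and evaluated over multiple evolutionary rounds. Group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling, but agents that plateau under self-improvement can break through when peer experience is available. In competitive settings, counterfactual controls show that agents improve generally rather than developing opponent-specific strategies. Filtered peer traces and reflective summaries often outperform raw logs, so social gains depend on abstraction rather than exposure volume. Overall, peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.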

Link: https://arxiv.org/abs/2606.03544
Authors: Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai
Affiliations: Tsinghua University; Meituan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures

Abstract:Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers’ histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

[NLP-17] BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language ALT

[Quick read]: This paper addresses the absence of publicly available automatic speech recognition (ASR) resources for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan. It introduces BaltiVoice, a 16.8-hour read-speech corpus of 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small on the corpus reduces the word error rate (WER) on a held-out validation set of 538 utterances from a measured zero-shot baseline of 182.18% to 30.07%. The dataset, the fine-tuned model, and a live transcription demo are publicly available on HuggingFace.

Link: https://arxiv.org/abs/2606.03504
Authors: Muhammad Ali
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 4 figures, 4 tables. Code and data available at this https URL

Abstract:We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
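A zero-shot WER above 100% may look like an error, but it follows from how the metric is defined: WER divides the word-level edit distance (substitutions, deletions, insertions) by the number of reference words, so a hypothesis with many inserted words can exceed 1.0. The sketch below is a plain reference implementation with placeholder strings, not the paper's evaluation script, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,  # deletion
                    current_row[j - 1] + 1,  # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# Three inserted words against a two-word reference give a WER of 1.5 (150%).
print(word_error_rate("a b", "a x b y z"))
```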

[NLP-18] AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

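[Quick read]: This paper addresses the lack of a quantitative characterization of how large language models (LLMs) behave as AI raters across evaluation conditions, even though clinical AI evaluation increasingly delegates scoring to them. It runs a factorial study on adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a complex decision-making task operationalized as seven evaluation questions. Four open-source LLMs serve both as clinical decision support system (CDSS) models and as AI raters. Each CDSS output is scored under two protocols: the rubric-anchored Gold Rubric (GR) protocol, which includes a patient-specific rubric, and the rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models cross the protocol with five design factors: CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type. Under Non-GR, mean scores fall in a narrow and consistently higher range (74 to 78 points on average across questions); under GR, mean scores are 7.69 to 49.64 points lower and interquartile ranges are 1.68 to 3.67 times wider. Within each question, GR amplifies the rater's discrimination between DRG and Baseline outputs by factors of 1.76 to 5.10 and exposes substantial behavioral variation across rater models that Non-GR suppresses. The study therefore supports rubric anchoring as the protocol that preserves discriminative power; rubric-free scoring cannot substitute for it when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.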

Link: https://arxiv.org/abs/2606.03198
Authors: Sangwon Baek, Kyu Yeon Hur, Kyunga Kim
Affiliations: Asclep Korea Inc.; Samsung Medical Center; Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables

Abstract:Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors – CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type – and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74–78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater’s discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

[NLP-19] MemTrain: Self-Supervised Context Memory Training

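[Quick read]: This paper addresses how to strengthen the context-memory capability of long-horizon LLM agents. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks, but high-quality annotated problems for memory-intensive scenarios are costly to collect and the resulting data lack the diversity needed to cover general memory behaviors. MemTrain is a self-supervised training framework built on two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, in which the model must recover masked entities after multiple rounds of memory updates, encouraging memory maintenance from the final-outcome perspective; and (2) an intermediate memory recall objective, in which the model must reconstruct masked historical information from intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction. The two objectives are jointly optimized with GRPO (Group Relative Policy Optimization). On long-text QA and search-based QA benchmarks, MemTrain consistently improves memory-intensive reasoning across models, with gains of up to 17.67 points over direct task-specific post-training.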

Link: https://arxiv.org/abs/2606.03197
Authors: Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang
Affiliations: Peking University; Samsung Research
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.
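To make the training signal concrete, here is a rough sketch of how a masked-entity reward could feed GRPO's group-relative advantage. The reward definition (fraction of masked entities recovered by exact match) is a hypothetical stand-in, since the abstract does not specify MemTrain's reward. The advantage follows the usual GRPO normalization of each rollout's reward by the mean and standard deviation of its group; the use of the population standard deviation and the `eps` term are assumptions.

```python
import statistics


def reconstruction_reward(predicted_entities, gold_entities):
    """Hypothetical reward: fraction of masked entities recovered (case-insensitive exact match)."""
    if not gold_entities:
        raise ValueError("gold_entities must not be empty")
    hits = sum(
        1
        for predicted, gold in zip(predicted_entities, gold_entities)
        if predicted.strip().lower() == gold.strip().lower()
    )
    return hits / len(gold_entities)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four rollouts for the same masked passage; gold entities are illustrative only.
gold = ["Paris", "1889"]
rollouts = [["Paris", "1889"], ["Paris", "1900"], ["Lyon", "1900"], ["paris", "1889"]]
rewards = [reconstruction_reward(rollout, gold) for rollout in rollouts]
print(rewards)
print(group_relative_advantages(rewards))
```

Because the advantage is relative to the group, no learned value function or labeled answer set is needed, which fits a self-supervised setup over unlabeled text.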

[NLP-20] SenseJudge: Human-Centric Preference-Driven Judgment Framework ACL2026

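[Quick read]: This paper addresses two weaknesses of current LLM-as-a-judge practice: existing approaches rely on judge models trained on fixed preference data, which tend to overlook diverse user preferences, and they adapt poorly to real-world human-AI dialogue. It proposes SenseJudge, a customizable judgment framework driven by human preferences, and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. Both are applied to two tasks: LLMs as personalized judges, and model ranking. In the reported experiments, SenseJudge surpasses other judgment methods and models on the personalized-judge task and produces model rankings that align with human judgment. Analyses of position bias and consistency, together with ablation studies, support the robustness of the framework.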

Link: https://arxiv.org/abs/2606.03189
Authors: Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui
Affiliations: Peking University; StepFun; Xi’an Jiaotong University
Categories: Computation and Language (cs.CL)
Comments: ACL 2026 Findings

Abstract:Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We applied the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. We conducted extensive experiments, and the results demonstrate that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model ranking that aligns with real human sense. Additionally, we conducted analyses on position bias and consistency, alongside ablation studies, which affirmed the robustness of SenseJudge.

[NLP-21] GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

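[Quick read]: This paper addresses a scale mismatch in vision-language models (VLMs) for radiology: image-report pairs supply supervision only at the global level, while each finding occupies only a small region of the image, and prior approaches spread weight densely across all patches instead of concentrating on the sparse subset relevant to a given query. GLINT (Gated Language-Image alignmeNT) models this sparse correspondence explicitly with two components. Sparsely Gated Alignment uses a sigmoid gate over a separate gate embedding space to activate only the patches relevant to each textual query. Dense Feature Regularization anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to 2D chest X-ray (CXR) and 3D chest CT. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries and, to the authors' knowledge, is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. The largest gains appear on zero-shot grounding and segmentation, where sparse, query-specific localization is required.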

Link: https://arxiv.org/abs/2606.03180
Authors: Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder’s intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.

[NLP-22] HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift KDD2026

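[Quick read]: This paper addresses N-ary Structural Drift in sequential knowledge editing (KE) for large language models (LLMs): in non-stationary environments, updates to complex n-ary relations that have been reified into binary triples fracture relational atomicity. The result is Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever that is frequently misdiagnosed as parametric hallucination. HyperPatch is a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds, in three phases: (i) Structural Prior Initialization, which builds a topology-aware embedding space through contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, a dual-stage mechanism that uses SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without retraining the backbone; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On MQuAKE-CF and MQuAKE-T, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, and it remains more reliable under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.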

Link: https://arxiv.org/abs/2606.03179
Authors: Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang
Affiliations: National Yang Ming Chiao Tung University
Categories: Computation and Language (cs.CL)
Comments: Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Abstract:Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
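SimHash, which the abstract names as the basis of its Topological Alignment step, maps a set of tokens to a short fingerprint so that near-duplicate inputs land a small Hamming distance apart. The sketch below shows the generic technique only; how HyperPatch applies it to hyperedges is not described in the abstract, so the n-ary fact encoding, the 64-bit width, and the MD5 token hash are all illustrative assumptions.

```python
import hashlib


def simhash(tokens, bits: int = 64) -> int:
    """Generic SimHash: each token votes +1 or -1 on every bit of the fingerprint."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical n-ary fact encoded as role:value tokens, before and after an edit.
old_fact = ["subject:alice", "role:ceo", "org:acme", "start:2019", "location:berlin"]
new_fact = ["subject:alice", "role:ceo", "org:acme", "start:2019", "location:paris"]
print(hamming_distance(simhash(old_fact), simhash(new_fact)))
```

A small distance flags the incoming fact as a probable update to an existing event rather than a new one, which is the kind of cheap check that could precede a more expensive conflict-resolution step.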

[NLP-23] Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

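[Quick read]: This paper addresses lexical misalignment, the divergence between the language of digital chat assistants such as ChatGPT and human expectations, with a focus on word overuse in Scientific English and its origin. Prior work has described which divergences occur and linked them to the human preference learning stage of training, but it relies on manual curation. The paper introduces two curation-free, assumption-light metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of a shift can be attributed to human preference learning. Continuations of PubMed abstracts are generated with six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi) and measured using windowed document prevalence. Without manual intervention, the procedure identifies overused items such as 'suggest', 'additionally', and 'strategy' and estimates their link to preference learning. The findings replicate prior work and remain stable across parameter settings, random seeds, and further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages.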

Link: https://arxiv.org/abs/2606.03165
Authors: Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures, 10 tables

Abstract:The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as ‘suggest’, ‘additionally’, and ‘strategy’, and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

[NLP-24] A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

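[Quick read]: This paper addresses the difficulty of integrating tropical species data that share a commercial and regulatory life cycle but are scattered across kingdom-organised biodiversity infrastructures, with inconsistent naming and limited Chinese vernacular coverage. It describes a versioned cross-domain dataset that adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a four-level typology that excludes unverified machine-generated names; and a CITES source-linkage layer that connects each taxon to its Species+ entry. Chinese vernacular coverage reaches 99.50% (408,456 of 410,499 taxa); this figure measures completeness, not name-translation accuracy. Upstream content is referenced only by stable identifier, which supports CC-BY 4.0 reuse of the original-contribution layers.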

Link: https://arxiv.org/abs/2606.03156
Authors: Jeff Wang
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi: https://doi.org/10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi: https://doi.org/10.5281/zenodo.20424981)

Abstract:We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains – tropical_plants, tropical_aquatic, and tropical_pets – that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage – the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial – reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (https://doi.org/10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset’s current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.

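[NLP-25] FederatedSkill: Federated Learning for Agentic Skill Evolution

[Quick Read]: This paper addresses the problem that skill-library evolution in large language model (LLM) agents is limited by the insufficient diversity of a single user's task stream when handling complex tasks, while also overcoming the limitations of existing cross-user collaboration methods with respect to privacy leakage and client heterogeneity. The core solution is FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Its key innovation is to abandon raw trajectory sharing and instead use semantic skill diffs, i.e., structured patches that incrementally update local skill libraries, as the basic unit of communication. On the server side, these patches are aggregated to dynamically model client-specific capability boundaries, enabling strictly personalized skill evolution rather than a forced uniform global average. In evaluations across 20 distinct task families, FederatedSkill improves success rate by up to 44.4% and reduces computational cost by 37.5% relative to self-evolving baselines.

Link: https://arxiv.org/abs/2606.03143
Authors: Jingbo Yang,Guanyu Yao,Yang Zhang,Ramana Rao Kompella,Gaowen Liu,Shiyu Chang
Affiliations: UC Santa Barbara; MIT-IBM Watson AI Lab; Cisco Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: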

Abstract:Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.

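[NLP-26] PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations

[Quick Read]: This paper addresses the limitations of existing guardrails for large language models (LLMs) against multi-turn jailbreak attacks. Current defenses typically inspect the content of individual turns, whereas a jailbreak is a dynamic process that unfolds across the whole conversation, so traditional methods struggle to identify the threat. The paper therefore proposes shifting from content inspection to modeling conversational dynamics: conversations are treated as trajectories in embedding space, and the question is whether adversarial intent is already encoded in their early geometry. The core solution is the PsychoPass framework, which extracts geometric features of conversation trajectories in embedding space to predict potential attacks early. Experiments show that these features reach near-perfect detection performance with simple classifiers, but that this is driven mainly by the number of turns, a confounding factor. Once that confound is removed, a smaller but consistent geometric signal remains, and its classification performance does not depend meaningfully on the choice of encoder. More importantly, the signal appears early in the conversation: short prefixes alone yield above-chance detection, more reliably than baseline guardrails. The theoretical analysis further explains these findings through a decomposition of length and shape, a detection bound based on prefix length, and encoder invariance, showing that adversarial conversations leave an early, robust, and monitorable geometric fingerprint in representation space and offering a feasible path to online real-time monitoring.

Link: https://arxiv.org/abs/2606.03136
Authors: Muberra Ozmen,Subhabrata Majumdar
Affiliations: Coveo; Indian Institute of Management Bangalore
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: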

Abstract:Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics, modeling conversations as paths in representation space and asking whether adversarial intent is encoded early in their geometry. We introduce PsychoPass, a framework that extracts geometric features from conversation trajectories in embedding space to predict a potential attack before harmful content is produced. These features achieve near-perfect performance in naïve classifiers, which is largely explained by the inclusion of number of turns as a feature. After removing this confound, a smaller but consistent geometric signal remains, with classification performance that does not depend meaningfully on encoder choice. Crucially, this signal appears early in the conversation: attack outcomes remain above chance from short prefixes alone, more reliably than baseline guardrails. A supporting theoretical analysis explains these findings via a decomposition of length and shape, a detection bound based on prefix length, and encoder invariance. Together, these results show that adversarial conversations leave an early, representation-robust geometric fingerprint suitable for online monitoring.
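
As a rough illustration of what "geometric features of a conversation trajectory" could look like, the sketch below treats each turn as one embedding vector and computes a few simple path statistics, optionally on a short prefix of the conversation. The feature names (`path_length`, `displacement`, `straightness`) and both function names are hypothetical; the abstract does not specify the features PsychoPass actually uses, and the embeddings are assumed to come from any sentence encoder.

```python
import math


def _norm(vector):
    """Euclidean norm of a plain Python vector."""
    return math.sqrt(sum(x * x for x in vector))


def trajectory_features(embeddings):
    """Summarize a conversation as a path through embedding space.

    `embeddings` holds one vector per turn, in conversation order.
    These features are illustrative assumptions, not the paper's feature set.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two turn embeddings")
    # Step vectors between consecutive turns.
    steps = [
        [b_i - a_i for a_i, b_i in zip(a, b)]
        for a, b in zip(embeddings, embeddings[1:])
    ]
    path_length = sum(_norm(step) for step in steps)
    displacement = _norm(
        [end - start for start, end in zip(embeddings[0], embeddings[-1])]
    )
    # A ratio describes shape rather than raw length, which loosely mirrors
    # the length/shape split mentioned in the abstract.
    straightness = displacement / path_length if path_length > 0 else 0.0
    return {
        "num_turns": len(embeddings),
        "path_length": path_length,
        "displacement": displacement,
        "straightness": straightness,
    }


def prefix_features(embeddings, k):
    """Compute the same features from only the first k turns."""
    return trajectory_features(embeddings[:k])
```

Keeping `num_turns` as a separate field makes it easy to drop that feature when checking whether any signal survives without the turn-count confound.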

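[NLP-27] DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling

[Quick Read]: This paper addresses the limitation that existing large language model (LLM) approaches to Cognitive Behavioral Therapy (CBT) counseling reduce therapy to a local response generation problem. Such methods focus on empathetic replies within single-session, text-only, short interactions and fail to reflect the longitudinal, dynamically evolving, and multimodal nature of real psychotherapy. In response, the paper proposes DMT-CBT (Dynamic Modeling of evolving Therapeutic states in CBT), whose core is a structured therapeutic state representation that persists and evolves across sessions; by combining multimodal behavioral grounding (such as image-grounded client behaviors) with tool-augmented intervention, it supports dynamic modeling of therapeutic states and adaptive reasoning. The key innovation is a longitudinal mechanism for therapeutic state evolution that supports modeling delayed cross-session intervention effects under partial observability, simulating clinical CBT practice more realistically. Building on this framework, the authors construct DMTCorpus, a synthetic multi-session multimodal CBT dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experiments show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

Link: https://arxiv.org/abs/2606.03132
Authors: Chang Liu,Shuyi Zhang,Changsheng Ma,Yongfeng Tao,Minqiang Yang,Bin Hu
Affiliations: Lanzhou University
Categories: Computation and Language (cs.CL)
Comments: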

Abstract:Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.

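[NLP-28] Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation

[Quick Read]: This paper addresses the serious security challenges that smart contracts face in decentralized web services, in particular that existing LLM-based automated vulnerability detection methods fall short on severity assessment and actionable remediation advice and generally incur excessive computational overhead. The key to the solution is an efficient end-to-end smart contract security audit framework built on lightweight open-source LLMs (0.6B–4B parameters). It decouples the audit into four interconnected components (vulnerability detection, explanation generation, severity classification, and remediation recommendation) and introduces Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy that systematically screens and consolidates multiple draft responses from the model, maintaining high accuracy while substantially reducing compute. Experiments show that the framework reaches 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 on the explanation generation task, outperforming mainstream open-source dense code LLMs with 7B to 34B parameters. Ablation studies validate the advantage of the decoupled, step-by-step architecture over unified prompting and reveal a novel severity centrality bias, providing a key benchmark for future research on auditing with Generative AI.

Link: https://arxiv.org/abs/2606.03128
Authors: Bagus Rakadyanto Oktavianto Putra,Muhamad Risqi Utama Saputra,Widyawan,Guntur Dharma Putra
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 5 tables. Accepted to IEEE ICWS 2026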

Abstract:Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.

Information Retrieval

[IR-0] Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Link: https://arxiv.org/abs/2606.03866
Authors: Yuecheng Li,Zeyu Song,Jing Yao,Chi Lu,Peng Jiang,Kun Gai
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 8 pages, 2 figures

Abstract:Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM’s semantic space with the recommender’s ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. Deployed on Kuaishou’s advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.

[IR-1] Can LLM Rerankers Predict Their Own Ranking Performance?

Link: https://arxiv.org/abs/2606.03535
Authors: Shiyu Ni,Keping Bi,Jiafeng Guo,Jingtong Wu,Zengxin Han,Xueqi Cheng
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this paper, we study reranker-internal QPP: can an LLM reranker estimate the quality of the ranking it has just produced? We investigate both training-free and training-based approaches. For training-free estimation, we examine metric-specific self-consistency across sampled rankings and verbalized confidence produced directly by the reranker. Experiments on TREC Deep Learning 2019–2022 with four LLMs show that self-consistency is competitive with the state-of-the-art (SOTA) approach and better calibrated in almost all settings, while direct verbalized confidence is severely overconfident. To improve verbalized confidence, we propose two supervised methods, Verb-Num and Verb-List, which enable LLM rerankers to produce calibrated ranking-quality estimates with only a few additional output tokens.

[IR-2] Section-Weighted Hybrid Approach for Legal Case Retrieval

Link: https://arxiv.org/abs/2606.03138
Authors: Rajith Arulanandam,Nisansa de Silva
Categories: Information Retrieval (cs.IR)
Comments: 10 pages, 4 figures. Accepted to the International Conference on Natural Language Processing (ICNLP 2026)

Abstract:Finding truly analogous precedents requires capturing legal reasoning beyond surface word overlap. We present a two-stage, section-aware framework for legal case retrieval that first segments raw judgments into facts, issues, decision, and reasoning using a deterministic large language model (LLM) offline. In Stage 1, we combine parallel lexical (BM25) and semantic (dense ANN) whole-document searches via Reciprocal Rank Fusion (RRF) to form a high-recall candidate pool. In Stage 2, we perform fine-grained, like-for-like comparisons (e.g., query reasoning vs. candidate reasoning). To address the scale mismatch between unbounded lexical scores and cosine similarities, we apply query-wise Z-score normalization before aggregating signals with learned section weights. For the top results, the system returns the relevant section text with a concise, grounded rationale and party-stance labels. We evaluate on a jurisdiction-scale benchmark, demonstrating consistent gains over strong lexical and neural baselines while maintaining high candidate coverage.
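
The two scoring steps named in this abstract, Reciprocal Rank Fusion followed by query-wise Z-score normalization with section weights, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the input layout, the fixed weights, and the smoothing constant `k=60` (a commonly used RRF default) are all assumptions, and the paper learns its section weights rather than fixing them.

```python
from statistics import mean, pstdev


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; k=60 is an assumed default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def zscore(values):
    """Normalize one query's scores to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def rerank(candidates, section_scores, weights):
    """Combine per-section signals after query-wise normalization.

    section_scores[section][signal] is a list of raw scores aligned with
    `candidates`; `weights` maps each section to a (hypothetical) weight.
    """
    totals = [0.0] * len(candidates)
    for section, signals in section_scores.items():
        for raw_scores in signals.values():
            # Z-scoring puts unbounded lexical scores and cosine
            # similarities on a comparable scale before weighting.
            for i, z in enumerate(zscore(raw_scores)):
                totals[i] += weights[section] * z
    order = sorted(range(len(candidates)), key=lambda i: -totals[i])
    return [candidates[i] for i in order]
```

For example, fusing the lexical list `["c1", "c2", "c3"]` with the dense list `["c3", "c1", "c4"]` ranks `c1` first, because it sits near the top of both lists, and keeps `c4` in the pool even though only one retriever found it.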

Human-Computer Interaction

[HC-0] DiffUNet2: Bidirectional Prediction Probabilistic Generation and Collaborative Visual Discovery for Scientific Data

Link: https://arxiv.org/abs/2606.03926
Authors: Mengdi Chu,Jiaxin Yang,Angus G. Forbes,Nathan Debardeleben,Earl Lawrence,Ayan Biswas,Han-Wei Shen
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 12 pages, 20 figures

Abstract:Modeling temporal evolution is important to analyzing and reasoning about scientific phenomena, yet most machine learning methods provide deterministic forward predictions that overlook multiple plausible outcomes and rarely support backward reasoning, limiting their usefulness in practical scientific workflows. We present a framework that integrates diffusion-based generative modeling with interactive visual analytics for scientific exploration. We introduce DiffUNet^2, a conditional diffusion model that enables bidirectional, any-to-any generation across time and captures distributions of plausible system evolutions. Built upon the model, our interactive system supports branching timeline exploration, user-guided state editing, and probability-space navigation, enabling scientists to actively explore alternative hypotheses rather than passively observe predictions. We evaluate the model on 5 datasets across different scientific domains to validate its predictive accuracy and probability-space ensemble quality. In collaboration with domain experts, we demonstrate the effectiveness of our approach in supporting practical scientific temporal data analysis workflows. By integrating modeling and visual interaction, our approach enables scientists to interactively explore system dynamics, transforming generative models into tools for hypothesis-driven scientific analysis.

[HC-1] The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

Link: https://arxiv.org/abs/2606.03907
Authors: Jai Lal Lulla,Matthias Galster,Jie M. Zhang,Sebastian Baltes,Christoph Treude
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 14 pages, 1 table. Accepted at the 20th International Symposium on Empirical Software Engineering and Measurement (ESEM 2026), Registered Reports track

Abstract:Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

[HC-2] CLI-Anything: Towards Agent-Native Computer Use

Link: https://arxiv.org/abs/2606.03854
Authors: Yuhao Yang,Tianyu Fan,Chao Huang
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:As large language models advance in reasoning and tool use capabilities, researchers increasingly seek to leverage them for computer use agents that can interact with existing software. The dominant approach develops GUI agents that control applications through visual interfaces: interpreting screenshots, locating UI elements, and executing mouse clicks to mimic human interaction. This GUI-centric paradigm fundamentally misaligns with agent capabilities. Current GUI agents struggle with brittle pixel-level interactions, timing dependencies, and coordinate-based actions that break with interface changes. They force agents to emulate human perceptual limitations rather than leverage their computational strengths in structured data processing and programmatic control. CLI-Anything argues for agent-native computer use design. Instead of forcing agents to navigate visual layouts, we create interfaces aligned with how agents naturally operate: through structured commands, explicit state representations, and deterministic feedback. We transform existing applications into command-line harnesses that preserve functionality while exposing machine-readable protocols optimized for AI-native interaction. This eliminates the lossy visual-to-computational translation that plagues GUI agents. Rather than building sophisticated screen readers and click simulators, we should redesign interaction paradigms around agent strengths: precise programmatic control and deterministic execution. We examine the methodology, architecture, evidence, and future directions for this agent-native transformation of computer use. We have built CLI-Hub as a comprehensive platform that operationalizes this agent-native computer use vision. The platform provides methodology, architecture, and infrastructure for this fundamental transformation of computer use.

[HC-3] Formalizing all indexed mathematics as a benchmark for general reasoning with the example of implementing dilatations of categories

Link: https://arxiv.org/abs/2606.03835
Authors: A. Mayeux
Categories: Databases (cs.DB); Human-Computer Interaction (cs.HC); Category Theory (math.CT)
Comments: Accepted for publication in Lecture Notes in Networks and Systems (Springer)

Abstract:Formal rigor distinguishes mathematics from other disciplines, in the sense that mathematical statements are derived from explicit axioms by logically verifiable steps. Interactive theorem provers support this by expressing definitions, theorems, and proofs in a fully formal language and verifying them mechanically. We consider the benchmark problem of formalizing all published mathematics as a machine verifiable and continuously updated corpus of mathematical knowledge. This viewpoint treats mathematics as a structured database of interdependent results and raises questions about scalability and organization of large formal libraries. As a case study, we present an ongoing formalization in categorical algebra, namely dilatations of categories, extending classical localizations and illustrating what such an implementation looks like in practice.

[HC-4] The Attention-Aware Pipeline: Design Tensions from Making Attention Visible in XR

Link: https://arxiv.org/abs/2606.03492
Authors: Arvind Srinivasan,Niklas Elmqvist
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Where people look during shared activity carries coordination cues that speech and gesture cannot replace, but these patterns remain invisible to participants. XR headsets make gaze available as real-time input, yet few systems feed it back visually. We frame our work using the Attention-Aware Pipeline (Capture, Record, Revisualize), whose feedback loop means the system's visual response alters what users attend to next, triggering further responses. This generates design tensions whose form depends on each stage's configuration. We trace the pipeline through three systems casting attention as a mirror (reflecting gaze history), a medium (sharing it across collaborators), and a mediator (intervening through diminished reality). Each encountered a tension the loop predicted, motivating the next. A formative eye-tracking study of four musicians surfaced attentional tunneling and near-total disconnection, confirming the need for intervention. We present these tensions and a next step: testing whether subtractive intervention reduces tunneling for a single sight-reader.

[HC-5] Analyzing Visual Attention Patterns During Band Rehearsal with Mobile Eye Tracking

Link: https://arxiv.org/abs/2606.03485
Authors: Arvind Srinivasan,Tobias Rau,Michael Sedlmair
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Visual attention is central to ensemble coordination, yet how musicians allocate gaze during naturalistic rehearsal remains poorly understood. We present a pilot study using mobile eye tracking to examine gaze behaviour in a four-member band across three songs, each practiced twice. Musicians wore Pupil Labs Neon eye trackers, and YOLOv8-assisted scene annotations mapped fixations to ensemble members and objects in view. Analyzing fixation matrices, transition matrices, temporal scarf plots, and dwell-transition correlations, we uncover a hub-and-spoke attention topology: the session leader was the dominant gaze target for all members, while the learning guitarist concentrated up to 97% of interpersonal dwell on this single reference. Between attempts, gaze transitions decreased by up to 65% on average for unfamiliar material (up to 82% for individual participants) as scanning stabilized. Scarf plots reveal how teaching breakdowns fragment attention and uninterrupted runs consolidate it. Post-session participant reflections align with the quantitative patterns, and we discuss implications for gaze-aware tools in ensemble pedagogy.
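
The transition analysis described in this abstract can be illustrated with a small sketch that counts gaze transitions between areas of interest (AOIs) and compares two attempts at the same song. The AOI labels, function names, and sample sequences below are hypothetical; the study's own pipeline (Pupil Labs Neon recordings with YOLOv8-assisted annotation) is not reproduced here.

```python
from collections import Counter


def transition_counts(aoi_sequence):
    """Count gaze shifts between consecutive, different AOIs.

    `aoi_sequence` is the ordered list of fixation targets for one
    musician; repeated fixations on the same target are not transitions.
    """
    counts = Counter()
    for source, target in zip(aoi_sequence, aoi_sequence[1:]):
        if source != target:
            counts[(source, target)] += 1
    return counts


def relative_change(before, after):
    """Fractional change from `before` to `after` (negative = decrease)."""
    return (after - before) / before


# Hypothetical fixation targets for one musician across two attempts.
attempt_1 = ["leader", "sheet", "leader", "drummer", "leader", "sheet"]
attempt_2 = ["leader", "leader", "sheet", "sheet", "leader"]

total_1 = sum(transition_counts(attempt_1).values())
total_2 = sum(transition_counts(attempt_2).values())
change = relative_change(total_1, total_2)
```

The keys of the returned counter are `(source, target)` pairs, so the same structure can be reshaped into a transition matrix with one row and one column per AOI.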

[HC-6] Focused on the User Overlooking the Risks: Security and Privacy Understandings Practices and Challenges of Independent Chinese AI Agent Developers

Link: https://arxiv.org/abs/2606.03190
Authors: Shuning Zhang,Mingyao Xu,Zhixin Huang,Yutong Jiang,Rongjun Ma,Yuting Yang,Xin Yi,Kanye Ye Wang,Hewu Li
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:The proliferation of AI agents empowers independent developers, defined as individuals or small groups who self-initiate projects rather than fulfill client-based contracts, to create sophisticated autonomous systems, but also introduces novel security and privacy (SP) challenges beyond traditional corporate structures. We conducted an interview study (N=28) with Chinese developers, whose extensive use of global LLM services offers valuable insights into this population. We investigate their SP understandings, practices, and challenges in the AI agent products they develop. We found that independent developers frequently think and act from their users' perspective. They focused on user-facing safety risks such as harmful content while exhibiting low awareness of security vulnerabilities. Consequently, developers rely almost exclusively on ad-hoc, manually crafted safeguards and informal communication, with an absence of formal tools or processes for SP practices. We found these actions are driven by various inhibitors, primarily a lack of formal training in SP-related skills, of accessible security tools, and of actionable guidance from platforms. Our work contributes the first exploration of independent AI agent developers' SP understanding, outlining opportunities for tailored security tooling.

[HC-7] Pulse Focus: Validation of the Focus Performance Score as a Behavioral Signal for Human Attentional State Modeling Toward Attention-Aware AI

Link: https://arxiv.org/abs/2606.03164
Authors: Yisak Debele,Israel Goytom,Anwar Misbah
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Artificial intelligence systems that model and support human cognition require reliable measures of cognitive state. We present the Focus Performance Score (FPS) from the Pulse Focus mobile Stroop application and evaluate whether it measures attentional control during color-word conflict resolution. We conduct behavioral, neural, and formula validation analyses. Behavioral results (N=466, 111,133 trials) show that FPS captures the Stroop interference effect, tracks individual differences in attentional control, and demonstrates strong test-retest reliability. Neural validation using the DMCC55B fMRI dataset (N=55) shows that the primary FPS component, mean incongruent reaction time, is significantly associated with anterior cingulate cortex activation, a key neural substrate of conflict monitoring. Formula validation identifies and resolves structural redundancy within the scoring framework and provides convergent support for the weighting design. Together, these findings establish FPS as a behaviorally valid, reliable, and neurally grounded measure of attentional control. FPS provides a defensible behavioral signal for evaluating human attentional state and supports future work on attention-aware human-AI interaction and physiological state modeling.
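
For readers unfamiliar with the Stroop interference effect mentioned here, the sketch below computes it from trial-level data as the mean reaction time on incongruent trials minus the mean on congruent trials. The FPS formula itself is not given in the abstract and is not reproduced; the trial layout, the field names, and the choice to keep only correct trials are assumptions made for illustration.

```python
from statistics import mean


def stroop_summary(trials):
    """Summarize Stroop trials given as (condition, rt_ms, correct) tuples.

    Only correct trials are used, which is a common convention and an
    assumption here rather than a detail taken from the paper.
    """
    trials = list(trials)
    congruent = [rt for cond, rt, ok in trials if cond == "congruent" and ok]
    incongruent = [rt for cond, rt, ok in trials if cond == "incongruent" and ok]
    mean_congruent = mean(congruent)
    # The abstract names mean incongruent reaction time as the primary
    # FPS component, so it is reported on its own as well.
    mean_incongruent = mean(incongruent)
    return {
        "mean_congruent_ms": mean_congruent,
        "mean_incongruent_ms": mean_incongruent,
        "interference_ms": mean_incongruent - mean_congruent,
    }


# Hypothetical trials; the 900 ms error trial is excluded from the means.
example = stroop_summary([
    ("congruent", 500, True),
    ("congruent", 520, True),
    ("incongruent", 640, True),
    ("incongruent", 660, True),
    ("incongruent", 900, False),
])
```

A positive `interference_ms` indicates slower responses when the word and ink color conflict, which is the classic Stroop pattern.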

Computer Vision

[CV-0] SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Link: https://arxiv.org/abs/2606.03994
Authors: Inhee Lee,Sangwon Baik,Sungjoo Kim,Hyeonwoo Kim,Hyunsoo Cha,Hanbyul Joo
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project Page: this https URL

Abstract:Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene’s utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

[CV-1] Exploring Easy Boosts for Lidar Semantic Scene Completion ICIP2026

Link: https://arxiv.org/abs/2606.03992
Authors: Tetiana Martyniuk,Jonathan Seele,Alexandre Boulch,Gilles Puy,Renaud Marlet,Raoul de Charette
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted to ICIP 2026

Abstract:This paper investigates “free lunch” strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesigns. We first demonstrate that endowing input point clouds with semantic pseudo-labels from off-the-shelf segmentors significantly improves the performance of existing architectures. By evaluating these models against an oracle, we establish that high-quality semantic priors are a primary driver of mIoU gains. Furthermore, we equip the input lidar scan with visibility information that distinguishes between empty and unknown spaces, which provides a secondary performance boost across the tested architectures. Using these simple enhancements, we observe that older models remain competitive with state-of-the-art systems, and can even outperform them. Our code is available at this https URL.

[CV-2] PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

Link: https://arxiv.org/abs/2606.03989
Authors: Shinjeong Kim,Ignacio Alzugaray,Callum Rhodes,Paul H. J. Kelly,Andrew J. Davison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: this https URL

[CV-3] NewtPhys: Do Foundation Models Understand Newtonian Physics?

Link: https://arxiv.org/abs/2606.03986
Authors: Sebastian Cavada,Soumava Paul,Tuan-Hung Vu,Andrei Bursuc,Raoul de Charette
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps – including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry – bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at this https URL.

[CV-4] Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking CVPR2026

Link: https://arxiv.org/abs/2606.03985
Authors: Zekun Qi,Xuchuan Chen,Dairu Liu,Chenghuai Lin,Yunrui Lian,Sikai Liang,Zhikai Zhang,Yu Guan,Jilong Wang,Wenyao Zhang,Xinqiang Yu,He Wang,Li Yi
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026

Abstract:We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

[CV-5] Formalizing the Binding Problem ICML2026

Link: https://arxiv.org/abs/2606.03976
Authors: Lianghuan Huang,Yihao Li,Saeed Salehi,Yingshan Chang,Ansh Soni,Konrad P. Kording
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: Accepted to ICML 2026

Abstract:Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

[CV-6] AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation ICML2026

Link: https://arxiv.org/abs/2606.03972
Authors: Haobo Li,Yanhong Zeng,Yunhong Lu,Jiapeng Zhu,Hao Ouyang,Qiuyu Wang,Ka Leong Cheng,Yujun Shen,Zhipeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026. Project page: this https URL

Abstract:We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
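
The generator/discriminator asymmetry described in this abstract can be pictured as two attention masks over video frames. The sketch below is only an illustration under assumptions: the function names are hypothetical, and masking at frame granularity is a simplification rather than the authors' implementation.

```python
def causal_mask(num_frames):
    """Lower-triangular mask: frame i may attend only to frames 0..i."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def bidirectional_mask(num_frames):
    """Full mask: every frame may attend to every other frame."""
    return [[True] * num_frames for _ in range(num_frames)]


def visible_frames(mask, i):
    """Indices of the frames that frame i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Hypothetical 4-frame clip.
generator_mask = causal_mask(4)             # keeps autoregressive sampling possible
discriminator_mask = bidirectional_mask(4)  # sees the whole clip at once

print(visible_frames(generator_mask, 1))
print(visible_frames(discriminator_mask, 1))
```

With the causal mask, frame 1 sees only frames 0 and 1, whereas the bidirectional mask exposes all four frames, which is what would let a discriminator judge long-range drift over the full sequence.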

[CV-7] Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

Link: https://arxiv.org/abs/2606.03971
Authors: Yonghao Yu,Lang Huang,Runyi Li,Zerun Wang,Toshihiko Yamasaki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: this https URL.

[CV-8] VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

Link: https://arxiv.org/abs/2606.03954
Authors: Hanjiang Hu,Yiyuan Pan,Jiaxing Li,Xusheng Luo,Alexander Robey,Na Li,Yebin Wang,Changliu Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 18 pages, 5 tables, 5 figures

Abstract:As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount – physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at this https URL.

[CV-9] Demo2Tutorial: From Human Experience to Multimodal Software Tutorials CVPR2026

Link: https://arxiv.org/abs/2606.03951
Authors: Zechen Bai,Zhiheng Chen,Yiqi Lin,Kevin Qinghong Lin,Difei Gao,Xiangwu Guo,Xin Wang,Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at this https URL.

[CV-10] Adaptive Causal Alignment for High-Confidence Adversarial Training

Link: https://arxiv.org/abs/2606.03925
Authors: Zhiming Luo,Kejia Zhang,Yingxin Lai,Junwei Wu,Juanjuan Weng,Shaozi Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

[CV-11] GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Link: https://arxiv.org/abs/2606.03921
Authors: Jiahao Sun,Dingkun Wei,Zehong Shen,Hongyu Zhou,Yujun Shen,Liang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.
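
The gravity-alignment step mentioned in this abstract amounts to finding a rotation that maps an estimated gravity direction onto a fixed "down" axis, which removes most of the arbitrary global rotation (a spin about the gravity axis remains free). A minimal sketch using Rodrigues' formula follows; the function names, the choice of -z as "down", and the sample gravity vector are all assumptions for illustration, and how the paper actually estimates gravity is not shown here.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rotation_aligning(src, dst):
    """Rotation matrix R such that R @ unit(src) == unit(dst)."""
    a, b = normalize(src), normalize(dst)
    c = sum(x * y for x, y in zip(a, b))  # cosine of the angle between a and b
    identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    if c < -1.0 + 1e-9:
        # Antiparallel case: rotate 180 degrees about any axis perpendicular to a.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - identity[i][j] for j in range(3)]
                for i in range(3)]
    v = cross(a, b)
    k = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]  # skew-symmetric matrix of v
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    # Rodrigues' formula: R = I + K + K^2 / (1 + c)
    return [[identity[i][j] + k[i][j] + k2[i][j] / (1.0 + c) for j in range(3)]
            for i in range(3)]


estimated_gravity = [0.10, -0.20, -0.97]  # made-up estimate in the reconstruction frame
world_down = [0.0, 0.0, -1.0]             # assumed convention: gravity points along -z
R = rotation_aligning(estimated_gravity, world_down)
aligned = matvec(R, normalize(estimated_gravity))
print([round(x, 6) for x in aligned])
```

Applying `R` to every reconstructed point and camera pose would place the scene in a frame where the ground plane is horizontal, which is the property a physics simulator typically expects.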

[CV-12] Benchmarking Visual State Tracking in Multimodal Video Understanding

Link: https://arxiv.org/abs/2606.03920
Authors: Sihyun Yu,Nanye Ma,Pinzhi Huang,Hyunseok Lee,Shusheng Yang,June Suk Choi,Ellis Brown,Oscar Michel,Boyang Zheng,Jinwoo Shin,Saining Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Website: this https URL

Abstract:Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs’ thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

[CV-13] PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion

Link: https://arxiv.org/abs/2606.03915
Authors: Qingdong Xu,Jiajun Zhu,Shilin Zhu,Xinjing He,Chao Lu,Huanran Wang,Jiyao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 5 tables

Abstract:We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

[CV-14] Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching ICML2026

Link: https://arxiv.org/abs/2606.03911
Authors: Yoad Tewel,Yuval Atzmon,Gal Chechik,Lior Wolf
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026. Project page is at this https URL

Abstract:Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model’s knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

[CV-15] SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

Link: https://arxiv.org/abs/2606.03909
Authors: Qingpo Wuwu,Xiaobao Wei,Peng Chen,Nan Huang,Zhongyu Zhao,Hao Wang,Ming Lu,Ningning Ma,Shanghang Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on the Waymo and nuScenes demonstrate that SparseStreet achieves up to 80% compression ratio with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: this https URL.

[CV-16] MAdam: Metric-Aware Multi-Objective Adam

Link: https://arxiv.org/abs/2606.03904
Authors: Fengbei Liu,Rachit Saluja,Sunwoo Kwak,Ruibo Wang,Ruining Deng,Heejong Kim,Johannes C. Paetzold,Mert R. Sabuncu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma and Ba, 2015). We show this coupling introduces two systematic gaps between the solver’s intent and the optimizer’s execution. The first is a weighting mismatch: Adam’s second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam’s adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam’s second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.

[CV-17] An Attention-Based Denoising Model for Diffusion Weighted Imaging

Link: https://arxiv.org/abs/2606.03903
Authors: Prithviraj Verma,Pawan Kumar,Chandan Deshani,Prasun Chandra Tripathi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
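
To make the "signal-dependent Rician noise" and the PSNR figure in this abstract concrete, the sketch below simulates magnitude reconstruction on a toy 1D signal and scores it. It is not the paper's pipeline: the function names are hypothetical, and treating the noise level as a standard deviation expressed as a fraction of the peak intensity is an assumption.

```python
import math
import random


def add_rician_noise(signal, sigma, rng):
    """Magnitude of a complex signal with Gaussian noise in both channels.

    Taking the magnitude is what turns zero-mean Gaussian noise into
    Rician noise, whose bias depends on the underlying signal level.
    """
    noisy = []
    for s in signal:
        real = s + rng.gauss(0.0, sigma)
        imag = rng.gauss(0.0, sigma)
        noisy.append(math.hypot(real, imag))
    return noisy


def psnr(reference, estimate, peak):
    """Peak signal-to-noise ratio in dB for two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
sigma = 0.10                   # assumed: 10% of a peak intensity of 1.0
background = [0.0] * 20000     # empty region
tissue = [1.0] * 20000         # bright region

noisy_background = add_rician_noise(background, sigma, rng)
noisy_tissue = add_rician_noise(tissue, sigma, rng)

# The background is pushed well above zero, while the bright region stays near 1.0.
print(sum(noisy_background) / len(noisy_background))
print(sum(noisy_tissue) / len(noisy_tissue))
print(psnr(tissue, noisy_tissue, peak=1.0))
```

The background mean lands near sigma * sqrt(pi / 2) rather than zero, which is the signal-dependent bias that makes this noise harder to remove than additive Gaussian noise.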

[CV-18] Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration

Link: https://arxiv.org/abs/2606.03893
Authors: Roman Flepp,Arend Nieuwland,Bastian Sigrist,Philipp Fürnstahl,Lilian Calvet,Thomas Dreher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Will be published in the International Journal of Computer Assisted Radiology and Surgery

Abstract:Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring 30 and 6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ((3.05 ± 0.75)° vs. (6.32 ± 2.36)°, p = 0.031), assuming the same minimal surgical exposure for both. No EMT-guided trials exceeded the 5° clinical threshold, whereas free-hand produced 4 outliers of 6 trials. The system achieved statistical equivalence (±2°, ±2 mm) to PSI for total angular (p ≤ 0.02) and total translational (p = 0.048) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

[CV-19] OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

Link: https://arxiv.org/abs/2606.03890
Authors: Yifei Li,Pengyiang Liu,Yuhang Zang,Zhongyue Shi,Qi Fu,Hongye Hao,Jiwen Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 48 pages, 12 figures, 15 tables. Project page: this https URL

Abstract:Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
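
The streaming protocol in this abstract (a query timestamp, an evidence interval, and a model that sees only the prefix before the query) can be sketched as a small data structure. Everything below is hypothetical: the field names, the strict "before the query" cutoff, and the sample values are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class StreamingQuestion:
    text: str
    query_time: float      # seconds into the video at which the question is asked
    evidence_start: float  # start of the interval containing the needed evidence
    evidence_end: float    # end of that interval


def visible_prefix(frame_times, question):
    """Frame timestamps the model may see: only those before the query time."""
    return [t for t in frame_times if t < question.query_time]


def evidence_is_in_past(question):
    """True if the whole evidence interval ends before the question is asked."""
    return question.evidence_end <= question.query_time


frame_times = [float(t) for t in range(0, 60, 5)]  # one frame every 5 s (made up)
question = StreamingQuestion(
    text="Which room did the camera pass through before the kitchen?",
    query_time=32.0,
    evidence_start=10.0,
    evidence_end=18.0,
)

print(visible_prefix(frame_times, question))
print(evidence_is_in_past(question))
```

Separating the evidence interval from the query time is what makes it possible to ask about things that are no longer in the current view, as the abstract emphasizes.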

[CV-20] CoralBay: A Self-Supervised CT Foundation Model

Link: https://arxiv.org/abs/2606.03888
Authors: Ioannis Gatopoulos,Nicolas Känzig,Sebastian Otálora,Fei Tang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

[CV-21] Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

Link: https://arxiv.org/abs/2606.03879
Authors: Wei Ding,Yudong Zhang,Ruobing Xie,Xingwu Sun,Jiansheng Chen,Yu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder’s contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
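
The Capacity and Necessity axes defined in this abstract are simple to compute once every subset has a score. The sketch below uses three hypothetical encoders with made-up scores (the paper uses five real encoders and 16 benchmarks); only the two definitions and the subset count come from the abstract.

```python
from itertools import combinations


def count_nonempty_subsets(n):
    """Number of non-empty subsets of n encoders."""
    return sum(1 for r in range(1, n + 1) for _ in combinations(range(n), r))


def capacity(scores, encoder):
    """Score the encoder reaches on its own."""
    return scores[frozenset([encoder])]


def necessity(scores, pool, encoder):
    """Drop in score when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


# Hypothetical encoders and made-up benchmark averages, for illustration only.
pool = ["enc_a", "enc_b", "enc_c"]
scores = {
    frozenset(["enc_a"]): 50.0,
    frozenset(["enc_b"]): 48.0,
    frozenset(["enc_c"]): 40.0,
    frozenset(["enc_a", "enc_b"]): 51.0,
    frozenset(["enc_a", "enc_c"]): 55.0,
    frozenset(["enc_b", "enc_c"]): 52.0,
    frozenset(["enc_a", "enc_b", "enc_c"]): 55.5,
}

print(count_nonempty_subsets(5))  # matches the 31 subsets of five encoders
for enc in pool:
    print(enc, capacity(scores, enc), necessity(scores, pool, enc))
```

In these made-up numbers, `enc_b` has high Capacity but low Necessity, while `enc_c` is weak alone yet costly to remove. That is the kind of mismatch between the two axes the abstract describes, and it is why pairing the two highest-Capacity encoders need not be the best choice.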

[CV-22] MLP Splatting: Object-Centric Neural Fields

Link: https://arxiv.org/abs/2606.03877
Authors: Shinjeong Kim,Yuzhou Cheng,Xin Kong,Paul H. J. Kelly,Andrew J. Davison
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but lack the ability to easily decompose scene elements into a few primitives, requiring additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are supervised using RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instant segmentation. Compared to state-of-the-art methods, we achieve substantially lower memory usage (1/15×) and faster rendering (3×), as we show in our experiments compared to semantic 3DGS methods. Project Page: this https URL

[CV-23] Seg2Track: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Link: https://arxiv.org/abs/2606.03875
Authors: Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
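
The abstract states that track existence is validated with a Bernoulli filter. The sketch below is a simplified binary Bayes update of an existence probability, not the paper's PTV module. The function name `update_existence` and the parameters `p_survive`, `p_detect`, and `p_false` are hypothetical values chosen for illustration.

```python
def update_existence(r, detected, p_survive=0.95, p_detect=0.9, p_false=0.1):
    """One predict-and-update step for a track's existence probability r.

    Assumed model (not from the paper): a real track survives each frame with
    probability p_survive, is detected with probability p_detect, and a
    non-existent track still yields a detection with probability p_false.
    """
    r_pred = p_survive * r  # prediction step
    if detected:
        like_exists, like_absent = p_detect, p_false
    else:
        like_exists, like_absent = 1.0 - p_detect, 1.0 - p_false
    numerator = like_exists * r_pred
    return numerator / (numerator + like_absent * (1.0 - r_pred))


# Two confirmations followed by three misses: the probability rises, then decays.
r = 0.5
for detected in [True, True, False, False, False]:
    r = update_existence(r, detected)
```

A tracker could drop a track once `r` falls below a threshold, which is one way ghost tracks get suppressed in filters of this kind.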

[CV-24] DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Link: https://arxiv.org/abs/2606.03874
Authors: Koki Nagano, Hongyu Liu, Seonwook Park, Tianye Li, Amrita Mazumdar, Christian Jacobsen, Shengze Wang, Michael Stengel, Rajarshi Roy, Ka Chun Cheung, Simon See, Shalini De Mello
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

[CV-25] Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

Link: https://arxiv.org/abs/2606.03868
Authors: Dingrui Wang, YuAn Wang, Jinkun Liu, Yue Zhang, Mattia Piccinini, Yu Sun, Johannes Betz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures

Abstract:Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

[CV-26] Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

Link: https://arxiv.org/abs/2606.03837
Authors: Luc P.J. Sträter, Hazel Doughty
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT and probe. In this work we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused and spatially dense settings, with a particular focus on scenarios with limited data where parameter-efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

[CV-27] Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis MICCAI

Link: https://arxiv.org/abs/2606.03827
Authors: Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Early accepted at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

Abstract:In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
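
The abstract attributes the near-zero cycle closure error to parameterising motion with a truncated Fourier series. The sketch below illustrates why such a parameterisation is periodic by construction. It uses a single scalar latent and hypothetical coefficients; `fourier_trajectory` is a name introduced for this example and is not taken from the paper.

```python
import math


def fourier_trajectory(a0, coeffs, t, period=1.0):
    """Evaluate a truncated Fourier series at time t.

    coeffs is a list of (a_k, b_k) pairs for harmonics k = 1..K.
    """
    value = a0
    for k, (a_k, b_k) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * k * t / period
        value += a_k * math.cos(angle) + b_k * math.sin(angle)
    return value


# Hypothetical coefficients for two harmonics of one latent dimension.
coeffs = [(0.5, -0.2), (0.1, 0.3)]
start = fourier_trajectory(1.0, coeffs, 0.0)
end = fourier_trajectory(1.0, coeffs, 1.0)
closure_error = abs(end - start)  # zero up to floating-point rounding
```

Because every harmonic returns to its starting value after one period, any choice of coefficients yields a closed cycle, so a generative model over the coefficients does not have to learn periodicity separately.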

[CV-28] Attend to Anything: Foundation Model for Unified Human Attention Modeling ICML2026

Link: https://arxiv.org/abs/2606.03540
Authors: Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026

Abstract:Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker–Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at this https URL.

[CV-29] Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding ICME2026

Link: https://arxiv.org/abs/2606.03539
Authors: Haoxuan Chen, Xianqin Liu, Jian-Fang Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2026

Abstract:Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
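
The geometric property the abstract relies on can be checked on a toy example: if W n = 0, then W (x + n) = W x. The matrix, vectors, and the helper `matvec` below are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the paper.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Hypothetical frozen weights of a linear layer (2 outputs, 3 inputs).
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]

# A vector in the null-space of W, i.e. matvec(W, n) is the zero vector.
n = [1.0, 1.0, -1.0]

x = [2.0, 3.0, 5.0]                        # layer input
x_shifted = [a + b for a, b in zip(x, n)]  # input plus a null-space residual

out_plain = matvec(W, x)
out_shifted = matvec(W, x_shifted)         # identical to out_plain
```

A residual confined to the null-space is therefore invisible to the frozen layer, while a residual with a component outside it changes the output. That is the distinction the abstract draws between HQ and LQ inputs.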

[CV-30] EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Link: https://arxiv.org/abs/2606.03509
Authors: Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

[CV-31] Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

Link: https://arxiv.org/abs/2606.03508
Authors: Peitong Wang, Nuo Wang, Enxin Qin, Chengjin Yu, Hanyu Xuan, Yuanting Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. 38 pages, 12 figures, 6 tables

Abstract:Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse convolutional masked pretraining scheme to exploit unlabeled PCB images, where structure-guided mixed masking is used to construct informative masked inputs. The sparse convolutional reconstruction pipeline suppresses invalid responses from masked regions and enables the detector backbone to infer missing PCB structures from visible conductive patterns, thereby learning PCB structural priors. In the fine-tuning stage, the pretrained backbone is transferred to the downstream defect detection task. For the task, a spatial continuity regularization term is introduced during fine-tuning. This term constrains dispersed positive predictions assigned to the same defect instance and promotes more compact localization on elongated defect regions. Experiments on the DsPCBSD+ dataset show that the proposed method achieves 85.5% mAP0.5 and 52.3% mAP0.5:0.95, outperforming several strong baseline detectors. Ablation studies and qualitative results further confirm the effectiveness of the proposed framework for robust PCB defect detection in industrial AOI scenarios.

[CV-32] AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization CVPR2026

Link: https://arxiv.org/abs/2606.03506
Authors: Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: CVPR 2026 Findings. 16 pages, including supplementary material

Abstract:Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user’s body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user’s physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: this https URL

[CV-33] Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

Link: https://arxiv.org/abs/2606.03499
Authors: Quoc-Anh Bui-Huynh, Thanh Duc Ngo, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang, Ngai-Man Cheung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection perspective, a key challenge and opportunity arise from the multi-stage nature of the 3DGS reconstruction pipeline, which produces heterogeneous intermediate representations. Forensic signals for detecting poisoning are inherently stage dependent: an attack introduced at one stage may produce signals that emerge only at later stages. This motivates a stage-wise view of detectability that goes beyond single-stage evaluation. We introduce Poison-3DGS, a benchmark for stage-wise characterization of poisoning detection in 3DGS. It exposes stage-specific artifacts, including multi-view images, geometry, training dynamics, and Gaussian parameters, across a diverse set of scenes and attacks. Using it, we conduct a systematic study of detectability across pipeline stages. Our analysis reveals several insights. First, detectability varies significantly across stages, and no single stage consistently dominates across attack types. Second, different attacks exhibit distinct stage-specific forensic signals, so detection effectiveness depends critically on where signals are observed. Third, later-stage signals such as training dynamics and Gaussian parameter statistics provide strong cues not observable at earlier stages. Overall, our work provides a principled benchmark and the first systematic characterization of stage-dependent detectability in 3DGS, offering a foundation for future research on robust and reliable 3DGS systems.

[CV-34] Low-Frequency Shortcuts in Texture-Driven Visual Learning

Link: https://arxiv.org/abs/2606.03493
Authors: Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however, are texture-driven. In this work, we present a shortcut learning analysis for texture-driven domains and compare it with that of a standard benchmark. We show that texture-driven domains suffer from low-frequency shortcuts. Models make the majority of their decisions based on a few low-frequency components (LFCs) with a skewed spectral behavior, even though the classification information lies in higher-frequency, fine-grained details. Pruning LFCs from training and test sets eliminates the shortcut and provides a more balanced spectral behavior, improving the ID accuracy by up to 8%. We show that low-frequency shortcuts make the models highly vulnerable to OOD corruptions, leading to accuracy drops of up to 70% compared to the ID accuracy. Pruning LFCs significantly improves robustness to low-frequency corruptions, by up to 40%, and introduces a trade-off for high-frequency corruptions: the balanced spectral behavior provides better generalization performance, whereas the increased dependence on high-frequency features reduces it. OOD accuracy depends on the interaction between these two factors.
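
To make "pruning low-frequency components" concrete, the sketch below removes the lowest frequencies from a 1D signal with a naive discrete Fourier transform. It is a toy under stated assumptions: the paper works on images, and the abstract does not give the exact pruning rule, so the `cutoff` parameter and the function names here are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def prune_low_frequencies(x, cutoff):
    """Zero every component whose frequency index is <= cutoff."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        # Index k and index n - k describe the same frequency magnitude.
        if min(k, n - k) <= cutoff:
            X[k] = 0
    return idft(X)


# A constant offset (lowest frequency) plus a fast alternation (highest frequency).
signal = [5.0 + (-1.0) ** t for t in range(8)]
pruned = prune_low_frequencies(signal, cutoff=0)  # only the alternation remains
```

After pruning, a model trained on such inputs can no longer rely on the constant offset and has to use the fine-grained alternation, which mirrors the effect the abstract describes.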

[CV-35] rAction: Action Recognition with Sparse Trajectories

Link: https://arxiv.org/abs/2606.03490
Authors: Jan F. Meier, Felix B. Mueller, Alexander Ecker, Timo Lüddecke
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: this https URL

[CV-36] PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting CVPR

Link: https://arxiv.org/abs/2606.03479
Authors: Adrian Ramlal, John S. Zelek
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction

Abstract:Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose PersistGS, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by +2.46 dB PSNR and comes within 0.19 dB of a ground-truth trajectory upper bound.
ACM classes: I.4.8; I.3.7; I.2.9
Journal reference: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696

[CV-37] Mixed-Modality Dual Face-Hair Retrieval

Link: https://arxiv.org/abs/2606.03470
Authors: Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes – identity and hairstyle – originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.

[CV-38] Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Link: https://arxiv.org/abs/2606.03201
Authors: Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux, He Liu, Shujian Yu, Kevin Sebastian Luck, Vincent François-Lavet
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent’s appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: this https URL

[CV-39] Inference-Time Scaling for Joint Audio-Video Generation

Link: https://arxiv.org/abs/2606.03183
Authors: Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted by Transactions on Machine Learning Research (TMLR). Project page: this https URL

Abstract:Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: this https URL.

[CV-40] Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Link: https://arxiv.org/abs/2606.03175
Authors: Xunyi Zhao, Sihao Lin, Gengze Zhou, Zerui Li, Shijie Li, Wei Tao, Jiajun Liu, Qi Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an underspecified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived this http URL, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
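
The cost-sensitive selection rule described in the abstract can be sketched as follows: compute the expected entropy reduction of each question over the candidate instances, then ask only when that reduction exceeds the question's cost. The gain-over-cost ranking, the threshold, and all names and numbers below are assumptions made for illustration; the paper's actual criterion and cost values are not given in the abstract.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_gain(prior, partition):
    """Expected entropy reduction from a question.

    prior: dict mapping candidate -> probability.
    partition: list of candidate groups, one group per possible answer.
    """
    h_after = 0.0
    for group in partition:
        mass = sum(prior[c] for c in group)
        if mass > 0:
            h_after += mass * entropy([prior[c] / mass for c in group])
    return entropy(prior.values()) - h_after


def choose_question(prior, questions):
    """Return the question with the best gain-to-cost ratio, or None.

    questions: dict mapping name -> (partition, cost), with cost > 0.
    A question is only eligible if its expected gain exceeds its cost.
    """
    best_name, best_ratio = None, 0.0
    for name, (partition, cost) in questions.items():
        gain = expected_gain(prior, partition)
        if gain > cost and gain / cost > best_ratio:
            best_name, best_ratio = name, gain / cost
    return best_name


# Four equally likely instances and two hypothetical question types.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {
    "color": ([["A", "B"], ["C", "D"]], 0.5),      # cheap clarification
    "route": ([["A"], ["B"], ["C"], ["D"]], 3.0),  # expensive route-level guidance
}
choice = choose_question(prior, questions)
```

In this toy setup the route question would remove all uncertainty, but its cost exceeds its gain, so the cheaper color question is selected. Once the agent is already certain, no question is worth asking and the function returns `None`.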

[CV-41] JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

Link: https://arxiv.org/abs/2606.03168
Authors: Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Equal contributions from first two authors. Project page: this https URL. Code: this https URL. Dataset: this https URL

Abstract:While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

[CV-42] SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

Link: https://arxiv.org/abs/2606.03160
Authors: Qiuxia Wu, Jiarui Lan, Wenxiong Kang, Zhiyong Wang, Kun Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology

Abstract:Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.
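
To make the idea of splitting features into low- and high-frequency components along the temporal axis concrete, here is a small sketch. The abstract does not say which wavelet SRENet uses; a one-level, unnormalized Haar split (pairwise averages and half-differences) is assumed here purely for illustration, and the function name and toy features are hypothetical.

```python
def haar_temporal(frames):
    """One-level Haar split of a per-frame feature sequence.

    frames: list of equal-length feature vectors, one per frame (even count).
    Returns (low, high): pairwise averages capture slow, global motion;
    pairwise half-differences capture fast, fine-grained change.
    """
    if len(frames) % 2:
        raise ValueError("expected an even number of frames")
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) / 2 for x, y in zip(a, b)])
        high.append([(x - y) / 2 for x, y in zip(a, b)])
    return low, high

# Toy sequence of four frames with two features each.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 0.0], [5.0, 4.0]]
low, high = haar_temporal(frames)

# The split is lossless: each frame pair is recovered as low + high and low - high.
first = [l + h for l, h in zip(low[0], high[0])]
second = [l - h for l, h in zip(low[0], high[0])]
```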

[CV-43] NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

Link: https://arxiv.org/abs/2606.03159
Authors: NVIDIA:Aarti Basant,Amlan Kar,Despoina Paschalidou,Fangyin Wei,Francesco Ferroni,Guillermo Garcia Cobo,Haithem Turki,Huan Ling,Jaewoo Seo,James Lucas,Jay Zhangjie Wu,Jialiang Wang,Jonathan Lorraine,Jun Gao,Kai He,Katarina Tothova,Kevin Xie,Michał Tyszkiewicz,Qi Wu,Riccardo de Lutio,Ruilong Li,Sanja Fidler,Seung Wook Kim,Tianchang Shen,Tianshi Cao,Tobias Pfaff,William Lew,Xindi Wu,Xuanchi Ren,Yifan Lu,Yuxuan Zhang,Zan Gojcic,Zian Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

[CV-44] A2: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Link: https://arxiv.org/abs/2606.03148
Authors: Sreehari Rammohan,Huy Ha,Carl Vondrick
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose A^2, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. A^2 uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, A^2 is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
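
The "where to look" half of this recipe, cropping around an attention peak of the small model, might look roughly like the sketch below. It handles a single peak on a square image and patch grid; the function name, the crop size, and the clamping behaviour are assumptions for illustration rather than details taken from the paper.

```python
def crop_box_at_peak(attn, crop, image_size):
    """Pixel box (left, top, right, bottom) of side `crop`, centred on the
    patch with the highest attention and clamped to stay inside the image.

    attn: square 2D list of per-patch attention from the small model.
    Assumes a square image and crop <= image_size.
    """
    rows, cols = len(attn), len(attn[0])
    r, c = max(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda ij: attn[ij[0]][ij[1]],
    )
    patch = image_size / rows  # pixel size of one patch
    cx, cy = (c + 0.5) * patch, (r + 0.5) * patch  # peak centre in pixels
    left = int(min(max(cx - crop / 2, 0), image_size - crop))
    top = int(min(max(cy - crop / 2, 0), image_size - crop))
    return (left, top, left + crop, top + crop)

# Toy 4x4 attention map whose peak sits in the top-right patch.
attn = [
    [0.01, 0.02, 0.05, 0.60],
    [0.01, 0.02, 0.05, 0.10],
    [0.01, 0.01, 0.02, 0.05],
    [0.01, 0.01, 0.01, 0.02],
]
box = crop_box_at_peak(attn, crop=112, image_size=224)
# The crop at `box` would then be embedded by the larger model.
```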

[CV-45] Disentangling Visual and Factual Correctness in LVLMs Visualization Literacy

Link: https://arxiv.org/abs/2606.03142
Authors: Soohyun Lee,Jaeyoung Kim,Seokhyeon Park,Sihyeon Lee,Jiwon Song,Bohyoung Kim,Hyunjoo Song,Jinwook Seo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures

Abstract:Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: this https URL

Artificial Intelligence

[AI-0] Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Link: https://arxiv.org/abs/2606.03988
Authors: Mahtab Bigverdi,Lindsey Li,Weikai Huang,Yiming Liu,Jaemin Cho,Jieyu Zhang,Tuhin Kundu,Chris Dangjoo Kim,Zelun Luo,Linda Shapiro,Ranjay Krishna
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.

[AI-1] Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

Link: https://arxiv.org/abs/2606.03979
Authors: Ali Behrouz,Farnoosh Hashemi,Vahab Mirrokni
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: A version of this work has been publicly available from September 2025 on OpenReview

Abstract:The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a “Sleep” paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with a “Dreaming” process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

[AI-2] Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Link: https://arxiv.org/abs/2606.03963
Authors: Roohan Ahmed Khan,Yasheerah Yaqoot,Muhammad Ahsan Mustafa,Dzmitry Tsetserukou
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.

[AI-3] Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Link: https://arxiv.org/abs/2606.03962
Authors: Anthony GX-Chen,Ankit Anand,Gheorghe Comanici,Zaheer Abbas,Eser Aygün,David Smalling,Shibl Mourad,Doina Precup,André Barreto,Mark Rowland
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Core contributors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, André Barreto, Mark Rowland

Abstract:Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known, as is the case with ambiguous preferences or imperfect reward models, committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
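
A toy bandit helps show why reward uncertainty can make a diverse set of actions the rational choice. The abstract does not specify the non-linear set objective; the sketch below assumes one simple stand-in (the expected reward of the best action in the set) and uses made-up reward numbers, so it illustrates the intuition rather than the paper's formulation.

```python
from itertools import combinations_with_replacement

# Two equally likely reward hypotheses over three actions (illustrative values).
reward_fns = [
    {"a": 1.0, "b": 0.0, "c": 0.6},
    {"a": 0.0, "b": 1.0, "c": 0.6},
]

def expected_best(action_set):
    """Expected reward, over reward hypotheses, of the best action in the set."""
    return sum(max(r[x] for x in action_set) for r in reward_fns) / len(reward_fns)

# Best single action under the expected scalar reward.
best_single = max("abc", key=lambda x: expected_best((x,)))

# Best pair of actions (repeats allowed) under the set objective.
best_pair = max(combinations_with_replacement("abc", 2), key=expected_best)
```

Here the safe action "c" is the best single choice, yet the best pair is the diverse set ("a", "b"): each member covers one reward hypothesis, so repeating any one action scores lower.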

[AI-4] FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

Link: https://arxiv.org/abs/2606.03939
Authors: Mubarak A. Ojewale,Adriana E. Chis,Jorge M. Cortes-Mendoza,Bernardo Pulido-Gaytan,Horacio Gonzalez-Velez
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:

Abstract:Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client’s distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
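
The contrast between monotonically accumulated and temporally decayed label counts can be illustrated in a few lines. The exponential decay form, the decay factor of 0.5, and the toy class names are assumptions; the abstract only states that the counts are temporally decayed.

```python
def update_counts(counts, new_labels, decay=0.5):
    """Exponentially decay the old per-class counts, then add this round's labels."""
    updated = {c: decay * n for c, n in counts.items()}
    for label in new_labels:
        updated[label] = updated.get(label, 0.0) + 1.0
    return updated

# A client that historically saw 100 cats now sees only dogs, 10 per round.
decayed = {"cat": 100.0, "dog": 0.0}
accumulated = dict(decayed)
for _ in range(3):
    decayed = update_counts(decayed, ["dog"] * 10)
    accumulated = update_counts(accumulated, ["dog"] * 10, decay=1.0)
```

After three rounds the decayed counts already favour the current class, while the accumulated counts (decay of 1.0) still describe the outdated class balance.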

[AI-5] q0: Primitives for Hyper-Epoch Pretraining

Link: https://arxiv.org/abs/2606.03938
Authors: Bishwas Mandal,Shmuel Berman,Akshay Vegesna,Samip Dahal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6× fewer), or ~67 epochs (~3.8× fewer) when matched to the baseline’s ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9× data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
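
The epoch-reduction factors quoted in this abstract follow directly from the stated epoch counts, as this short check shows. The helper name is hypothetical, and because the abstract gives the epoch counts only approximately, the ratios are rounded to one decimal place.

```python
def epoch_reduction(baseline_epochs, q0_epochs):
    """How many times fewer epochs q0 needs than the baseline."""
    return baseline_epochs / q0_epochs

print(round(epoch_reduction(256, 56), 1))  # 4.6
print(round(epoch_reduction(256, 67), 1))  # 3.8
```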

[AI-6] Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Link: https://arxiv.org/abs/2606.03937
Authors: Senjie Jin,Peixin Wang,Boyang Liu,Xiaoran Fan,Shuo Li,Zhiheng Xi,Jiazheng Zhang,Yuhao Zhou,Tao Gui,Qi Zhang,Xuanjing Huang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO’s leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.
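
The multiplicative coupling of visual sensitivity and token entropy can be pictured with a small selection routine. How VEPO measures visual sensitivity and how it redirects gradient credit are not given in the abstract; the top-fraction masking, the `keep_ratio` parameter, and the toy scores below are assumptions for illustration only.

```python
def select_tokens(entropy, visual_sensitivity, keep_ratio=0.5):
    """Indices of the tokens with the highest entropy x visual-sensitivity score."""
    scores = [h * v for h, v in zip(entropy, visual_sensitivity)]
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

# Token 1 has low entropy but depends strongly on the image.
entropy = [2.0, 0.3, 1.5, 0.1]
visual_sensitivity = [0.1, 0.9, 0.8, 0.2]

coupled = select_tokens(entropy, visual_sensitivity)
entropy_only = select_tokens(entropy, [1.0] * len(entropy))
```

With entropy alone, token 1 is dropped in favour of the high-entropy but visually insensitive token 0; the coupled score keeps it.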

[AI-7] FFR: Forward-Forward Learning for Regression

Link: https://arxiv.org/abs/2606.03927
Authors: Xinyang Liu,Xuanyu Liang,Shiqi Ding,Boyang Li,Zhiqiang Que,Jiayang Li,Guosheng Hu
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural “opposites” for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show FFR recovers on average 98.6% of BP’s accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP’s at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP’s, and substantially outperforms all BP-free competitors.

[AI-8] Hedge-Bench: Benchmarking Agents on Hard Realistic Tasks Pertaining to Financial Reasoning

Link: https://arxiv.org/abs/2606.03918
Authors: Eric Cho,Shawn Huang,Alice Lu,Andy Lyu
Categories: Artificial Intelligence (cs.AI)
Comments: Dataset and evaluation harness available at this http URL

Abstract:AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert Analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at this http URL.

[AI-9] NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

Link: https://arxiv.org/abs/2606.03910
Authors: Mubarak Adetunji Ojewale
Categories: Performance (cs.PF); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
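
A per-request greedy selection that consults a network cost oracle could be sketched as below. This is not NetKV's actual scoring function: the cost here is simply estimated queueing delay plus KV transfer time, prefix-cache locality is left out, and the instance records, bandwidth figures, and oracle signature are all hypothetical.

```python
def pick_decode_instance(instances, kv_bytes, transfer_seconds):
    """One O(|D|) pass over decode instances: pick the one that minimises
    estimated queueing delay plus KV-cache transfer time.

    transfer_seconds(instance_id, kv_bytes) stands in for the network cost oracle.
    """
    best = min(
        instances,
        key=lambda d: d["queue_s"] + transfer_seconds(d["id"], kv_bytes),
    )
    return best["id"]

# Hypothetical cluster: a busy instance in the same rack, an idle one across pods.
instances = [
    {"id": "rack-local", "queue_s": 0.030},
    {"id": "cross-pod", "queue_s": 0.010},
]
bandwidth = {"rack-local": 50e9, "cross-pod": 5e9}  # bytes per second

def oracle(instance_id, kv_bytes):
    """Toy oracle: transfer time from a fixed per-path bandwidth."""
    return kv_bytes / bandwidth[instance_id]

short_context = pick_decode_instance(instances, 1e7, oracle)
long_context = pick_decode_instance(instances, 1e9, oracle)
```

For a small KV cache the idle remote instance wins, but as the cache grows the transfer term dominates and the nearby instance becomes the better choice, which matches the abstract's point that the network term matters more as context length grows.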

[AI-10] scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

Link: https://arxiv.org/abs/2606.03906
Authors: Jiabei Cheng,Jingbo Zhou,Jun Xia,Changkai Li,Zhen Lei,Chang Yu,Stan Z. Li
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at this https URL.

[AI-11] Agent libOS: A Library-OS-Inspired Runtime for Long-Running Capability-Controlled LLM Agents

Link: https://arxiv.org/abs/2606.03895
Authors: Yingqi Zhang
Categories: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 14 pages, 1 figure, 2 tables

Abstract:Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is tools are libc-like wrappers; runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary. 
ACM classes: D.4.6; D.4.7; I.2.11
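
The design rule that tools are thin wrappers while runtime primitives are the authority boundary can be sketched as follows. None of these names come from the Agent libOS prototype: `Runtime`, `fs_read`, `CapabilityError`, and the capability tuples are hypothetical stand-ins used only to show where the check and the audit record would sit.

```python
class CapabilityError(PermissionError):
    """Raised when a process invokes a primitive without the needed capability."""

class Runtime:
    def __init__(self):
        self.audit = []  # (pid, primitive, argument, allowed) records

    def fs_read(self, proc, path):
        """Primitive: authority is checked and audited here, not in the tool."""
        allowed = ("fs.read", path) in proc["caps"]
        self.audit.append((proc["pid"], "fs.read", path, allowed))
        if not allowed:
            raise CapabilityError(f"pid {proc['pid']} lacks fs.read on {path}")
        return f"<contents of {path}>"  # stand-in for real file I/O

def read_file_tool(runtime, proc, path):
    """Tool: a libc-like wrapper that holds no authority of its own."""
    return runtime.fs_read(proc, path)

runtime = Runtime()
proc = {"pid": 1, "caps": {("fs.read", "/work/notes.txt")}}
granted = read_file_tool(runtime, proc, "/work/notes.txt")
try:
    read_file_tool(runtime, proc, "/etc/passwd")
    denied = False
except CapabilityError:
    denied = True
```

Because the check lives in the primitive, a newly generated or buggy tool cannot widen a process's authority, and both the granted and the denied attempt end up in the audit log.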

[AI-12] Reasoning Structure of Large Language Models ICLR2026 ICML2026

Link: https://arxiv.org/abs/2606.03883
Authors: Frédéric Berdoz,Luca A. Lanzendörfer,Fabian Farestam,Roger Wattenhofer
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning

Abstract:Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model’s logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

[AI-13] PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

Link: https://arxiv.org/abs/2606.03858
Authors: Zetian Ouyang,Linlin Wang,Gerard de Melo,Liang He
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs’ performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs’ numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.

[AI-14] FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement

Link: https://arxiv.org/abs/2606.03852
Authors: Yinsheng Yao,Hongxiang Zhang,Weixi Tong,Tianyi Zhang
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level, which is not sufficient to inform the model where to fix the bug. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline with an absolute improvement from 1.72% to 7.42%. Furthermore, searching over 10 candidates yields an average improvement of 8.50% compared with no candidate search. When evaluated in isolation, our lightweight diagnostic model achieves the best performance compared with recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.
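
The top-k candidate search described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `top_k_lines`, `search_candidates`, `toy_refine`, and `toy_run_tests` are hypothetical, the suspiciousness scores are hand-written rather than predicted by a diagnostic model, and the "repair" is a trivial string replacement standing in for an LLM refinement call.

```python
from typing import Callable, List, Sequence, Tuple

Code = List[str]


def top_k_lines(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k most suspicious lines, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def search_candidates(
    code: Code,
    scores: Sequence[float],
    k: int,
    refine: Callable[[Code, int], Code],
    run_tests: Callable[[Code], int],
) -> Tuple[Code, int]:
    """Try one repair per suspicious line; keep the candidate passing most tests."""
    best_code, best_passed = list(code), run_tests(code)
    for line_no in top_k_lines(scores, k):
        candidate = refine(code, line_no)
        passed = run_tests(candidate)
        if passed > best_passed:
            best_code, best_passed = candidate, passed
    return best_code, best_passed


def toy_refine(code: Code, line_no: int) -> Code:
    """Hypothetical stand-in for an LLM repair focused on one line."""
    fixed = list(code)
    fixed[line_no] = fixed[line_no].replace("-", "+")
    return fixed


def toy_run_tests(code: Code) -> int:
    """Hypothetical stand-in for a test harness: count passing cases."""
    namespace: dict = {}
    try:
        exec("\n".join(code), namespace)
        add = namespace["add"]
        cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
        return sum(1 for args, want in cases if add(*args) == want)
    except Exception:
        return 0


buggy = ["def add(a, b):", "    return a - b"]
fixed, passed = search_candidates(buggy, [0.1, 0.9], 2, toy_refine, toy_run_tests)
```

Selecting by execution outcome means that a wrong localization costs only one extra candidate evaluation, which is one plausible reading of why searching more regions helps when the diagnostic signal is uncertain.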

[AI-15] Re-Evaluating Continual Learning with Few-Shot Adaptation

Link: https://arxiv.org/abs/2606.03843
Authors: Amogh Inamdar,Matthew So,Vici Milenia,Richard Zemel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 16 figures

Abstract:Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method’s ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. Through few-shot evaluation with a novel metric – per-shot plasticity – we show that adding ‘foresight’ to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.

[AI-16] EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management KDD2026

Link: https://arxiv.org/abs/2606.03841
Authors: Zherui Yang,Fan Liu,Yansong Ning,Hao Liu
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by KDD2026

Abstract:Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS’s hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at this https URL.

[AI-17] BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

Link: https://arxiv.org/abs/2606.03829
Authors: Alex Wang,Georg Meinhardt,Jacob Katz,Joseph H. Kim,Pratyush K. Chaudhary,Chase Blagden,Eric Xu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.

[AI-18] When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

Link: https://arxiv.org/abs/2606.03532
Authors: Haowei Guo,Baolong Bi,Ruicheng Zhang,Bingqian Sun,Wentao Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher’s update schedule – which governs the temporal coupling between teacher and student – has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that isolation periods, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers state-oblivious collapse: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA’s chronic contamination. To address this, we propose Consolidation-Gated Teacher Refresh (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. With a single shared parameter set and no per-dataset retuning, CGTR achieves zero collapse and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task’s learning dynamics.
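
The idea of gating a teacher refresh on evidence, instead of on a clock, can be sketched as follows. This is an illustration under stated assumptions and not the paper's algorithm: the class `GatedTeacher`, the helper `tail_length`, the 95th-percentile length statistic, and the thresholds `min_gain` and `max_tail_len` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


def tail_length(lengths: Sequence[int], q: float = 0.95) -> int:
    """Approximate q-quantile of response lengths (assumed tail-risk statistic)."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


@dataclass
class GatedTeacher:
    params: List[float]          # frozen teacher parameters
    reward_at_refresh: float     # student reward when the teacher last moved
    min_gain: float = 0.01       # assumed minimum reward improvement
    max_tail_len: int = 512      # assumed limit on tail response length

    def maybe_refresh(
        self, student_params: Sequence[float], reward: float, lengths: Sequence[int]
    ) -> bool:
        """Copy the student into the teacher only if both gates pass."""
        improved = reward >= self.reward_at_refresh + self.min_gain
        tail_safe = tail_length(lengths) <= self.max_tail_len
        if improved and tail_safe:
            self.params = list(student_params)
            self.reward_at_refresh = reward
            return True
        return False  # teacher stays frozen (isolation period continues)


teacher = GatedTeacher(params=[0.0], reward_at_refresh=0.5)
first = teacher.maybe_refresh([1.0], reward=0.6, lengths=[100, 200, 300])
second = teacher.maybe_refresh([2.0], reward=0.9, lengths=[100, 200, 4000])
```

In this toy run the second refresh is rejected even though the reward rose, because the length tail exceeds the limit; the teacher therefore never copies a student that is drifting toward very long outputs.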

[AI-19] High-Precision APT Malware Attribution with Out-of-Scope Resilience

Link: https://arxiv.org/abs/2606.03523
Authors: Peter Williams,Adam Sobey,Erisa Karafili
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and reduce the impact of an intrusion. Malware provides useful attribution evidence, but automated APT malware attribution remains difficult in practice. Existing approaches are typically trained and evaluated as closed-set classifiers over a limited number of known APT groups. In operational environments, however, classifiers are likely to encounter samples from groups not represented during training. Closed-set classifiers are then forced to assign such samples to known groups, producing unsupported and potentially misleading attributions. We present a high-precision APT malware attribution method based on ranked binary classifiers with explicit abstention. Rather than training a single multi-class classifier, our approach trains and tunes two binary classifiers per APT group, ranks the classifiers by validation performance, and applies them sequentially. A sample is attributed only when a classifier provides sufficient evidence; otherwise, it abstains. We evaluate the method on the APT Malware dataset and on a larger combined dataset designed to stress-test out-of-scope behaviour. On the APT Malware dataset, the method achieves higher precision than previously published results on the same dataset. In the most challenging setting, where 87% of test samples came from 60 APT groups excluded from training, the method abstained on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on the samples it classified.
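
Sequential application of ranked binary classifiers with abstention can be sketched in a few lines. The sketch below makes simplifying assumptions that are not from the paper: it uses one scorer per group (the abstract mentions two binary classifiers per group), the `GroupClassifier` record and `attribute` function are hypothetical names, and the scoring functions are toy lambdas over made-up features.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

Sample = Dict[str, float]


class GroupClassifier(NamedTuple):
    group: str                          # APT group this classifier detects
    validation_score: float             # used only to rank classifiers
    score: Callable[[Sample], float]    # confidence that the sample belongs
    threshold: float                    # minimum confidence to attribute


def attribute(sample: Sample, classifiers: List[GroupClassifier]) -> Optional[str]:
    """Apply classifiers in order of validation performance; abstain if none fires."""
    ranked = sorted(classifiers, key=lambda c: c.validation_score, reverse=True)
    for clf in ranked:
        if clf.score(sample) >= clf.threshold:
            return clf.group
    return None  # abstain: no classifier gave sufficient evidence


classifiers = [
    GroupClassifier("GroupA", 0.90, lambda s: s.get("feat_a", 0.0), 0.8),
    GroupClassifier("GroupB", 0.95, lambda s: s.get("feat_b", 0.0), 0.8),
]

known = attribute({"feat_a": 0.9}, classifiers)
out_of_scope = attribute({"feat_a": 0.1, "feat_b": 0.2}, classifiers)
```

Returning `None` is what distinguishes this from a closed-set classifier: a sample from an unseen group is left unattributed instead of being forced onto the nearest known label.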

[AI-20] Post-Hoc Robustness for Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2606.03521
Authors: Siemen Herremans,Ali Anwar,Siegfried Mercelis
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.

[AI-21] Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

Link: https://arxiv.org/abs/2606.03518
Authors: Amjad Ibrahim,Yong Li
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 12 pages

Abstract:As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.

[AI-22] SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

Link: https://arxiv.org/abs/2606.03512
Authors: Charbel Abi Hana,Tatiana Ghantous,Mikael Khalil,Anthony Rizk
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typically rely on either complex reward engineering or hardware-intensive solutions. Recent state-of-the-art frameworks leverage imitation learning to train behavior-specific path planning models from expert demonstrations. However, these approaches face two key limitations: limited generalization to unseen environments and low robustness in demonstration collection. To address these challenges, this work introduces an enhanced framework that focuses on two main contributions: an overhauled annotation tool built on ROS 2, and a novel training strategy that integrates diffusion-based augmentation into baseline behavioral cloning models. A dataset of expert demonstrations is provided and evaluated through ablation studies to assess the robustness of the proposed solution. The enhanced approach outperforms state-of-the-art methods with 39.1% lower Absolute Pose Error (APE) and 33.5% lower Fréchet Inception Distance (FID) while having 93.8% fewer trainable parameters. Moreover, it attains diffusion-level generalization while preserving the real-time, on-edge properties of state-of-the-art models.

[AI-23] ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Link: https://arxiv.org/abs/2606.03503
Authors: Ziyan Liu,Xueda Shen,Yuzhe Gu,Songyang Gao,Kuikun Liu,Guangran Cheng,Chengqi Lyu,Dahua Lin,Wenwei Zhang,Kai Chen
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

[AI-24] Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

Link: https://arxiv.org/abs/2606.03489
Authors: Wenqi Chen,Ziyan Zhang,Bing Wang,Lin Liu,Hengheng Zhang,Zhengsu Chen
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures

Abstract:While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories–generating both secure “golden paths” and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B’s pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.

[AI-25] NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense ACL

Link: https://arxiv.org/abs/2606.03486
Authors: Zhongyang Lin,Ziran Zhao,Feifei Zhai,Pengyuan Liu
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 17 tables. Submitted to ACL ARR

Abstract:Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.

[AI-26] Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation

Link: https://arxiv.org/abs/2606.03483
Authors: Ekaterina Alimaskina,Gleb Molodtsov,Aleksandr Beznosikov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across mHC variants. Our code is publicly available.

[AI-27] StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems KDD2026

Link: https://arxiv.org/abs/2606.03467
Authors: Taiyu Zhu,Yifan Wu,Weilin Jin,Ying Li,Gang Huang
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures. Accepted by KDD 2026

Abstract:LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the WhoWhen benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at this https URL.

[AI-28] MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Link: https://arxiv.org/abs/2606.03203
Authors: Jia Yu,Zilong Wang,Xinyang Jiang,Dongsheng Li,Shuo Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

[AI-29] ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Link: https://arxiv.org/abs/2606.03157
Authors: Ruihui Hou,Siyi Zhu,Ziyue Huai,Guangya Yu,Yongqi Fan,Chunming Wang,Tong Ruan
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient’s condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings – a single-turn static setting and a multi-turn dynamic setting – and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

[AI-30] GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Link: https://arxiv.org/abs/2606.03144
Authors: Noujoud Nader,Ibrahem Aljabea,Patrick Diehl,Deepti Gupta
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 5 figures, 7 tables

Abstract:Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel’s Graph Theory. We evaluate five frontier models – GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 – under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that “correct algorithm, wrong execution” errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.

[AI-31] Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

Link: https://arxiv.org/abs/2606.03137
Authors: Kaiqi Yang,Tai-Quan Peng,Sanguk Lee,Hui Liu
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents’ private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents’ willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.

[AI-32] Uncertainty-Aware Clarification in LLM Agents with Information Gain

Link: https://arxiv.org/abs/2606.03135
Authors: Mengyi Deng,Zhiwei Li,Xin Li,Tingyu Zhu,Ying Zhao,Zhijiang Guo,Wei Wang
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced τ-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.
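
One way to read "Bayesian belief update towards the ground-truth goal" is as the change in log-probability assigned to the true goal before and after the user's answer. The sketch below illustrates that reading only; the abstract does not give the exact formula, so the log-ratio reward, the function names `bayes_update` and `information_gain_reward`, and the toy goals and likelihoods are all assumptions.

```python
import math
from typing import Dict

Belief = Dict[str, float]


def bayes_update(prior: Belief, likelihood: Belief) -> Belief:
    """Posterior over goals given P(answer | goal) for the observed answer."""
    unnorm = {g: prior[g] * likelihood.get(g, 0.0) for g in prior}
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("answer has zero probability under every goal")
    return {g: p / total for g, p in unnorm.items()}


def information_gain_reward(prior: Belief, posterior: Belief, true_goal: str) -> float:
    """Assumed reward: log-probability gained on the ground-truth goal."""
    return math.log(posterior[true_goal]) - math.log(prior[true_goal])


# Toy belief over what the user wants, before asking a clarifying question.
prior = {"book_flight": 0.5, "cancel_flight": 0.25, "change_seat": 0.25}
# Assumed likelihood of the user's answer under each candidate goal.
likelihood = {"book_flight": 0.9, "cancel_flight": 0.1, "change_seat": 0.1}

posterior = bayes_update(prior, likelihood)
reward = information_gain_reward(prior, posterior, "book_flight")
```

Under this reading, a question whose answer leaves the belief unchanged earns zero reward, and a question that shifts belief away from the true goal earns a negative one, so the clarifier is pushed toward questions that actually disambiguate.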

Machine Learning

[LG-0] MLSkip: Data Skipping for ML Filters via Lightweight Metadata

Link: https://arxiv.org/abs/2606.03946
Authors: Mihail Stoian,Mark Gerarts,Pascal Ginter,Andreas Zimmerer,Jan Van den Bussche,Andreas Kipf
Categories: Databases (cs.DB); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments:

Abstract:Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet’s default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of $1.07\times$ over PyTorch in DuckDB.
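
To make the idea of pruning with min-max metadata concrete, the sketch below propagates a row group's per-column minimum and maximum through a small ReLU network using interval bound propagation, a basic neural network verification technique. The abstract does not say which verification tool the authors use, so this is a stand-in; the toy network, the threshold, and the function names are hypothetical.

```python
def interval_affine(lo, hi, weights, bias):
    """Propagate the box [lo, hi] through y = W x + b, one output per row of W."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        l = h = b
        for w, x_lo, x_hi in zip(row, lo, hi):
            if w >= 0:
                l += w * x_lo
                h += w * x_hi
            else:
                l += w * x_hi
                h += w * x_lo
        out_lo.append(l)
        out_hi.append(h)
    return out_lo, out_hi

def interval_relu(lo, hi):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, v) for v in lo], [max(0.0, v) for v in hi]

def can_skip_row_group(col_min, col_max, layers, threshold):
    """True if no row inside the min-max box can satisfy model(x) > threshold."""
    lo, hi = list(col_min), list(col_max)
    for i, (weights, bias) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, weights, bias)
        if i < len(layers) - 1:  # no activation after the output layer
            lo, hi = interval_relu(lo, hi)
    return hi[0] <= threshold

# Hypothetical 2-input, 2-hidden-unit, 1-output ReLU model.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]

skip_a = can_skip_row_group([0.0, 0.0], [1.0, 1.0], layers, threshold=3.0)
skip_b = can_skip_row_group([0.0, 0.0], [10.0, 1.0], layers, threshold=3.0)
print(skip_a, skip_b)
```

The bound is conservative: a `True` result is always safe to act on, whereas a `False` result only means the row group could not be ruled out and must be read.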

[LG-1] Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations

Link: https://arxiv.org/abs/2606.03936
Authors: Niccolò Perrone,Fanny Lehmann,Stefania Fresca,Filippo Gatti
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
Comments:

Abstract:Neural operator surrogates (NO) approximate PDE solutions orders of magnitude faster than numerical solvers, but suffer from spectral bias: high-frequency content is systematically attenuated, limiting reliability where fine-scale structure matters. Sparse sensor measurements of the field are often available too, offering pointwise accuracy without spectral distortion but covering only a small fraction of the domain. We address this by treating NO predictions as auxiliary observations in a diffusion posterior sampling framework. Our method, FreqNO-DPS (this https URL), combines an unconditional score-based diffusion prior, trained on high-fidelity simulations, with diffusion posterior sampling (DPS) conditioned on sparse observations and guided by a frozen neural operator. Naive integration reintroduces the surrogate’s spectral bias; we resolve this with a closed-form, spectrally shaped guidance score that weights the surrogate by its frequency-dependent accuracy and needs no denoiser backpropagation. A distribution-free analysis bounds the approximation error across the frequency-diffusion-time plane and shows the guidance’s frequency dependence is preserved regardless of distributional assumptions. On 3D elastic wavefield prediction at 5% and 2% sensor coverage, the method reaches near-zero spectral bias across all bands, where both the surrogate and sensor-only DPS show systematic high-frequency attenuation. Isotropic guidance, the natural baseline, improves pointwise accuracy but carries the bias into the posterior nearly intact, confirming that frequency-dependent calibration is essential, not merely beneficial. The framework needs only paired surrogate/reference data and exploits no problem-specific structure beyond the residual’s approximate spectral diagonality, verifiable for new surrogates via the coherence diagnostic we provide.

[LG-2] Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

Link: https://arxiv.org/abs/2606.03935
Authors: Carlo Wenig,Raoul-Martin Memmesheimer,Christian Klos
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Comments: 9 pages, 5 figures (main part)

Abstract:The ability to train spiking neural networks is essential for modeling biological neural networks as well as for neuromorphic computing. However, for the extensively used leaky integrate-and-fire (LIF) neurons, arbitrarily small parameter changes can induce spike (dis)appearances that disrupt subsequent activity, leading to unstable neural representations and permanently silent neurons during exact spike-based gradient descent. Recent work shows that a class of neuron models, which includes the quadratic integrate-and-fire (QIF) neuron, avoids these discontinuities and enables continuous and even smooth spike-based gradient descent. However, it remains unclear whether these advantages translate into practice. Here, we demonstrate that they do so via a controlled comparison between networks of LIF and QIF neurons on the popular Spiking Heidelberg Digits dataset. Specifically, in a first step, we perform a thorough hyperparameter search to optimize both models, revealing a clear performance advantage of QIF neurons. In a second step, we visualize the loss and gradient landscapes. Consistent with their inferior performance, we find that the loss landscapes of LIF neurons, which are discontinuous, appear more fragmented and the related gradients more erratic. An analysis of the landscapes of single samples indicates that these features arise from changes in the temporal order of spikes, which often cause disruptive spike (dis)appearances. Overall, our results advocate replacing LIF neurons with neuron models exhibiting continuous spiking dynamics, such as QIF neurons, for gradient descent training.

[LG-3] Contrastive Neural Algorithmic Reasoning for Graph Coloring

Link: https://arxiv.org/abs/2606.03923
Authors: Thien Le,Tianyu Zhao,Melanie Weber
Categories: Machine Learning (cs.LG)
Comments: 52 pages, 5 figures, 45 tables

Abstract:Graph coloring seeks to assign colors to a graph’s nodes so that adjacent nodes receive different colors, using as few colors as possible. Here, we study approximate $k$-coloring, where the goal is to use at most $k$ colors while minimizing the number of monochromatic edges. This problem is central to graph theory and has applications in areas such as scheduling and resource allocation. Recent unsupervised GNN approaches optimize each instance directly, precluding generalization across graph sizes and distributions. We instead propose a contrastive learning framework that learns transferable coloring geometry where the embeddings of same-color nodes align, while adjacent nodes’ representations are pushed toward distinct directions. We analyze the resulting population objective over bounded-size graphs. For unit-norm embeddings, we show that its optima have a line-prototype structure: Representations of nodes of the same color collapse to a shared one-dimensional subspace, and edges connect orthogonal subspaces. This geometry yields stationarity conditions in the supervised setting and is preserved by projected subgradient dynamics under a balanced-coloring assumption. In an unnormalized variant, gradient descent has a max-margin bias governed by a quotient-graph hard-margin problem. Experiments on synthetic and real-world graphs show that contrastive GNN encoders generalize effectively and produce low-conflict colorings, matching and sometimes improving on greedy approaches.
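
The approximate coloring objective in the abstract above (at most k colors, as few monochromatic edges as possible) can be illustrated with a short sketch. The greedy routine below is one simple baseline of the kind the abstract compares against, not necessarily the exact variant used in the paper; the function names and the example graph are assumptions.

```python
def monochromatic_edges(edges, colors):
    """Number of edges whose two endpoints share a color (the conflict count)."""
    return sum(1 for u, v in edges if colors[u] == colors[v])

def greedy_k_coloring(num_nodes, edges, k):
    """Visit nodes in index order; give each the color least used by its colored neighbors."""
    neighbors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    colors = [None] * num_nodes
    for node in range(num_nodes):
        counts = [0] * k
        for nb in neighbors[node]:
            if colors[nb] is not None:
                counts[colors[nb]] += 1
        colors[node] = counts.index(min(counts))  # ties go to the lowest color index
    return colors

# A 5-cycle is an odd cycle, so it cannot be properly colored with 2 colors.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
conflicts_k2 = monochromatic_edges(cycle, greedy_k_coloring(5, cycle, 2))
conflicts_k3 = monochromatic_edges(cycle, greedy_k_coloring(5, cycle, 3))
print(conflicts_k2, conflicts_k3)
```

On the odd cycle, two colors force at least one conflict while three colors allow none, which is the trade-off the approximate formulation measures.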

[LG-4] Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing

Link: https://arxiv.org/abs/2606.03919
Authors: Thomas Maillart,Thibaut Chataing,David Dosu,Paul Bagourd,Julian Jang-Jaccard,Alain Mermoud
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 19 pages, 5 figures, 6 tables. Code and manuscript sources: this https URL. An earlier version was presented at the Global Tech Mining Conference (GTM) 2026 (submission #117)

Abstract:Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to 0.78) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\text{test}} \sim 0.60$-$0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\text{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.

[LG-5] Denoise First Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering

Link: https://arxiv.org/abs/2606.03899
Authors: Xianliang Li,Zihan Zhang,Weiyang Liu,Han Bao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Muon has recently demonstrated strong empirical performance in large language model training, but the theoretical role of momentum in Muon remains unclear. Existing analyses of Muon either remove momentum to study spectral updates in isolation, or retain momentum without explaining why it improves empirical performance. Our work bridges this gap by showing momentum in Muon acts as a spectral filter. Under a structured signal-plus-perturbation gradient model, we prove that momentum suppresses perturbations while preserving the dominant signal, thereby enlarging the spectral gap between them. This enlarged gap stabilizes the singular subspaces of the matrix passed to Muon’s orthogonalization step, making the resulting update more reliable. We further show that applying momentum before orthogonalization achieves provably stronger alignment with the signal component of the gradient than either reversing this order or simply removing momentum. Experiments across diverse tasks, including LLM pretraining, support our theoretical analysis. More broadly, our theory offers a starting point for understanding the benefits of momentum in other matrix-based optimizers.

[LG-6] Attribution via Distributional Paths for Information Revelation

Link: https://arxiv.org/abs/2606.03885
Authors: Kieran A. Murphy,Shameen Shrestha
Categories: Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:Feature attribution methods explain predictions by assigning importance scores to input features. Path-based methods such as Integrated Gradients are especially appealing because they satisfy completeness: attributions sum to the change in model output between a reference state and the input. Yet most path methods define this trajectory in input space, explaining a model through pointwise perturbed inputs along a chosen path. An input-space path integrates the model’s raw response at each point it passes through, with no control over the resolution at which a feature is queried; the early, baseline-adjacent part of the trajectory contributes to the explanation on equal footing with the input itself. Here, we lift path attribution from input space to a space of structured probe distributions around the example of interest, and call our method Reveal-IG. Rather than traversing raw input values, Reveal-IG progressively reveals information about the input and attributes changes in the model’s expected output along this distributional path. The result is a path-attribution framework that retains completeness with respect to the expected model response, and naturally accommodates multiscale image probes and feature-wise uncertainty in tabular data. Synthetic diagnostics show that Reveal-IG avoids path artifacts that affect input-space methods, and across ImageNet classification and tabular regression it produces stable, signed attributions – leading on metrics that use attribution sign while remaining competitive on the rest.
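
The completeness property mentioned in the abstract above can be checked numerically on a toy model. The sketch below implements standard input-space Integrated Gradients along a straight-line path, which is the baseline setting the abstract contrasts with, not Reveal-IG itself. The toy model, its hand-written gradient, and the step count are assumptions for illustration.

```python
import math

def model(x):
    """Toy differentiable model: f(x) = x0 * x1 + sin(x2)."""
    return x[0] * x[1] + math.sin(x[2])

def grad(x):
    """Analytic gradient of the toy model."""
    return [x[1], x[0], math.cos(x[2])]

def integrated_gradients(x, baseline, steps=1000):
    """Midpoint Riemann sum of the gradient along the straight line baseline -> x."""
    attributions = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(len(x)):
            attributions[i] += g[i] * (x[i] - baseline[i]) / steps
    return attributions

x = [2.0, 3.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attr = integrated_gradients(x, baseline)

# Completeness: attributions sum to f(x) - f(baseline), up to integration error.
gap = abs(sum(attr) - (model(x) - model(baseline)))
print(attr, gap)
```

Every point on the straight-line path contributes to the sum with equal weight, including points close to the baseline, which is the behavior the abstract identifies as a limitation of input-space paths.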

[LG-7] Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics

Link: https://arxiv.org/abs/2606.03864
Authors: Thomas Maillart,Thibaut Chataing,Ntorina Antoni,David Dosu,Paul Bagourd,Julian Jang-Jaccard,Alain Mermoud
Categories: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
Comments: 18 pages, 10 figures, 4 tables. An earlier version was presented at Global Tech Mining Conference 2026. Code and data: this https URL

Abstract:We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs – the emergence and intensification of links between research concepts – by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors – particularly Adamic-Adar similarity and degree-based Hadamard measures – consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture – detection, expert translation, institutional integration – that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.

[LG-8] Two-Action Apple Tasting with Switching Costs

Link: https://arxiv.org/abs/2606.03851
Authors: Tommaso Cesari,Roberto Colomboni
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward 0 and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switch actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde{O}(T^{2/3})$ regret guarantees for this problem. The two-action apple-tasting graph was the natural candidate for the missing $\Omega(T^{2/3})$ obstruction in the switching-cost classification: such a lower bound would have transferred to a large family of still-unclassified feedback graphs. We prove that this obstruction is not there: the oblivious minimax expected regret for this problem satisfies \[ \frac12\sqrt3\cdot\sqrt{T} \le R_T^\star \le 2\sqrt3\cdot\sqrt{T}. \]
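
The interaction protocol described in the abstract above can be written out as a small simulator. This is a sketch of the stated rules only, not of any algorithm from the paper. Two details are assumptions: the first round incurs no switching cost, and switching costs are charged to the learner when computing regret.

```python
REVEAL, BLIND = 0, 1

def play(actions, hidden_values, switch_cost=1.0):
    """Return (net reward, feedback list) for a fixed sequence of actions."""
    total = 0.0
    feedback = []
    prev = None
    for action, x in zip(actions, hidden_values):
        if prev is not None and action != prev:
            total -= switch_cost       # one unit per switch
        if action == BLIND:
            total += x                 # earns x_t but observes nothing
            feedback.append(None)
        else:
            feedback.append(x)         # earns 0 and observes x_t
        prev = action
    return total, feedback

def regret(actions, hidden_values):
    """Regret against the best fixed action, which never switches."""
    best_fixed = max(0.0, sum(hidden_values))  # always-reveal vs always-blind
    net, _ = play(actions, hidden_values)
    return best_fixed - net

xs = [0.5, 0.5, -1.0, 1.0]
actions = [REVEAL, BLIND, BLIND, REVEAL]
net, seen = play(actions, xs)
print(net, seen, regret(actions, xs))
```

The example makes the tension visible: the learner only learns the hidden values on rounds where it earns nothing, and every change of mind costs a full unit.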

[LG-9] Text-attributed Graph Condensation via Text Selection and Attribute Matching

Link: https://arxiv.org/abs/2606.03839
Authors: Haowei Han,Yuxiang Wang,Guojia Wan,Hao Wang,Shanshan Feng,Hao Huang,Jiawei Jiang,Xiao Yan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Text-Attributed Graph (TAG) is an important type of graph structured data, where each node has a text description. TAG models usually train a Graph Neural Network (GNN) and language model jointly, which leads to high space and time consumption, especially on large datasets. To mitigate this, we propose TAGSAM, a condensation method that compresses TAGs while preserving training accuracy. TAGSAM comes with two key designs, i.e., subgraph text Selection and Attribute similarity Matching, which compress the text description and graph topology of TAG, respectively. For the texts, subgraph text selection selects and merges representative text chunks from multiple related text descriptions by maximizing mutual information. For the graph topology, popular condensation methods based on Matching Training Trajectories (MTT) suffer from high variance, which hinders accuracy. Our attribute similarity matching mitigates this issue by aligning stable similarity matrices. We evaluate TAGSAM against six state-of-the-art baselines, where it showcases superior performance. For the same compressed size, TAGSAM improves upon the best-performing baseline by an average of 4.9% in accuracy. Furthermore, it maintains competitive training accuracy even when the TAG is condensed to just 1% size. Our code is available at this https URL

[LG-10] Online Learning with Gradient-Variation Interval Regret

Link: https://arxiv.org/abs/2606.03831
Authors: Yan-Feng Xie,Shuche Wang,Peng Zhao,Zhi-Hua Zhou
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:This paper investigates non-stationary online learning using the metric of interval regret, which requires an online algorithm to perform well over every time interval. We propose the first online learning algorithm that achieves an interval regret bound scaling with gradient variation, a fundamental measure of the cumulative change in online function gradients, which relates to various problem-dependent quantities and is closely connected to stochastic optimization and other problems. Our method employs a simple and efficient two-layer online ensemble structure that achieves strong theoretical guarantees. Specifically, it enjoys a regret bound that simultaneously adapts to various problem-dependent quantities while also preserving the minimax-optimal rate in the worst case. Moreover, recognizing the challenge of hyperparameter tuning, we introduce a Lipschitz- and smoothness-agnostic variant that automatically adapts to these potentially unknown constants. This is primarily enabled by a novel Lipschitz-adaptive meta algorithm, which may be of independent interest. Beyond interval regret, our method also yields broader implications: it provides versatile bounds for interval dynamic regret, a stronger measure that competes with changing comparators over any interval, and yields the first piecewise characterization for stochastic extended adversarial optimization. Theoretical findings are validated by experiments.
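
For readers unfamiliar with the two quantities named in the abstract above, their usual definitions in the online convex optimization literature are sketched below. The notation (losses $f_t$, decisions $x_t$, feasible set $\mathcal{X}$, horizon $T$) is assumed here for illustration and may differ from the paper's own, in particular in how gradient variation is restricted to an interval.

```latex
% Regret over an interval [r, s] \subseteq [1, T]; an interval-regret guarantee
% must hold for every such interval simultaneously.
\mathrm{Reg}_{[r,s]} = \sum_{t=r}^{s} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=r}^{s} f_t(x)

% Gradient variation: cumulative change between consecutive loss gradients.
V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \left\| \nabla f_t(x) - \nabla f_{t-1}(x) \right\|_2^2
```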

[LG-11] Demystifying Pipeline Parallelism: First Theory for PipeDream

Link: https://arxiv.org/abs/2606.03498
Authors: Ivan Ilin,Peter Richtárik
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 40 pages, 4 figures

Abstract:Training modern machine learning models increasingly requires computation to be distributed across many accelerators. Data parallelism remains the default choice and is often paired with tensor-parallel sharding, but model parallelism becomes unavoidable once parameters, activations, or optimizer states no longer fit on a single device. This paper studies pipeline model parallelism through the lens of PipeDream (PD) (Harlap et al., 2018). Our first contribution is theoretical: we introduce Randomized PipeDream (RPD), a stale block-SGD abstraction that yields, to our knowledge, the first clean nonconvex convergence guarantee for a PD-style method. Our second contribution is a scaling diagnosis: we prove that the delay induced by steady-state PD grows as $S^2 - S/2 + O(1)$ for $S$ stages, so the stale-read contribution in the convergence theorem scales as $\Theta(\gamma^2 S^4)$, equivalently as $\Theta(S^4/K)$ in the tuned-rate form. Our third contribution is a comparison with LocalSGD, whose periodic model averaging trades weight staleness for synchronization bubbles. In our reported simulated-time experiments, the better-performing method depends on the objective: PD performs better on the quadratic objective and on a small language-modeling training-loss task, while for logistic regression LocalSGD becomes superior as the number of stages increases.

[LG-12] HiSE: A Lightweight Hierarchical Semantic Explainer for Heterogeneous Graph Neural Networks

Link: https://arxiv.org/abs/2606.03495
Authors: Zongrui Li,Yuhang Zhao,Ying Zhao,Yuanzhao Guo,Qiang Huang,Yuan Tian
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Heterogeneous graph neural networks (HGNNs) have demonstrated remarkable performance in modeling complex relational data; however, their interpretability in high-stakes applications remains a critical challenge. Existing explanation methods suffer from two major limitations: on the one hand, the generated explanations fail to reflect the inherent semantic hierarchy of HGNNs, resulting in a lack of fidelity to the model’s internal decision-making mechanism; on the other hand, feature explanations often rely on complex search or perturbation mechanisms, leading to excessive computational complexity and poor efficiency. To address these issues, we propose HiSE, a lightweight feature-oriented interpretable model for HGNNs. HiSE achieves semantically aware feature explanations through hierarchical semantic modeling: at the semantic level, local surrogate models based on the Least Absolute Shrinkage and Selection Operator (LASSO) are employed to learn sparse feature representations under each semantic view; at the cross-semantic level, the contributions of different semantic views are adaptively characterized via KL divergence to produce a unified explanation. Extensive experiments demonstrate that HiSE outperforms existing methods in terms of fidelity, robustness, and cross-semantic explanation capability, while its lightweight framework incurs low computational overhead, enabling efficient application to large-scale, complex real-world heterogeneous graphs.

[LG-13] Bayesian Tensor Decomposition with Diffusion Model Prior ICML2026

Link: https://arxiv.org/abs/2606.03212
Authors: Zerui Tao,Qibin Zhao
Categories: Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:Low-rank tensor decomposition (TD) is usually effective on clean, fully observed data, but it often degrades under severe missingness or noise. Low-rankness is itself a useful but limited structural prior, and additional handcrafted priors (e.g., sparsity or smoothness) still fall short of capturing the rich statistics of real-world data. To compensate for this weak inductive bias under heavy corruption, one would like to inject a learned, data-driven prior; however, the state-of-the-art diffusion models are not readily compatible with current TD and tractable posterior inference. To address these challenges, we introduce DiffBCP, a hybrid-prior Bayesian CP decomposition framework that couples a cumulative shrinkage process prior over the CP factors for automatic rank selection with an off-the-shelf pre-trained diffusion model as an implicit data prior on the reconstructed tensor. To make posterior inference tractable despite the coupling among the likelihood, low-rank constraint, and diffusion prior, we develop a split Gibbs sampler: CP factors admit conjugate updates, while the diffusion block is sampled via low-rank-guided denoising. A noise-adaptive coupling schedule further reduces sensitivity to hand-tuned annealing. Experiments on image inpainting and denoising, including high-resolution out-of-distribution images, show consistent gains over Bayesian, nonlinear, and plug-and-play TD baselines.

[LG-14] Critical evaluation of PINN for FWD inverse analysis and differentiable FEM as an alternative

Link: https://arxiv.org/abs/2606.03210
Authors: Yongjin Choi,Hyeonbin Moon,Seunghwa Ryu
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Automatic-differentiation-based inverse analysis methods, including physics-informed neural networks (PINNs) and differentiable programming, have recently shown great promise due to their ability to compute accurate gradients and convergence efficiency. However, their applicability to falling weight deflectometer (FWD) backcalculation remains unexplored. This study critically evaluates PINN-based inverse analysis for a multilayer pavement system and investigates differentiable finite element method (DiffFEM) as an alternative based on a synthetic benchmark. The standard PINN does not recover layer moduli because of the sharp domain discontinuities inherent to layered pavement systems. Although we use an extended PINN with domain decomposition (XPINN), which shows better performance on discontinuous domains, its performance remains highly sensitive to loss weighting and network architecture, and degrades under measurement noise. By contrast, DiffFEM consistently achieves more accurate, stable, and computationally efficient inversion results. These results indicate that DiffFEM, which enforces the governing physics as a hard constraint, yields better accuracy, robustness, and computational efficiency than PINN-based approaches, in which the governing physics is imposed as a soft constraint through the loss function. More broadly, the findings suggest that the choice between PINN- and DiffFEM-based inverse analysis needs careful consideration, with DiffFEM offering practical advantages when an efficient and robust differentiable forward solver is available.

[LG-15] DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

Link: https://arxiv.org/abs/2606.03209
Authors: Yunsheng Yuan,Shaowei Li,Kai Wang,Zhongyuan Sun,Zheng Zhang,Kai Han,Jun Luo,Feng Li
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.

[LG-16] Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching

Link: https://arxiv.org/abs/2606.03199
Authors: Alston Lo,Luka Mucko,Austin H. Cheng,Andy Cai,Alastair J. A. Price,Wojciech Matusik,Alán Aspuru-Guzik
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU-years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large-scale flow matching model that generates redundancy-free unit cells and replaces triangle layers with pure pair-bias attention. Clari requires only atom types and bonds as input and does not need an RDKit-sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal’s test sets, we surpass OXtal’s solve rate while obtaining a speedup of $15$-$30\times$. Because Clari also models explicit hydrogens, it supports inference-time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top-30 by energy, we further improve solve rate while maintaining a speedup of $5$-$8\times$. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large-scale virtual screening of organic solids practical. Code is available at this https URL.

[LG-17] Auditing Engagement Incentives in the Kidfluencer Ecosystem: A Multimodal Weak Supervision Approach

Link: https://arxiv.org/abs/2606.03173
Authors: Zijing Wei, Chao Peter Yang, Xuanjie Chen
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)


Abstract: The rise of ‘kidfluencers’ on YouTube has raised ethical concerns about child digital labor and exploitation. While emerging legislation attempts to regulate this ecosystem, empirical evidence linking exploitation to engagement remains scarce, given the difficulty of operationalizing exploitation at scale. This study presents a multimodal AI audit of 5,051 videos across 79 kidfluencer channels, using weak supervision to detect exploitation signals without large-scale manual labels. We aggregate noisy labeling functions – including LLM-based classification of titles and GPT-4 Vision analysis of thumbnails and descriptions across six literature-grounded dimensions – to assign a probabilistic exploitation score to each video. A multi-annotator validation study (N=107) shows strong agreement with human judgment (macro-average F1 = 0.911) and high sensitivity for overall exploitation risk (recall = 0.960, F1 = 0.793). Our findings reveal a significant engagement premium for performative labor, emotional bait, and privacy violations. Exploitation scores correlate with view counts (Spearman $\rho = 0.229$, $p < 10^{-50}$), and mixed-effects regression controlling for channel-level variation shows that a one-unit increase in exploitation score yields a $4.4\times$ increase in views ($p < 0.001$). Within-channel analyses indicate median view boosts of +65.6% for emotional bait and +56.0% for performative content (FDR-corrected $p < 0.001$), with effects holding in same-year robustness checks ($p = 0.030$). Explicit commercial content (product placement), by contrast, shows no premium (-3.8%, n.s.), suggesting the platform rewards commodification of the child’s identity and labor over traditional advertising. These findings challenge policy frameworks focused solely on financial trusts, showing that engagement is systematically tied to the intensive, performative labor of children.

[LG-18] SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

Link: https://arxiv.org/abs/2606.03169
Authors: Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang, Jie Zhou
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM)


Abstract:Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post-training for preference optimization such as lyrics and text-prompt alignments, SketchSong achieves competitive results against strong, post-trained open-source systems, demonstrating the effectiveness of our overall design.

[LG-19] How Visible Are Silent Manipulation Failures? An Observability Study of False-Success Detection in Simulated Robot Episodes

Link: https://arxiv.org/abs/2606.03134
Authors: Aarav Bedi (University of California, Berkeley)
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures


Abstract:Imitation-learning policies for robot manipulation inherit the quality of the success labels attached to their training episodes, and those labels are usually produced by the robot’s own success check. A particularly damaging error is the false success: an episode the robot logs as a success when the task outcome was actually wrong. We ask a narrow but practical question about these episodes. Once an episode has already been flagged as a success, how much of the information needed to overturn that label is present in proprioception, and how much requires vision? We build a simulated testbed on two bimanual ALOHA tasks, induce failures through environment perturbations rather than label edits, label every episode by privileged simulator state that the detector never sees, and keep only episodes the robot flagged as successful. We then compare detectors restricted to proprioception against a vision-based detector. We find that recoverability spans a wide range: in cube transfer the false successes are almost fully recoverable from joint data alone, while in peg insertion proprioception recovers only part of them and a vision detector closes most of the gap. We also show that the proprioceptive separability we measure rests on velocity differences far below any realistic sensor noise floor, so it is best read as an optimistic upper bound that a noiseless simulator inflates. We release the generation and evaluation pipeline.

[LG-20] HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

Link: https://arxiv.org/abs/2606.03131
Authors: Shuang Liu, Yuxuan Bo, Qiuyang Zhao, Caiyue Huang, Xiaorong Chen, Yanguang Liu, Mengnan Du
Subjects: Machine Learning (cs.LG)


Abstract: Reward models are central to large language model (LLM) alignment, but they remain vulnerable to reward hacking. To evaluate reward-model robustness, we introduce RewardHackBench, which contains 13 reward-hacking patterns covering real-life high-stakes domains and general settings, and we find severe failures on specific subcategories across eight reward models. To mitigate these failures, we propose HARVE, a training-free reward-head editing method for scalar reward models. Instead of fine-tuning the reward model, HARVE identifies a multi-directional hacking subspace from residual stream directions associated with selected hacking subcategories, and removes the component of the reward-head vector aligned with that subspace. This directly reduces the reward head’s sensitivity to hacking-related features using only a small set of contrastive gold-hacked examples, without gradient updates or fine-tuning. Comprehensive experiments across eight reward models indicate that HARVE improves hacking robustness, outperforms fine-tuning baselines, and preserves the reward models’ general capability. Further analyses suggest that reward hacking is better captured as a multidimensional residual-space structure than by isolated surface cues.
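
The subspace-removal step described in the HARVE abstract can be illustrated as an orthogonal projection. The sketch below is not the authors' implementation: the function names (`orthonormal_basis`, `remove_subspace`), the toy 3-dimensional vectors, and the use of Gram-Schmidt are all assumptions made for illustration. The abstract does not say how the hacking directions are estimated or how the subspace is orthogonalized, so the directions here are simply given as input.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt: turn possibly correlated directions into an orthonormal basis.

    Directions that are (numerically) linear combinations of earlier ones are dropped.
    """
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            coeff = dot(v, b)
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:
            basis.append([vi / norm for vi in v])
    return basis


def remove_subspace(head, directions):
    """Return `head` minus its projection onto span(directions).

    `head` stands in for a scalar reward-head weight vector and `directions`
    for hypothetical hacking directions; both are illustrative assumptions.
    """
    edited = list(head)
    for b in orthonormal_basis(directions):
        coeff = dot(head, b)
        edited = [e - coeff * bi for e, bi in zip(edited, b)]
    return edited


# Toy usage: two hacking directions spanning the x-y plane.
head = [1.0, 2.0, 3.0]
hacking_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(remove_subspace(head, hacking_directions))  # [0.0, 0.0, 3.0]
```

After the edit, the vector has zero inner product with every direction in the subspace, so a linear reward head would no longer respond to features lying along those directions. Components outside the subspace are left unchanged.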

[LG-21] Synthetic Hallucinations Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation

Link: https://arxiv.org/abs/2606.03130
Authors: Mahdi Erfanian, Nelson Daniel Troncoso, Aashna Garg, Amabel Gale, Xiaoyu Liu, Pareesa Ameneh Golnari, Shengyu Fu
Subjects: Machine Learning (cs.LG)


Abstract: Small open-source code models that power IDE autocomplete still emit hallucinated Fill-in-the-Middle (FIM) completions: syntactically natural calls to methods, parameters, variables, and imports that do not exist in the surrounding project. Existing mitigations either require per-language execution sandboxes that do not apply at mid-keystroke or preference-optimisation pipelines that need large human-labelled corpora. We propose an execution-free alternative: use frontier code models to synthesise plausible-but-wrong completions as hard negatives, then leverage the contrast between these synthetic hallucinations and the ground-truth developer edit as a supervised fine-tuning signal. Our pipeline scrapes multilingual FIM contexts from public GitHub across eight languages and asks a panel of three frontier generators to produce one hard negative per context for each of four hallucination types drawn from the Delulu taxonomy, a Docker-verified multilingual FIM hallucination benchmark, yielding a paired chosen/rejected dataset. Fine-tuning Qwen2.5-Coder-7B-Instruct on a 100K-row curated subset lifts Delulu exact match by +18.8 points and edit similarity by +0.22 on every language and every type, while also improving every HumanEval-Infilling split and every SAFIM subset. The same recipe at 3B lifts Delulu by +12.8 EM with a small, characterised general-FIM trade-off. Five-axis ablations (size, type mix, language coverage, base-model family, and a difficulty-aware fool rate) plus a head-to-head SFT vs. DPO/ORPO comparison map which design choices drive the gain. We release the full pipeline source code – generation, fool-rate LLM judging, curation, and the FIM fine-tuning recipe – so that the experiments in this paper can be reproduced end-to-end on any permissively licensed corpus.

