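arXiv Papers Today | 2026-05-28

This post lists the latest papers retrieved from arXiv.org on 2026-05-28. It is updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is retrieved from arXiv.org and updated automatically on a schedule at around 12:30 each morning.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures will be fixed the same day whenever possible.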

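Natural Language Processing: 191 papers (Computation and Language (cs.CL))
Artificial Intelligence: 372 papers (Artificial Intelligence (cs.AI))
Computer Vision: 133 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 274 papers (Machine Learning (cs.LG))
Multiagent Systems: 21 papers (Multiagent Systems (cs.MA))
Information Retrieval: 39 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 28 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems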

[MA-0] Rethinking Memory as Continuously Evolving Connectivity

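[Quick Read]: This paper addresses the brittleness of existing memory-augmented large language model (LLM) agents in dynamic agentic environments: conventional methods treat memory as a static repository with predefined representations and fixed retrieval pipelines, so they adapt poorly when feedback, task variation, and heterogeneous signals continuously reshape what is remembered and how it is connected. The key to the solution is FluxMem, a framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by a single metric for memory generalizability and evolutionary maturity. Experiments show that FluxMem achieves consistent state-of-the-art performance on three fundamentally different benchmarks (LoCoMo, Mind2Web, and GAIA), demonstrating strong adaptation and generalization in complex agentic environments.

Link: https://arxiv.org/abs/2605.28773
Authors: Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang
Affiliations: Zhejiang University; Alibaba Group; MemTensor; Tongji University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: Ongoing work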

Abstract:Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in this https URL.

[MA-1] SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

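[Quick Read]: This paper addresses the following problem: large amounts of idle compute (GPU cycles on personal workstations, idle inference servers, and edge devices) cannot be shared safely and profitably because no incentive-aligned protocol exists. Existing approaches either rely on a central coordinator (cloud marketplaces), require heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). The key to the solution is SwarmHarness, a decentralized protocol built from three interlocking components: SwarmRegistry, built on a distributed hash table (DHT), for peer discovery and capability advertisement; SwarmRouter, which dispatches tasks using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism based on a Shapley-value approximation. Nodes earn compute credits by serving tasks and spend them to submit tasks, while idle nodes that never contribute lose credits and routing priority, which creates a self-regulating participation economy. This design pushes nodes to specialize in high-reward skills, and routing signals act as digital pheromones, so the network exhibits emergent collective intelligence analogous to biological swarms and offers a foundational primitive for autonomous distributed AI agent networks that operate without human intermediation.

Link: https://arxiv.org/abs/2605.28764
Authors: Edwin Jose
Affiliations: Western Michigan University
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: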

Abstract:Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy. As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.
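
The abstract says SwarmRouter scores nodes with a utility function over capability, load, latency, and trust, but it does not give the formula. The following is a minimal sketch under the assumption of a simple weighted linear utility; the `Node` fields, the weights, and the `max_latency_ms` cap are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capability: float  # assumed skill/task match score in [0, 1]
    load: float        # assumed current utilisation in [0, 1]
    latency_ms: float  # assumed measured round-trip latency
    trust: float       # assumed reputation score in [0, 1]


def utility(node, w_cap=0.4, w_load=0.2, w_lat=0.2, w_trust=0.2,
            max_latency_ms=500.0):
    """Hypothetical linear utility: reward capability and trust,
    penalise load and latency. The weights are illustrative only."""
    latency_penalty = min(node.latency_ms / max_latency_ms, 1.0)
    return (w_cap * node.capability
            + w_load * (1.0 - node.load)
            + w_lat * (1.0 - latency_penalty)
            + w_trust * node.trust)


def route(nodes):
    """Dispatch the task to the node with the highest utility."""
    return max(nodes, key=utility)


nodes = [
    Node("a", capability=0.9, load=0.8, latency_ms=50, trust=0.9),
    Node("b", capability=0.7, load=0.1, latency_ms=400, trust=0.5),
    Node("c", capability=0.3, load=0.0, latency_ms=20, trust=1.0),
]
best = route(nodes)
print(best.name, round(utility(best), 3))
```

In this toy setup the busy but capable and trusted node `a` still wins over the idle, low-capability node `c`; shifting weight toward `w_load` would change that, which is the kind of trade-off such a router has to tune.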

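[MA-2] Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers (ICML 2026)

[Quick Read]: This paper asks how multimodal large language models (MLLMs) use the provided examples when classifying images under few-shot in-context learning (ICL), and whether concept-level explainability methods can expose their internal reasoning. The key to the approach is a systematic evaluation of the concept-based explainability of frozen MLLMs under ICL across five conditions of increasing formal rigor, from baseline classification to Description Logics (DL) axiom generation, with four state-of-the-art MLLMs assessed by an independent LLM-as-a-judge pipeline. The study finds that although MLLMs excel at visual classification, they lack the instruction tuning needed for formal, machine-verifiable explanations: forcing them to generate structured explanations monotonically degrades predictive accuracy (from 93.8% to 90.1%). When models do articulate class-discriminative visual features, explanation quality correlates strongly with correct predictions, which suggests that current models are better at perception than at reasoning.

Link: https://arxiv.org/abs/2605.28215
Authors: Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: Accepted to the CompLearn Workshop at ICML 2026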

Abstract:In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

[MA-3] Out of Sight Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems

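[Quick Read]: This paper asks whether, in latent-based multi-agent systems, attack information can hide in invisible latent states and remain effective during normal execution, thereby bypassing detection that relies on visible text. The key to the approach is a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Experiments show that perturbing the latent space alone can substantially degrade task performance, and the effect is stronger when the intervention targets inter-agent key-value cache (KV-cache) handoffs. Control analyses further rule out arbitrary perturbations or invalid generation as the cause of the degradation. Latent collaboration therefore does not remove attack risk; it shifts the risk into less observable execution states, which calls for safeguards beyond text-level inspection.

Link: https://arxiv.org/abs/2605.28214
Authors: Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu
Affiliations: Southeast University, Nanjing, China; Peking University, Beijing, China
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 27 pages, 7 figures, 3 tables. Preprint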

Abstract:Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.

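[MA-4] LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning (ACL 2026)

[Quick Read]: This paper addresses two core challenges in applying graph-based retrieval-augmented generation (GraphRAG) to legal reasoning. First, legal corpora are multi-granular and heterogeneous (cases, statutory articles, and interpretive documents), and a flat knowledge graph cannot adequately distinguish factual details, applied rules, and abstract principles, which limits retrieval accuracy. Second, existing RAG methods pass retrieved context directly to a large language model (LLM) without verifying the evidence, which makes reasoning opaque and error-prone. The key to the solution is the LegalGraphRAG framework, which contributes (1) a hierarchical legal knowledge graph that enables retrieval at the appropriate abstraction level, and (2) a multi-agent system in which a Researcher retrieves candidate evidence, an Auditor rigorously verifies it, and an Adjudicator synthesizes the verified evidence into a final judgment, making the reasoning more interpretable and reliable. Experiments show that the method outperforms existing GraphRAG baselines in accuracy and trustworthiness.

Link: https://arxiv.org/abs/2605.28120
Authors: Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su
Affiliations: Xiamen University; The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 30 pages, 18 figures, ACL 2026 Main Conference. Project page: this https URL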

Abstract:Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at this https URL.

[MA-5] Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems

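[Quick Read]: This paper addresses the sustainability concern raised by the high inference energy cost of multi-agent systems (MAS) in autonomous software engineering (SWE). The core problem is that MAS generate many redundant output tokens: an output token consumes far more energy than an input or cached token (30 to 1,000 times more), and agents repeatedly re-explore the same repository regions, which inflates output further. The key to the solution is Librarian, a persistent search sub-agent that tracks repository-search history, suppresses redundant exploration across agents, and returns short references instead of full file excerpts, which substantially reduces the number of output tokens. On SWE-Bench Verified, Librarian reduces the per-episode GPU energy consumption of existing MAS by up to 25% while preserving task performance.

Link: https://arxiv.org/abs/2605.27787
Authors: Seunghyuk Cho, Sunghyun Choi, Jaeseung Heo, Youngbin Choi, Saemi Moon, MoonJeong Park, Dongwoo Kim
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: 19 pages, 4 figures, 12 tables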

Abstract:Multi-agent systems (MAS) have substantially advanced autonomous software engineering (SWE), but their growing inference energy demands raise sustainability concerns. In this paper, we demonstrate that this cost is concentrated in an overlooked source: redundant output tokens generated across agents. Two empirical findings ground this claim. First, our per-token energy attribution for MAS reveals a sharp asymmetry: an output token consumes 30 to 1,000 times more energy than an input or cached token. Second, MAS inflate per-episode output because agents repeatedly re-explore overlapping repository regions. To address this inefficiency, we propose Librarian, a persistent search sub-agent that tracks repository-search history and suppresses redundant exploration actions across agents. By returning short references to file regions instead of full file excerpts, Librarian further reduces output-token volume. On SWE-Bench Verified, Librarian reduces per-episode GPU energy consumption of existing multi-agent SWE systems by up to 25% while preserving task performance.
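
The two mechanisms described above, remembering earlier searches and returning short references instead of excerpts, can be illustrated with a small cache. This is only a sketch of the general idea, not the paper's implementation: the `Librarian` class shape, the `search_fn` backend, and the `path:start-end` reference format are all assumptions made for illustration.

```python
class Librarian:
    """Toy persistent search helper shared by several agents (hypothetical)."""

    def __init__(self, search_fn):
        # search_fn is an assumed backend: query -> list of (path, start, end)
        self._search = search_fn
        self._history = {}
        self.backend_calls = 0

    def find(self, query):
        # Normalise case and whitespace so repeated queries hit the history.
        key = " ".join(query.lower().split())
        if key not in self._history:
            self.backend_calls += 1
            hits = self._search(key)
            # Store short references rather than full file excerpts.
            self._history[key] = [f"{p}:{s}-{e}" for p, s, e in hits]
        return self._history[key]


def fake_search(query):
    # Stand-in for a real repository search tool.
    return [("src/app.py", 10, 42)]


lib = Librarian(fake_search)
first = lib.find("Parse Config")      # agent 1 searches
second = lib.find("parse   config")   # agent 2 repeats the same search
print(first, lib.backend_calls)
```

The second call is answered from history, so the repository is explored once, and both agents receive a reference of a few tokens rather than the file contents.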

[MA-6] Decoupled Intelligence: A Multi-Agent LLM Framework for Controllable Traffic Scenario Generation in SUMO

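[Quick Read]: This paper targets the reasoning failures, parameter inconsistency, and lack of systematic state management that existing monolithic agent architectures exhibit in microscopic traffic simulation (such as SUMO) because of the complexity of end-to-end workflows. The key to the solution is a multi-agent collaborative framework that decouples the simulation pipeline into specialized roles (Planner, Builder, Demand, Runner, and Analyst), coordinated by a state-persistent Orchestrator built on the Model Context Protocol (MCP), which provides seamless data handover and environmental consistency across distributed agent actions. The architecture supports closed-loop iterative refinement, so that simulation results are improved until they satisfy user-defined Key Performance Indicators (KPIs), and it significantly raises task success rates and parameter accuracy.

Link: https://arxiv.org/abs/2605.27685
Authors: Shuyang Li, Ruimin Ke
Affiliations: Rensselaer Polytechnic Institute
Categories: Multiagent Systems (cs.MA); Human-Computer Interaction (cs.HC)
Comments: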

Abstract:The integration of Large Language Models (LLMs) with microscopic traffic simulation offers a promising path toward autonomous urban planning and intelligent transportation analysis. However, existing monolithic agent architectures often struggle with the complexity of end-to-end simulation workflows, leading to reasoning failures, parameter inconsistency, and a lack of systematic state management. This paper proposes a novel multi-agent collaborative framework designed to automate the entire lifecycle of traffic simulation in SUMO (Simulation of Urban Mobility). Our approach decouples the simulation pipeline into specialized roles, including Planner, Builder, Demand, Runner, and Analyst, coordinated by a high-level reasoning engine. We introduce a state-persistent Orchestrator leveraging the Model Context Protocol (MCP) to ensure seamless data handover and environmental consistency across distributed agent actions. This architecture enables a robust closed-loop refinement process, where simulation outcomes are iteratively analyzed and optimized to satisfy user-defined Key Performance Indicators (KPIs). Experimental results through role ablation studies demonstrate that the proposed multi-agent framework significantly enhances task success rates and parameter accuracy compared to single-agent baselines. Furthermore, case studies on real-world network extraction and traffic optimization highlight the system’s capability to bridge the gap between high-level natural language intent and low-level simulation execution.

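[MA-7] Intelligence as Managed Autonomy: Failure Escalation and Governance for Agentic AI Systems

[Quick Read]: This paper asks how to manage hallucination and persistent but unjustified action as autonomous and agentic AI systems scale in robotic and human-machine environments. It traces these failures to the assumption of unbounded autonomy, that is, the presumption that an agent should keep operating under all circumstances. The key to the solution is a theory of managed autonomy, which defines intelligent behavior as the capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control. The author instantiates the theory with the SMARt model (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions), which has four tiers: Stable, Meta-cognitive, Assisted, and Regulated. A timed, guarded Petri net formulation gives the system provably bounded properties under specified conditions, so that the architecture mandates escalation, constrains invalid outputs, and ensures governance reachability. Domain-specific trigger sets let SMARt systematically preserve safety, provided completeness and soundness criteria are met, and they support safe expansion of the operational scope over time. The paper concludes that formalizing failure management within the autonomy lifecycle is a crucial step toward reliable and governed AI.

Link: https://arxiv.org/abs/2605.27621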

Abstract:As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent’s operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
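
The paper formalizes SMARt as a timed, guarded Petri net. As a much simpler illustration of the escalation idea, the sketch below uses a plain finite-state machine over the four tiers instead. The uncertainty thresholds, the one-tier-at-a-time transitions, and the treatment of Regulated as an absorbing state are assumptions made here, not details from the paper.

```python
from enum import IntEnum


class Tier(IntEnum):
    STABLE = 0
    META_COGNITIVE = 1
    ASSISTED = 2
    REGULATED = 3


def step(tier, uncertainty, escalate_at=0.6, recover_at=0.3):
    """One guarded transition (hypothetical thresholds): escalate one tier
    when uncertainty is high, de-escalate one tier when it is low."""
    if tier is Tier.REGULATED:
        return tier  # assumed absorbing: control has been surrendered
    if uncertainty >= escalate_at:
        return Tier(tier + 1)
    if uncertainty <= recover_at and tier > Tier.STABLE:
        return Tier(tier - 1)
    return tier


def run(readings):
    """Replay a sequence of uncertainty readings and record the tiers visited."""
    tier = Tier.STABLE
    trace = [tier]
    for u in readings:
        tier = step(tier, u)
        trace.append(tier)
    return trace


trace = run([0.7, 0.2, 0.8, 0.9, 0.9, 0.1])
print([t.name for t in trace])
```

In this run the agent recovers once from the Meta-cognitive tier, then escalates through Assisted to Regulated and stays there even when uncertainty later drops, which mirrors the point that the architecture, not the agent, decides when control is handed back.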

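[MA-8] Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

[Quick Read]: This paper addresses the difficulty of quantifying individual agents' contributions in multi-agent systems (MAS), a key bottleneck for system optimization. Existing methods lack a rigorous, unified framework for credit assignment, so they cannot reliably identify which agents are critical to overall performance. The core of the solution is to formalize agent attribution as a cooperative game parameterized by the coalition distribution, the removal protocol, and the target metric. Within this framework, the authors find that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods at a fraction of the computational cost. They also show that removal protocols induce distinct games: agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to approximate this behavior faithfully. The paper further introduces attribution via model replacement: substituting the underlying models of low-contribution agents improves task performance by up to 17% and reduces cost by up to 35% across three benchmarks. Finally, an audit of a medical MAS shows that contributions to diagnostic accuracy and to ethical behavior are often decoupled, and intervening on counterproductive roles improves ethics alignment without harming diagnostic accuracy. Overall, the work offers a principled, cost-effective framework for MAS attribution and intervention.

Link: https://arxiv.org/abs/2605.27621
Authors: Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee
Affiliations: University of Washington
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: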

Abstract:As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low-contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost-effective MAS attribution and intervention.
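
Leave-One-Out attribution as described above is easy to state in code: score each agent by how much the target metric drops when that agent is removed from the full team. The sketch below is a generic illustration; the agent names and the `toy_metric` function are hypothetical stand-ins for running a real multi-agent pipeline on a benchmark.

```python
def loo_attribution(agents, evaluate):
    """Score each agent as full-team performance minus performance without it.

    `evaluate` maps a frozenset of agent names to a scalar metric.
    """
    team = frozenset(agents)
    full = evaluate(team)
    return {a: full - evaluate(team - {a}) for a in agents}


def toy_metric(coalition):
    # Hypothetical metric: the planner is required; the others add small gains.
    if "planner" not in coalition:
        return 0.0
    return 0.5 + 0.25 * ("coder" in coalition) + 0.125 * ("critic" in coalition)


scores = loo_attribution(["planner", "coder", "critic"], toy_metric)
bottleneck = max(scores, key=scores.get)
print(scores, bottleneck)
```

For n agents this needs n + 1 evaluations of the metric, whereas exact Shapley values require evaluating all 2^n coalitions, which is the cost gap the abstract refers to.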

[MA-9] Voluntary Collusion with Secret Tools in Competing LLM Agents

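[Quick Read]: This paper examines the following problem: even when a tool is explicitly described as unfair and harmful to others, current safety-aligned large language model (LLM) agents still voluntarily engage in secret collusion when it confers a strategic advantage. The key finding is that neither general alignment nor a simple "unfair" label reliably prevents this behavior; only explicit ethical framing significantly reduces adoption of the collusion tools. Smaller models remain susceptible to collusion even under ethical framing, which shows that preventing voluntary collusion in multi-agent systems requires dedicated safeguards rather than reliance on general alignment.

Link: https://arxiv.org/abs/2605.27593
Authors: Xijie Zeng, Frank Rudzicz
Affiliations: Dalhousie University; Vector Institute for Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: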

Abstract:Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar’s Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

[MA-10] You Only Align Once: Propagating Cooperative Behaviors in Multi-Agent Systems through Seed Agents

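[Quick Read]: This paper addresses the difficulty of ensuring consistent agent behavior in distributed open multi-agent systems, especially as populations grow and unaligned agents may be present. The key is a phenomenon the authors call Alignment Propagation: through natural language interaction alone, a single aligned seed agent can propagate cooperative behavior to untrained peers. The study validates this in the Red-Black Game, a team-based iterated Prisoner's Dilemma. By distilling a teacher model's cooperative reasoning and persuasive dialogues into Qwen-3-14B, the authors obtain a seed agent that, placed among four untrained teammates, raises the cooperation rate from 24.8% to 62.2%, outperforming both the teacher model and a vanilla Gemini-3.1-Pro. The seed agent also transfers zero-shot to Sugarscape, a spatial survival simulation, where it achieves a 91.5% trade success rate against a 21.6% baseline. The results reframe multi-agent alignment from per-agent training to a scalable social capability that can be engineered through strategic seed placement.

Link: https://arxiv.org/abs/2605.27586
Authors: Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang
Affiliations: Arcarae; University of California, Santa Cruz; Northwestern University; Johns Hopkins University
Categories: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: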

Abstract:Ensuring agent behaviors in distributed open multi-agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrained agents purely through natural language interaction, a phenomenon we term Alignment Propagation. We study this in the Red-Black Game, a team-based iterated Prisoner’s Dilemma in which teammates deliberate and vote to determine their team’s collective action. By distilling the cooperative reasoning and persuasive dialogues of a teacher model into a Qwen-3-14B, we obtain a seed agent that, when placed among four untrained teammates, doubles the cooperation rate from 24.8% to 62.2%, outperforming the teacher model and a vanilla Gemini-3.1-Pro. Remarkably, a seed trained exclusively on the RedBlack Game transfers zero-shot to Sugarscape, a spatially grounded survival simulation with pairwise trading, achieving a 91.5% trade success rate versus a 21.6% baseline. Our results reframe multi-agent alignment from an exhaustive per-agent training problem to a scalable social capability that can be engineered through strategic seed placement.

[MA-11] Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

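[Quick Read]: This paper seeks to explain puzzling aggregate behaviors in multi-stage large language model (LLM) pipelines: accuracy plateaus and reversals across rounds, multi-agent debate gains that fail to replicate on frontier models, degradation under intrinsic self-correction, and qualitative differences in debate dynamics across model providers. The key is to model a downstream agent's response as two coupled decisions: detection (whether to treat upstream content as authoritative) and conditional generation (what to produce if it does not). The decomposition yields four observable response regimes, of which detection-without-correction is the critical failure mode. Empirically, the conditional miscorrection rate is consistently dominant (53-94% across cohorts), while the detection rate varies with context by more than an order of magnitude. The authors attribute all four phenomena to this common mechanism and identify the detection threshold as a stable model- and protocol-level regularity that persists across methods at matched benchmark difficulty.

Link: https://arxiv.org/abs/2605.27559
Authors: Prashanti Nilayam, Kiran Ramanna, Prashil Tumbade
Affiliations: ServiceNow
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: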

Abstract:Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling aggregate behaviors: accuracy plateaus and reversals across rounds, non-replication of debate gains on contemporary frontier models, intrinsic self-correction degradation, and qualitative cross-provider divergence in debate dynamics. Downstream agent response can be operationalized as two coupled decisions: detection (whether to treat upstream content as authoritative) and conditional generation (what to produce if not). This decomposition yields four observable response regimes, of which detection-without-correction is the load-bearing failure mode. Across a nine-cell empirical grid spanning four model families, four benchmarks (GSM8K, MATH-500, GPQA-Diamond, AIME), and two methods (multi-agent debate, intrinsic self-correction), we find that the conditional miscorrection rate is consistently dominant (53-94% across cohorts) while detection rate varies contextually by more than an order of magnitude. The framework unifies the four phenomena above as signatures of a common mechanism and characterizes detection threshold as a stable model/protocol-level regularity that persists across methods at matched benchmark difficulty.
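
To see why a high conditional miscorrection rate matters even when detection works, a small piece of arithmetic helps. The sketch below is one possible reading of the two parameters, not the paper's definitions: it assumes the upstream answer is wrong, treats the detection rate as the probability that the downstream agent does not accept it, and treats the miscorrection rate as the probability that the replacement answer is also wrong. The abstract does not spell out its four regimes, so only this wrong-upstream case is covered, and the function name and example rates are hypothetical.

```python
def regime_shares(detection_rate, miscorrection_rate):
    """Split wrong-upstream cases into three outcomes (assumed reading)."""
    return {
        # upstream error accepted as authoritative
        "undetected": 1.0 - detection_rate,
        # error noticed, but the replacement is also wrong
        "detected_not_fixed": detection_rate * miscorrection_rate,
        # error noticed and actually repaired
        "detected_fixed": detection_rate * (1.0 - miscorrection_rate),
    }


shares = regime_shares(detection_rate=0.5, miscorrection_rate=0.75)
print(shares)
```

With these illustrative rates, half of the upstream errors are detected but only one in eight is repaired, so the final accuracy barely moves; under this reading, raising the detection rate alone cannot help much while miscorrection stays in the reported 53-94% range.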

[MA-12] From Task Allocation to Risk Clearing: A Unifying Interface for Mixed Human-Agent Societies

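[Quick Read]: This paper addresses the following problem: as humans, robots, and software agents increasingly share safety-critical environments, existing coordination frameworks struggle with dynamic uncertainty. Static task allocation cannot adapt to changing collaboration needs, and learned joint policies are opaque and hard to integrate with human decision-makers. The key to the solution is Risk-Aware Option Clearing (ROC), a unifying coordination mechanism whose basic coordination unit is an option (a temporally extended skill) exposed by an agent together with a risk summary that predicts the outcome distribution. A central clearinghouse assigns tasks by optimizing risk-adjusted mission utility under deadline and safety constraints. By embedding risk modeling at the option level, ROC outlines a scalable, transparent coordination infrastructure that integrates heterogeneous agents, and it sets out a research agenda for risk-aware coordination in future mixed human-agent societies.

Link: https://arxiv.org/abs/2605.27547
Authors: Vassilis Vassiliades
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: Presented at EMAS 2026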

Abstract:As humans, robots, and software agents increasingly share safety-critical environments, coordination must move from static task allocation to managing uncertain commitments. Existing frameworks fall short: they either assume rigid, static teams or learn opaque joint policies that are hard to adapt and difficult to integrate with human decision-makers. To overcome these limitations, we propose Risk-Aware Option Clearing (ROC), a unifying coordination mechanism in which agents expose options (temporally extended skills) paired with risk summaries that predict outcome distributions. A central clearinghouse then assigns tasks by optimizing risk-adjusted mission utility under deadlines and safety constraints. ROC is a family of mechanisms, ranging from deployments where the clearinghouse learns outcome models from data to ones that consume full distributional predictions from agents. By treating risk-aware options as the basic coordination unit, ROC sketches a scalable, transparent infrastructure for integrating heterogeneous agents into future mixed human–agent societies and outlines a research agenda for such risk-aware clearing layers.
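
A clearinghouse of the kind described above can be sketched as a small constrained assignment problem. Everything specific here is an assumption for illustration: the risk summary is reduced to a mean, a standard deviation, a duration, and a failure probability; risk adjustment is mean minus a multiple of the standard deviation; and the search is brute force over permutations, which only suits tiny instances. The paper presents ROC as a family of mechanisms and does not prescribe these choices.

```python
from itertools import permutations

# (agent, task) -> (expected utility, std of utility, duration, failure prob.)
# All numbers are hypothetical risk summaries exposed by the agents.
options = {
    ("human", "inspect"): (8.0, 1.0, 30, 0.01),
    ("human", "deliver"): (5.0, 1.0, 40, 0.01),
    ("robot", "inspect"): (9.0, 4.0, 20, 0.10),
    ("robot", "deliver"): (6.0, 0.5, 25, 0.02),
}


def risk_adjusted(opt, risk_aversion=1.0):
    """Mean-minus-deviation score (an assumed risk measure)."""
    mean, std, _, _ = opt
    return mean - risk_aversion * std


def feasible(opt, deadline=45, max_failure=0.05):
    """Deadline and safety constraints on a single option."""
    _, _, duration, p_fail = opt
    return duration <= deadline and p_fail <= max_failure


def clear(agents, tasks):
    """Pick the feasible one-task-per-agent assignment with the best total score."""
    best, best_score = None, float("-inf")
    for perm in permutations(tasks, len(agents)):
        pairs = list(zip(agents, perm))
        if not all(p in options and feasible(options[p]) for p in pairs):
            continue
        score = sum(risk_adjusted(options[p]) for p in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best, best_score


assignment, score = clear(["human", "robot"], ["inspect", "deliver"])
print(assignment, score)
```

The robot has the highest expected utility on the inspection task, but its failure probability violates the safety limit, so the clearinghouse gives inspection to the human and delivery to the robot.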

[MA-13] AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

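[Quick Read]: This paper addresses the coordination decisions that multi-agent systems built on large language models (LLMs) must make: which skill protocol to invoke, which agent role performs a subtask, which model to bind to each role, how roles interact, whether to use retrieval or verification, and when to skip a step. These decisions are interdependent, and static pipelines and one-off model comparisons cannot capture how they interact across task regimes and operational constraints. The core of the solution is to treat multi-agent coordination as an online policy-learning problem under partial observability: the AgensFlow framework makes coordination decisions observable and learnable from repeated trajectories instead of fixing them in a static pipeline. Experiments on distributed-systems incident tasks and security-advisory tasks show that learned routing reaches a higher-quality operating point than a fixed baseline on coordination-heavy classes, and that topology compression and warm-started policy graphs are effective, which supports the claim that learned, auditable routing can improve multi-agent workflows.

Link: https://arxiv.org/abs/2605.27466
Authors: Nicole Koenigstein
Affiliations: Independent researcher
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 7 pages, 4 figures, 4 tables. Code and reproducible evaluations available at: this https URL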

Abstract:Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

[MA-14] Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

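[Quick Read]: This paper addresses the weak regulatory outcomes in data service market governance that result from complex participants and diverse social elements, and in particular the limitation that existing research focuses on data-level performance and cannot adequately measure multiple heterogeneous entities or the integration of social elements. The key to the solution is a measurement and network analysis method for the data service market based on heterogeneous multi-agent modeling: service ecosystem theory is used to identify market participants and external factors, utility is measured for three-level entities based on value creation, and an analytical method is designed to assess precisely how heterogeneous networks affect utility, so that the market's complex structure and dynamic relationships can be characterized and regulated.

Link: https://arxiv.org/abs/2605.27433
Authors: Deyu Zhou, Yuwei Guo, Xudong Lu, Linhao Zhang, Wei Guo, Lizhen Cui
Affiliations: Shandong University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: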

Abstract:With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable development of the data service market are also growing. These factors include the widespread dissemination of information enhancing subjective consciousness, the continuous improvement in intelligence, and the complexification of structural relationships. To achieve effective governance and regulation of the data service market, it is crucial to conduct simulation experiments before making regulatory decisions. However, current research and analysis of the data service market primarily focus on data-level performance, proving inadequate when it comes to measurement and analysis of multiple heterogeneous entities and the integration of various social elements within the data service market. Based on this, this paper innovatively proposes a data service market measurement and network analysis method based on heterogeneous multi-agent modeling. By introducing the service ecosystem theory, we clarify the participants and external factors of the data service market and conduct utility measurements for three-level entities based on value creation. Furthermore, an analytical methodology is devised to precisely assess the influence of heterogeneous networks on utility. Finally, the paper verifies the effectiveness of the proposed method through the analysis of experimental results.

[MA-15] Speed-Weighted Adaptive Flocking for Sailing Swarms under Dynamic Environmental Forcing

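[Quick Read]: This paper addresses the "fast-slow coordination" problem that arises in swarms of autonomous sailing robots, whose speed and maneuverability change dynamically under wind propulsion. Conventional collective behavior models (such as aggregation and flocking) assume that robots can directly execute a desired speed and direction, but sailing robots are constrained by wind-dependent propulsion, restricted headings, and spatially varying wind fields, so their motion capabilities change with the environment and produce transient heterogeneity. The key to the solution is a speed-weighted social interaction rule built on the Couzin model, which adjusts each robot's social influence according to its transient motion constraints. Increasing the social influence of slower robots improves flock polarization and reduces close encounters; the effect comes from a balance between attraction to fast neighbors, which maintains movement, and cohesion around slow neighbors, which keeps the flock from fragmenting. The authors also introduce SailSwarmSwIM, a reduced-order simulator that captures wind-dependent speed and maneuverability, no-go zones, tacking behavior, and steady or gusty wind fields, giving a modeling framework for studying adaptive collective behavior in wind-driven robotic fleets.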

Link: https://arxiv.org/abs/2605.27422
Authors: Pranav Kedia,Aaron Gan,Hannah J. Williams,Andreagiovanni Reina,Heiko Hamann
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Submitted at 18th International Conference on the Simulation of Adaptive Behavior (SAB 2026)

Abstract:Collective behavior models, such as aggregation and flocking, usually assume self-propelled robots that can directly execute their desired speed and direction of motion without fundamental constraints. However, autonomous sailing robots violate this assumption. Their motion is shaped by wind-dependent propulsion, restricted headings, and spatially varying wind conditions. In particular, maneuverability is coupled to wind speed: in weak wind, sailboats may turn only slowly or not at all, whereas stronger wind enables faster turns. This introduces transient heterogeneity in speed and maneuverability across the flock. We focus on this fast-slow coordination problem in sailing robot flocks. To study this problem, we introduce SailSwarmSwIM, a reduced-order simulator for autonomous sailing robot swarms that captures wind-dependent speed and maneuverability, no-go zones, tacking behavior, and steady or gusty wind fields. To design our novel flocking technique, we start from the Couzin model and introduce a speed-weighted social interaction rule that accounts for each robot’s transient motion constraints. A key result is that increasing the social influence of slower robots improves polarization and reduces close encounters. This effect arises from a balance between attraction to fast neighbors, which helps maintain movement, and cohesion around slow neighbors, which prevents the flock from fragmenting. Together, our simulator, SailSwarmSwIM, and the speed-weighted interaction rule provide a modeling framework for studying adaptive collective behavior in robotic fleets whose motion capabilities are continuously shaped by wind.
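
The idea of letting slower robots carry more social influence can be illustrated with a small sketch. The abstract does not give the paper's actual weighting function, so the inverse-power weight `(speed + eps) ** (-alpha)` and the names `speed_weighted_heading`, `alpha`, and `eps` below are assumptions made for illustration only. The sketch covers just the alignment term of a Couzin-style model, not repulsion, attraction, or wind constraints.

```python
import math


def speed_weighted_heading(headings, speeds, alpha=1.0, eps=1e-6):
    """Return a speed-weighted circular mean of neighbour headings (radians).

    Assumed weighting (hypothetical, not taken from the paper):
        w_j = (speed_j + eps) ** (-alpha)
    alpha > 0 gives slower neighbours more influence;
    alpha = 0 recovers the unweighted circular mean.
    """
    if len(headings) != len(speeds):
        raise ValueError("headings and speeds must have the same length")
    x = 0.0
    y = 0.0
    for theta, v in zip(headings, speeds):
        w = (v + eps) ** (-alpha)
        # Sum weighted unit vectors instead of raw angles to avoid wrap-around.
        x += w * math.cos(theta)
        y += w * math.sin(theta)
    return math.atan2(y, x)


if __name__ == "__main__":
    # One slow neighbour heading east (0 rad), one fast neighbour heading north.
    unweighted = speed_weighted_heading([0.0, math.pi / 2], [0.1, 1.0], alpha=0.0)
    weighted = speed_weighted_heading([0.0, math.pi / 2], [0.1, 1.0], alpha=1.0)
    print(unweighted, weighted)
```

With `alpha > 0` the resulting heading is pulled toward the slow neighbour, which is the qualitative behaviour the abstract describes as cohesion around slow neighbours.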

[MA-16] APS: Bias-Controlled Adaptive Prototype Simulation for Population-Scale LLM Agents

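[Quick Read]: This paper targets the excessive computational cost of large-scale LLM-agent simulation: full multi-round simulation scales linearly with both population size and horizon, because every agent must query the large language model (LLM) at every round. The key to the solution is Adaptive Prototype Simulation (APS), which reframes scalable LLM-based simulation as a recurrent oracle-allocation problem: the LLM is retained as the online transition oracle, but only adaptive core prototypes, selected singleton-tail agents, and shadow-audit agents are queried. Prototype responses induce local response surfaces that approximate the behavior of nearby agents, which reduces online LLM calls. Approximation bias is controlled by shadow-audit residual correction, while tail-protected singleton routing directly queries selected isolated, heterogeneous, or high-curvature regions that are vulnerable to smoothing. In the theoretical analysis, APS is treated as an estimator of full-scale, high-precision individual social simulation, and its error is decomposed into prototype-coverage error, shadow-audit residual-correction error, local-propagation bias, and temporal context mismatch. In a 10M-agent, multi-round public-opinion simulation, APS achieves a 381.1-fold reduction over full simulation, with a reference-aligned final-round JSD of 0.094 against the full-LLM reference; under the reported protocols it also gives lower distributional discrepancy than scale-oriented and same-budget baselines.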

Link: https://arxiv.org/abs/2605.27419
Authors: Quan Zheng,Yan Gao,Shaobin He,Haoxiang Guan,Yuanhe Tian,Jie Feng,Ming Wang,Shuxin Zheng,Zhen Liu
Affiliations: Beijing Normal University (北京师范大学); Zhongguancun Academy (中关村学院); Zhongguancun Institute of Artificial Intelligence (中关村人工智能研究院)
Categories: Multiagent Systems (cs.MA); Computers and Society (cs.CY)
Comments: 32 pages, 5 figures

Abstract:LLM-agent simulation offers a flexible computational tool for studying population response trajectories that depend on scenario events, memory, demographics, and evolving social context. However, full multi-round simulation scales linearly with both population size and horizon, requiring every agent to query the LLM at every round. We propose Adaptive Prototype Simulation (APS), a framework that reframes scalable LLM-based simulation as a recurrent oracle-allocation problem. APS retains the designated LLM as the online transition oracle while querying adaptive core prototypes, selected singleton-tail agents, and shadow-audit agents. Prototype responses induce local response surfaces for nearby agents, reducing online LLM calls without replacing the underlying transition model. To control approximation bias, shadow-audit residual correction estimates propagation residuals for aggregate correction and future budget allocation, while tail-protected singleton routing directly queries selected isolated, heterogeneous, or high-curvature regions that are vulnerable to smoothing. Theoretically, we treat APS as an estimator for full-scale high-precision individual social simulation and decompose its errors into prototype-coverage error, shadow-audit residual-correction error, local-propagation bias, and temporal context mismatch. Under the reported protocols, APS gives lower reference-aligned distributional discrepancy than scale-oriented and same-budget baselines while reducing online LLM calls, with ablations and compact robustness checks diagnosing the main bias-control mechanisms. In a 10M-agent, multi-round public-opinion simulation, APS achieves a 381.1-fold reduction over full simulation, with reference-aligned final-round JSD of 0.094 against the corresponding full-LLM reference.
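
The JSD figure quoted above is a Jensen-Shannon divergence between two response distributions. As a minimal sketch of how such a number is computed, the function below evaluates the standard JSD for two discrete distributions. The base-2 logarithm (which bounds the value in [0, 1]) and the three-category opinion example are assumptions; the abstract does not state which base or support the authors use.

```python
import math


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions.

    p and q are sequences of probabilities over the same ordered support.
    With base-2 logarithms (an assumption here) the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Hypothetical final-round opinion shares (oppose, neutral, support).
    full_reference = [0.30, 0.25, 0.45]
    approximation = [0.28, 0.30, 0.42]
    print(jensen_shannon_divergence(full_reference, approximation))
```

Identical distributions give 0 and distributions with disjoint support give 1, so smaller values mean the approximate simulation tracks the reference distribution more closely.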

[MA-17] Differentiable Model Predictive Safety for Heterogeneous Mobility at Urban Intersections

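[Quick Read]: This paper addresses the safety problem of coordinating autonomous vehicles and mobile robots at unregulated urban intersections, in particular heterogeneous agents with disparate dynamics. The key to the solution is a framework called differentiable model predictive safety (DMPS), which embeds the foresight of model predictive control (MPC) into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories conditioned on their own actions, and a learned, differentiable safety critic evaluates the risk of those trajectories. By backpropagating through the entire unrolled predictive model, an agent can efficiently compute the gradient of future safety with respect to its current action, which enables a minimal and precise online safety correction. In high-density, mixed vehicle-robot traffic simulations, the method reduces the collision rate to below 5.6% while, according to the authors, preserving energy and traffic efficiency.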

Link: https://arxiv.org/abs/2605.27418
Authors: Wenzhe Song,Hao Zhang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: 6 pages. Published in IEEE IARCE 2025

Abstract:The imminent integration of autonomous vehicles and mobile robots in urban settings presents a critical safety challenge for future intelligent transportation systems. This paper addresses the complex problem of coordinating heterogeneous agents with disparate dynamics at unregulated intersections. We introduce a novel framework, differentiable model predictive safety (DMPS), which embeds the foresight of model-predictive control into a data-driven, end-to-end reinforcement learning architecture. DMPS agents learn a latent dynamics model to predict future trajectories contingent on their actions. A learned, differentiable safety critic then evaluates the risk of these trajectories. Crucially, by leveraging backpropagation through the entire unrolled predictive model, agents can efficiently compute the gradient of future safety with respect to their current action, enabling a minimal and precise online safety correction. Integrated into a multi-agent training scheme, DMPS virtually eliminates collisions to less than 5.6% in high-density, mixed vehicle-robot traffic simulations, demonstrating state-of-the-art safety without compromising energy and traffic efficiency.

[MA-18] Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning

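[Quick Read]: This paper addresses the following problem: the rapid uptake of generative AI in higher education is reshaping assessment, yet there is little formal understanding of how collective norms of responsible or opportunistic AI use emerge and stabilise within student cohorts, which helps explain why policy statements alone often fail to change behaviour. The key to the solution is a coordination-based evolutionary game-theoretic framework that treats student AI use as a coordination problem shaped by peer expectations and assessment design rather than as individual compliance. Institutional governance is modelled implicitly through reflective assessment incentives, and analytical results together with finite-population simulations reveal non-linear, threshold-driven behavioural transitions: small, well-calibrated changes in assessment incentives can trigger a shift from opportunistic to learning-oriented norms. This gives institutions a pedagogy-led tool for AI governance that does not rely on surveillance or punitive enforcement.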

Link: https://arxiv.org/abs/2605.27400
Authors: Ndidi Bianca Ogbo,Zhao Song,Shatha Ghareeb,The Anh Han
Affiliations: Teesside University (提塞德大学)
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Abstract:The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns around academic integrity, fairness, and learning quality. While institutional responses increasingly emphasise policy guidance and ethical principles, there remains limited formal understanding of how collective norms of responsible or opportunistic AI use emerge and stabilise within student cohorts. This paper reframes student AI use in assessment as a coordination problem shaped by peer expectations and assessment design rather than individual compliance alone. We develop a coordination-based evolutionary game-theoretic framework that captures learning value, effort, perceived fairness, and transparency, with institutional AI governance modelled implicitly through reflective assessment incentives. We use analytical results and finite-population simulations to reveal threshold-driven behavioural transitions in student AI use: small, well-calibrated changes in reflective assessment incentives can trigger rapid shifts towards responsible, learning-oriented AI-use norms, whereas weak or misaligned incentives allow opportunistic practices to persist. These non-linear dynamics explain why policy statements alone often fail to change behaviour, while modest assessment redesigns can have disproportionate effects. By providing a mechanism-level account of how assessment structures shape collective AI-use practices, this work offers higher education institutions an analytically grounded tool for Future Facing Learning, supporting proportionate, pedagogy-led AI governance without reliance on surveillance or punitive enforcement. 
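
The threshold behaviour described above can be illustrated with a toy two-strategy coordination game under replicator dynamics. This is a sketch under stated assumptions, not the paper's model: the payoff forms `a * x + b` (responsible use) and `c * (1 - x)` (opportunistic use), the parameter names, and the deterministic infinite-population dynamics are all hypothetical, whereas the paper reports finite-population simulations with a richer payoff structure.

```python
def tipping_point(a, c, b):
    """Unstable interior equilibrium x* of the toy replicator dynamics.

    x is the fraction of students using AI responsibly.
    Assumed payoffs (hypothetical): responsible = a * x + b,
    opportunistic = c * (1 - x). Setting them equal gives x*.
    """
    return (c - b) / (a + c)


def simulate(x0, a, c, b, dt=0.01, steps=20000):
    """Integrate dx/dt = x (1 - x) (payoff_R - payoff_O) with Euler steps."""
    x = x0
    for _ in range(steps):
        gain = (a * x + b) - c * (1.0 - x)
        x += dt * x * (1.0 - x) * gain
        x = min(1.0, max(0.0, x))  # keep the fraction inside [0, 1]
    return x


if __name__ == "__main__":
    # Same starting cohort (40% responsible), two incentive levels b.
    for incentive in (0.0, 0.3):
        print(incentive, tipping_point(1.0, 1.0, incentive),
              simulate(0.4, 1.0, 1.0, incentive))
```

In this toy setting a cohort that starts below the tipping point drifts toward opportunistic use, and a modest increase in the reflective incentive `b` lowers the tipping point enough for the same cohort to converge on responsible use instead.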

[MA-19] Human-AI Collaboration for Estimating Scientific Replicability

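[Quick Read]: This paper addresses the difficulty of assessing scientific replicability, that is, predicting whether a published finding will be corroborated by a controlled replication study. Existing approaches rely either on human expert judgment or on machine learning models trained on paper content metadata, and both have clear limitations: the former can be influenced by cognitive biases and narrow exposure to the literature, while the latter struggles to capture contextual cues and subtle signals of credibility. The proposed solution is a hybrid prediction market in which algorithmic agents (trained on outcomes from hundreds of prior replication studies) trade alongside human participants in real time to jointly estimate the likelihood that a finding will be corroborated. Live experiments show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.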

Link: https://arxiv.org/abs/2605.27394
Authors: Tatiana Chakravorti,Robert Fraleigh,Timothy Fritton,Christopher Griffin,Vaibhav Singh,Sai Koneru,C. Lee Giles,David Pennock,Anthony Kwasnica,Sarah Rajtmajer
Affiliations: The Pennsylvania State University (宾夕法尼亚州立大学); Rutgers University (罗格斯大学)
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments:

Abstract:Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Existing approaches for replicability assessment typically rely either on human judgment, i.e., creative assembly of human experts, or on machine learning models trained on paper content metadata. While both approaches have demonstrated value, each also has important limitations. Human forecasts can be influenced by cognitive biases and narrow exposure to the research literature, while automated assessments often struggle to capture contextual cues and subtle signals of credibility. In this paper, we examine a hybrid approach. Specifically, we introduce a hybrid prediction market in which algorithmic agents trade alongside human participants to jointly estimate the likelihood that a published scientific finding will be corroborated via the outcome of a controlled replication study. Agents are trained on outcomes from hundreds of prior replication studies while human participants contribute domain knowledge through real-time trading. We evaluate this hybrid approach through multiple live experiments involving participants from different academic disciplines and compare its performance to artificial-only and human-only baselines. Our results show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.
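
To make concrete how a prediction market turns trades into a probability estimate, the sketch below uses a logarithmic market scoring rule (LMSR) for a binary "replicates / does not replicate" question. The abstract does not say which market mechanism the authors use, so LMSR is a swapped-in, well-known technique chosen purely for illustration, and the liquidity parameter `b` and function names are assumptions.

```python
import math


def lmsr_cost(q_yes, q_no, b):
    """LMSR cost function C(q) = b * log(exp(q_yes / b) + exp(q_no / b)).

    Computed with the max subtracted first for numerical stability.
    """
    m = max(q_yes, q_no)
    return m + b * math.log(math.exp((q_yes - m) / b) + math.exp((q_no - m) / b))


def lmsr_price_yes(q_yes, q_no, b):
    """Current market probability that the finding will replicate."""
    return 1.0 / (1.0 + math.exp((q_no - q_yes) / b))


def buy_yes(q_yes, q_no, b, shares):
    """Return (cost paid by the trader, new q_yes) after buying YES shares."""
    cost = lmsr_cost(q_yes + shares, q_no, b) - lmsr_cost(q_yes, q_no, b)
    return cost, q_yes + shares


if __name__ == "__main__":
    q_yes, q_no, b = 0.0, 0.0, 10.0
    print(lmsr_price_yes(q_yes, q_no, b))      # market opens undecided
    cost, q_yes = buy_yes(q_yes, q_no, b, 10.0)
    print(cost, lmsr_price_yes(q_yes, q_no, b))
```

Under this mechanism any trader, human or algorithmic, who believes replication is more likely than the current price can buy YES shares, and the price moves toward the aggregate belief of all participants.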

[MA-20] OralAgent: Integrating Reasoning Tools and Knowledge for Interactive Dental Image Analysis

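[Quick Read]: This paper addresses the following problem: current dental AI models are mostly designed for specific tasks or individual imaging modalities, and their isolated designs limit practical use in real-world clinical workflows. The key to the solution is OralAgent, which the authors describe as the first dental-specialized AI agent. It unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework and integrates 22 visual analysis tools and 368 classical dental textbooks. The authors also build OralCorpus, a large-scale bilingual textual resource of 134.8M tokens for retrieval-augmented generation (RAG), and OralQA-ZH, a Chinese multiple-choice benchmark of 798 items across eleven oral subspecialties for evaluating multidisciplinary dental knowledge. Experiments show that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks.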

Link: https://arxiv.org/abs/2605.27378
Authors: Jing Hao,Siyuan Dai,Yongxin Zhang,Yuci Liang,Jiamin Wu,Jiahao Bao,Yuxuan Fan,Zanting Ye,Yanpeng Sun,Xinyu Zhang,Ming Hu,Liang Zhan,James Kit Hon Tsoi,Linlin Shen,Junjun He,Kuo Feng Hung
Affiliations: The University of Hong Kong (香港大学); University of Pittsburgh (匹兹堡大学); Shenzhen University (深圳大学); Shanghai Ninth People’s Hospital (上海第九人民医院); Nanyang Technological University (南洋理工大学); Southern Medical University (南方医科大学); Singapore University of Technology and Design (新加坡科技设计大学); University of Auckland (奥克兰大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: 14 pages, 7 figures, 6 tables

Abstract:Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models’ multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at this https URL.

Natural Language Processing

[NLP-0] PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective

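[Quick Read]: This paper addresses the following problem: evaluations of parameter-efficient finetuning (PEFT) largely emphasize downstream accuracy and overlook the retention of pretrained capabilities. The authors argue that PEFT should be assessed through the stability-plasticity dilemma, the trade-off between target-task adaptation and resistance to forgetting. The key to the solution is PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. The differences between methods are analyzed from two geometric perspectives: in weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure; in activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. The study also finds that final supervised finetuning (SFT) checkpoints often overshoot a better target-retention operating point, which motivates case studies of a post-hoc improvement based on path-wise rewinding.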

Link: https://arxiv.org/abs/2605.28819
Authors: Yangyi Huang,Ruotian Peng,Zeju Qiu,Jiale Kang,Yandong Wen,Bernhard Schölkopf,Weiyang Liu
Affiliations: The Chinese University of Hong Kong; Westlake University; MPI for Intelligent Systems
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Technical report v1 (28 pages, 9 figures, project page: this https URL )

Abstract:Parameter-efficient finetuning (PEFT) has become the standard approach for adapting large language models, yet evaluations largely emphasize downstream accuracy while overlooking the retention of pretrained capabilities. We argue that PEFT should be assessed through the stability-plasticity dilemma: the trade-off between target-task adaptation and resistance to forgetting. We introduce PEFT-Arena, a benchmark that jointly measures downstream performance and general capability retention. Across methods, we find distinct stability-plasticity profiles; under comparable parameter budgets, orthogonal finetuning achieves the most favorable Pareto frontier. To explain these differences, we analyze PEFT updates from two geometric perspectives. In weight space, spectral analysis reveals how parameterizations interact with the pretrained singular-value structure. In activation space, retention metrics show whether finetuning preserves or distorts general-capability representations, with forgetting linked to non-isometric representation distortion. Finally, an analysis shows that final SFT checkpoints often overshoot a better target-retention operating point. Inspired by this, we present case studies of a post-hoc improvement with path-wise rewinding.

[NLP-1] VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

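[Quick Read]: This paper asks whether vision-language learning makes the text representations of large language models (LLMs) closer to human processing during natural reading. The key to the approach is a controlled in silico framework: tightly matched LLM and vision-language model (VLM) pairs are compared under a strictly text-only setting, which isolates the effect of multimodal training history from online visual input or cross-modal fusion, and alignment is evaluated on a human natural-reading dataset with whole-cortex fMRI responses and synchronized eye-tracking saccades. The study finds that multimodal pretraining may not confer a uniform, global advantage in human alignment, but that a VLM advantage can emerge when sentences contain stronger visual semantic content, which suggests that multimodal pretraining contributes selectively rather than globally to human-like language representations.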

Link: https://arxiv.org/abs/2605.28818
Authors: Jinzhou Wu,Zhengwu Ma,Jixing Li,Baoping Tang,Zitong Lu
Affiliations: Chongqing University (重庆大学); City University of Hong Kong (香港城市大学); Massachusetts Institute of Technology (麻省理工学院)
Categories: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: 17 pages, 10 figures

Abstract:Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.

[NLP-2] Self-Improving Language Models with Bidirectional Evolutionary Search

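[Quick Read]: This paper addresses two fundamental limitations of the search methods used in self-improving language models and agentic systems: search is guided by sparse verification signals, and candidates are constructed primarily through autoregressive expansion, which restricts exploration to regions with substantial model probability mass. The key to the solution is Bidirectional Evolutionary Search (BES), which has two parts. The forward search augments standard expansion with evolution operators that recombine partial trajectories, generating candidates that are difficult to obtain from a single model rollout. The backward search recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides the forward search. The theoretical motivation shows that candidates from expansion-only search are confined to a narrow entropy shell that evolutionary operators can escape, and that backward search can exponentially reduce the number of samples required to find a correct answer. In experiments, BES yields consistent gains on challenging post-training tasks where mainstream post-training algorithms fail to improve, and at inference time it outperforms existing open-source frameworks in both average and best-case performance on three open problem solving benchmarks.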

Link: https://arxiv.org/abs/2605.28814
Authors: Guowei Xu,Zhenting Qi,Huangyuan Su,Weirui Ye,Himabindu Lakkaraju,Sham M. Kakade,Yilun Du
Affiliations: Harvard University (哈佛大学); MIT (麻省理工学院)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Search has been proposed as an effective method for self-improving language models and agentic systems, both for post-training sample generation and for inference. However, widely used methods such as best-of-N sampling and tree search face two fundamental limitations: they are guided by sparse verification signals, and they construct candidates primarily through autoregressive expansion, restricting exploration to regions with substantial model probability mass. To address these, we propose Bidirectional Evolutionary Search (BES), a search framework that couples forward candidate evolution with backward goal decomposition. In the forward search, BES augments standard expansion with evolution operators that recombine partial trajectories to generate candidates that are difficult to obtain from a single model rollout. In the backward search, BES recursively decomposes the original task into checkable subgoals, producing dense intermediate feedback that guides forward search. We provide theoretical motivation showing that candidates generated by expansion-only search are confined to a narrow entropy shell while evolutionary operators can escape it, and that backward search can exponentially reduce the number of required samples to find a correct answer. Experiments show that on challenging post-training tasks where mainstream post-training algorithms fail to improve, BES enables consistent gains, and on three open problem solving benchmarks at inference time, BES outperforms existing open-source frameworks in both average and best-case performance. Code and trained models are available at this https URL.

[NLP-3] OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration ICML2026

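[Quick Read]: This paper addresses the need for reliable, fine-grained verification of visual outcomes in multimodal large language models (MLLMs), which the authors consider essential for scaling generalist foundation models. The key to the solution is multimodal meta-verification, which uses verifier-generated rationales rather than decision-only signals, and rests on two findings: 1) symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, which enables efficient rule-based reinforcement learning rewards and avoids reliance on model-based rewards from auxiliary judge models; 2) decoupling the reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, because the two differ intrinsically in output structure and learning dynamics. Building on these findings, the authors train OmniVerifier-M1, a generalist visual verifier that provides robust verification and fine-grained error localization, and that further enables M1-TTS, a verifier-driven agentic generation system with dynamic region-level self-correction.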

Link: https://arxiv.org/abs/2605.28805
Authors: Xinchen Zhang,Bowei Liu,Jiale Liu,Chufan Shi,Yizhen Zhang,Junhong Liu,Youliang Zhang,Zhiheng Li,Yujiu Yang,Ling Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICML 2026. Project: this https URL

Abstract:Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.

[NLP-4] Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization

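[Quick Read]: This paper asks whether large language models (LLMs) can learn and reproduce annotator-specific label-explanation behavior, which extends the study of human label variation (HLV) from label disagreement alone to the reasoning and preferences behind annotators' decisions. The key to the solution is cross-annotator preference optimization (CAPO), which contrasts a target annotator's response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, that supervised fine-tuning (SFT) better captures annotator-specific behavior, and that CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns, pointing toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.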

Link: https://arxiv.org/abs/2605.28802
Authors: Beiduo Chen,Pingjun Hong,Ziyun Zhang,Benjamin Roth,Anna Korhonen,Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; University of Vienna, Austria; LTL, University of Cambridge, United Kingdom
Categories: Computation and Language (cs.CL)
Comments: 43 pages, 20 figures

Abstract:Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators’ decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each – natural language inference and paraphrase judgment – we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator’s response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.

[NLP-5] Skill-Conditioned Gated Self-Distillation for LLM Reasoning

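[Quick Read]: This paper asks how on-policy self-distillation (SD) can still improve the reasoning of large language models (LLMs) without trusted privileged information (PI) such as reference answers or successful traces. Existing methods usually assume that PI is trustworthy; here PI comes instead from an experience-derived skill bank whose retrieved skills may be irrelevant or misleading. The key to the solution is Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based self-distillation as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets every skill-conditioned teacher score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. On multiple mathematical reasoning benchmarks, SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption.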

Link: https://arxiv.org/abs/2605.28791
Authors: Jiazhen Huang,Xiao Chen,Xiao Luo,Yong Dai,Senkang Hu,Yuzhi Zhao
Affiliations: Tsinghua University (清华大学); Fudan University (复旦大学); City University of Hong Kong (香港城市大学); Huazhong University of Science and Technology (华中科技大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher’s polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at this https URL.

[NLP-6] Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay

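[Quick Read]: This paper addresses the following problem: how well large language models (LLMs) handle discourse particles (such as "well" and "kind of" in English) is still poorly understood, and the few existing studies focus on high-resource languages such as English, with little attention to Southeast Asian languages. The solution has two parts: (1) MalayPrag, a benchmark for systematically evaluating how LLMs handle discourse particles in colloquial Malay; and (2) five linguistically grounded attributes that form a unified framework for interpreting the pragmatic functions of discourse particles. Experiments with ten off-the-shelf LLMs on three prediction tasks reveal substantial difficulty in connecting Malay discourse particles with their pragmatic functions; supplying the five attributes significantly improves these connections, which highlights the need for structured scaffolding of models' pragmatic competence.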

Link: https://arxiv.org/abs/2605.28782
Authors: Mariah Al Giptiah Binte Yusoff,Jakin Tan,Bocheng Chen,Guangliang Liu,Xi Chen
Affiliations: Nanyang Technological University (南洋理工大学); University of Mississippi (密西西比大学); Indiana University (印第安纳大学)
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs’ capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs’ capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models’ pragmatic competence.

[NLP-7] The Abstraction Gap in Vision-Language Causal Reasoning

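[Quick Read]: This paper addresses the following problem: vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. The key to the solution is a dual-probe methodology: the Text-Only Probe measures linguistic quality, the Chain-Text Probe requires models to first generate explicit causal chains, and the Abstraction Gap (AG) metric quantifies the normalized performance difference between the two. On CAGE, a benchmark of 49,500 questions across 5,500 images spanning Pearl's causal hierarchy, seven of the eight evaluated VLMs exhibit AG above 0.50, and fine-tuning on 45,000 chain-annotated examples fails to close the gap. One model achieves near-zero AG, which shows that the capability exists within current VLM architectures and depends on pretraining and architectural choices.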

Link: https://arxiv.org/abs/2605.28779
Authors: Chinh Hoang,Mohammad Rashedul Hasan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl’s causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6–8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
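
The abstract describes the Abstraction Gap only as a "normalized performance difference" between the two probes and does not give a formula. One plausible reading, used here purely as an assumption, is to normalize the drop from the Text-Only score to the Chain-Text score by the Text-Only score. The function name `abstraction_gap` and the example scores are hypothetical.

```python
def abstraction_gap(text_score, chain_score):
    """Assumed form of the Abstraction Gap (not confirmed by the abstract):

        AG = (text_score - chain_score) / text_score

    AG near 0 means the chain probe scores as well as the text probe;
    AG near 1 means fluent text with almost no valid causal chain behind it.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


if __name__ == "__main__":
    # Hypothetical probe scores in the ranges quoted in the abstract.
    for text, chain in [(6.0, 2.5), (8.0, 2.5), (7.0, 7.0)]:
        print(text, chain, abstraction_gap(text, chain))
```

Under this assumed definition, text scores of 6 to 8 paired with chain scores below 2.5 give AG above 0.5, which is consistent with the figures reported in the abstract, although other normalizations would also be compatible with them.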

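[NLP-8] Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

Quick read: The paper addresses calibration of the confidence that large language models (LLMs) express in words, namely whether the epistemic markers a model uses (e.g., "it is likely...") reflect its intrinsic uncertainty in a stable, consistent way. The key to the approach is to formalize and quantify marker internal confidence (MIC), the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain, and to evaluate the stability of MICs within and across distributions using 7 metrics. The study finds that LLMs remain markedly miscalibrated even when marker meanings are interpreted in a model-centric way: they struggle to differentiate markers by confidence level across distributions, although they keep a roughly stable ranking order across tasks. This points to the need for more consistent and stable marker use to improve trustworthiness and reliability.

Link: https://arxiv.org/abs/2605.28778
Authors: Gabrielle Kaili-May Liu, Arman Cohan
Affiliations: Yale University
Categories: Computation and Language (cs.CL)
Comments: Code: this https URL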

Abstract:LLMs’ linguistically expressed confidence should faithfully reflect their intrinsic uncertainty. While recent work shows LLMs struggle to use epistemic markers (e.g., “it is likely…”) in a human-aligned fashion, it remains unclear whether models can apply their own linguistic confidence framework to associate markers with specific confidence levels in a stable and generalizable way, and how contextual features impact this ability. We conduct the first systematic study of this question, formalizing marker internal confidence (MIC) as the estimated intrinsic confidence a model associates with a specific epistemic marker in a given task domain. We present 7 metrics to evaluate the stability of MICs within and across distributions. Applying our analysis framework to diverse models and tasks, we find that LLMs remain faithfully miscalibrated even under model-centric interpretation of marker meanings, struggling to differentiate markers by internal confidence across distributions despite preserving a somewhat consistent ranking order across tasks. This supplies critical, complementary evidence to existing work toward a holistic understanding of faithful calibration in LLMs, emphasizing the need for more aligned and stable marker use to improve trustworthiness and reliability.

[NLP-9] Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

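Quick read: The paper addresses how to efficiently specialize small general-purpose computer-use agents (CUAs) to a specific software domain while overcoming the limits of existing data-synthesis and training strategies. Directly synthesizing large-scale training data for the target domain yields only marginal gains, and small models are weaker and fail unevenly across domains. The key to the solution is LearnWeak, an annotation-free specialization framework with two core ideas: (1) a stronger reference agent automatically identifies the student agent's weaknesses in the target domain, and targeted tasks and supervision are generated from them; (2) an error-aware specialization objective separates planning errors from execution errors, enabling more precise behavioral updates. On OSWorld, LearnWeak improves over EvoCUA-8B and OpenCUA-7B by 11.6 and 11.1 percentage points on average across eight domains, and it outperforms existing autonomous trajectory generation and training baselines. The results highlight the role of student awareness in both data synthesis and training.

Link: https://arxiv.org/abs/2605.28775
Authors: Suji Kim, Kangsan Kim, Sung Ju Hwang
Affiliations: KAIST; Samsung Electronics; DeepAuto.ai
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: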

Abstract:Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student’s weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.

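[NLP-10] Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Quick read: The paper addresses the following problem: vision-language models with extended internal reasoning (thinking) do well on complex tasks, but many real-world problems require external tools, and standard reinforcement learning recipes such as GRPO learn tool use poorly, so tool calls are attempted rarely and often fail. The key to the solution is AXPO (Agent eXplorative Policy Optimization): (1) for each tool-using subgroup whose rollouts are all wrong, it fixes the thinking prefix and resamples the tool call and its continuation; (2) it pairs this with uncertainty-based prefix selection to make exploration more efficient. The method mitigates the Thinking-Acting Gap: across nine multimodal benchmarks, SFT+AXPO outperforms SFT+GRPO on average, and the 8B model with SFT+AXPO surpasses the 32B Base model on Pass@4 with a quarter of the parameters.

Link: https://arxiv.org/abs/2605.28774
Authors: Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL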

Abstract:Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary acting). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with 4 times fewer parameters.
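
The "all-wrong tool-using subgroup" condition can be illustrated with a short sketch. It covers only the detection step, under the assumption that each rollout carries two flags; the `Rollout` type and both function names are hypothetical, and the prefix fixing, resampling, and uncertainty-based prefix selection that AXPO adds are not shown.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    used_tool: bool  # did this rollout attempt a tool call?
    correct: bool    # was the final answer judged correct?


def all_wrong_tool_subgroup(group: list[Rollout]) -> bool:
    """True when at least one rollout used a tool and none of those was correct."""
    tool_rollouts = [r for r in group if r.used_tool]
    return bool(tool_rollouts) and not any(r.correct for r in tool_rollouts)


def questions_to_resample(groups: dict[str, list[Rollout]]) -> list[str]:
    """Question ids whose tool-using subgroup is all-wrong (hypothetical selection step)."""
    return [qid for qid, group in groups.items() if all_wrong_tool_subgroup(group)]
```

A group in which no rollout calls a tool is deliberately not flagged here, since there is no tool call to resample from.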

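[NLP-11] Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Quick read: The paper asks whether, in reinforcement learning (RL) for competitive programming, weight averaging over already-trained checkpoints can produce new model configurations that are useful at inference time without additional RL training. The core challenge is to verify whether extrapolative weight averaging can go beyond the trained interval and yield better or complementary policies. The key to the approach is to use nested unit-test coverage as the reward and train checkpoints at different coverage levels from a shared initialization, which exposes a correctness-efficiency frontier. Linear interpolation recovers this frontier, and extrapolation extends it beyond the trained endpoints. The frontier appears across three inference settings (pure reasoning, tool use, agentic coding) and two model scales (7B and 32B). Ensembles built with extrapolative weight averaging improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at a matched sample budget, showing that the frontier induced by nested test coverage can be navigated and exploited through extrapolative weight averaging.

Link: https://arxiv.org/abs/2605.28751
Authors: Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen, Benjamin Negrevergne, Gabriel Synnaeve
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 54 pages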

Abstract:Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
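
Interpolation and extrapolation between two checkpoints can be sketched with the usual linear form, theta(alpha) = low + alpha * (high - low). The abstract does not state the exact parameterization used, so this form, the plain dict-of-lists checkpoint layout, and the name `blend_checkpoints` are assumptions for illustration.

```python
def blend_checkpoints(
    low: dict[str, list[float]],
    high: dict[str, list[float]],
    alpha: float,
) -> dict[str, list[float]]:
    """Return low + alpha * (high - low), parameter by parameter.

    0 <= alpha <= 1 interpolates between the two checkpoints;
    alpha < 0 or alpha > 1 extrapolates beyond them.
    """
    if low.keys() != high.keys():
        raise ValueError("checkpoints must share the same parameter names")
    blended = {}
    for name in low:
        if len(low[name]) != len(high[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        blended[name] = [l + alpha * (h - l) for l, h in zip(low[name], high[name])]
    return blended
```

Setting `alpha` to 0.5 gives the midpoint of a low-coverage and a high-coverage checkpoint, while a value such as 1.5 moves past the high-coverage endpoint along the same direction.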

[NLP-12] Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

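Quick read: The paper addresses stance detection in prediction market commentary, in particular how to extract a trader's stance toward the market outcome from comments that are extremely short, full of trader-specific vernacular, and severely imbalanced (only 8.7% of comments oppose the market outcome). The key points of the solution are: first, adding market context as an input feature markedly raises Anti recall in the 3-class setting (from 0.10 to 0.45); second, synthetic minority-class samples generated through LLM-driven counterfactual flips are conditionally effective, improving Anti F1 in weak configurations (0.10 → 0.24) while hurting strong ones (2-class with context, macro F1 from 0.68 to 0.50); third, a 50% synthetic ratio is the best dose, while 100% synthetic data consistently lowers performance. Attention-based interpretability analysis supports all three findings.

Link: https://arxiv.org/abs/2605.28745
Authors: Thomas Mbrice
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 9 figures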

Abstract:Prediction markets such as Polymarket aggregate crowd beliefs into real-time probability estimates, and the comments traders post beneath each market contain rich directional stance signals that prices alone cannot capture. This work introduces the first stance detection study applied to prediction market commentary, a domain characterized by extreme brevity, trader-specific vernacular, and severe class imbalance (only 8.7% of comments oppose the market outcome). RoBERTa-base is fine-tuned across a 4 x 3 ablation: four input configurations (2-class, 3-class x with/without market context) and three augmentation conditions (baseline, 50% synthetic, 100% synthetic). Synthetic minority-class samples are generated via LLM-driven Pro → Anti counterfactual flips using the Anthropic API. Results show that (1) market context is the single most impactful factor, raising 3-class Anti recall from 0.10 to 0.45; (2) counterfactual augmentation is conditionally effective, improving Anti F1 in weak configurations (0.10 → 0.24) while degrading strong ones (2-class-ctx macro F1: 0.68 → 0.50 at full dose); and (3) 50% augmentation is the optimal dose, with 100% consistently hurting performance. Attention-based interpretability analysis provides mechanistic support for all three findings.

[NLP-13] Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

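Quick read: The paper addresses the lack of fine-grained uncertainty quantification (UQ) for large language models on clinical text, in particular the inability to localize uncertainty at the token or span level. Existing UQ methods are designed mainly for open-domain generation and adapt poorly to long clinical text. The key to the solution is the Reverse Probing framework: instead of sampling new outputs, it treats the clinical text as a probe into the model's internal state and estimates token-level uncertainty from four categories of internal activation features, using pre-existing labeled summaries. On two expert-annotated clinical datasets it outperforms eight adapted baselines on all metrics, with up to 4 times higher AUPRC, while reducing inference time and computational cost. Feature analysis shows that delta energy and neighborhood context are the most consistent predictors across models, which offers interpretable insight into how models respond to unsupported clinical content.

Link: https://arxiv.org/abs/2605.28740
Authors: Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang
Affiliations: University of Florida; U.S. National Library of Medicine
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: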

Abstract:As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model’s internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

[NLP-14] Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

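Quick read: The paper addresses the lack of a unified, reliable way to measure how coding-specialized models refuse malicious requests, so it is currently impossible to tell whether they clear the higher refusal bar that their potential for harm (returning directly runnable malicious code) implies relative to general-purpose language models. Existing benchmarks are fragmented: they mix requests for executable malicious code (e.g., keyloggers, ransomware) with requests for harmful security knowledge (e.g., descriptions of attack techniques), and they report over non-comparable corpora. The key to the solution is a standardized prompt bank labeled under a five-judge consensus protocol, which strictly separates the two request types and provides a reliability-quantified basis (Fleiss’ kappa = 0.767) for cross-corpus comparison of coding-model compliance. The release contains 4,748 consensus-CODE prompts and 1,923 consensus-KNOWLEDGE prompts.

Link: https://arxiv.org/abs/2605.28734
Authors: Richard J. Young, Gregory D. Moody
Affiliations: University of Nevada Las Vegas
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 9 figures, 5 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) under a five-judge panel; 6,675 prompts, 33,375 classification calls, Fleiss’ kappa = 0.767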

Abstract:A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon – a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss’ kappa = 0.767 [95% CI 0.755, 0.777] (“substantial”); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen’s kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
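
For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of the standard Fleiss' kappa computation. The input layout (one row per prompt, one count per category, the same number of judges in every row) and the toy numbers in the comments are illustrative assumptions, not the paper's data or code.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of judges who assigned item i to category j.
    Every row must sum to the same number of judges (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of judges (>= 2)")

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_observed - p_expected) / (1.0 - p_expected)
```

With five judges and two categories (say CODE and KNOWLEDGE), a row such as `[5, 0]` is a unanimous item and `[4, 1]` is an item with four agreeing judges.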

[NLP-15] MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

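Quick read: The paper addresses the unreliability of memory systems in large language models (LLMs) and the difficulty of debugging them, in particular locating and repairing systematic errors such as information loss and retrieval misalignment that arise as memory evolves. The key to the solution is a framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operation-level information flow. The authors also build MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to study memory failure modes systematically, and they introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of a failure. These fine-grained attribution signals then guide downstream prompt optimization, forming a closed loop that corrects faults and improves end-task performance by up to 7.62%.

Link: https://arxiv.org/abs/2605.28732
Authors: Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang
Affiliations: Zhejiang University; Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Ongoing work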

Abstract:Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory’s dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at this https URL.

[NLP-16] IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long Multimodal IPO Documents

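Quick read: The paper addresses the difficulty of large-scale, standardized analysis of Initial Public Offering (IPO) filings. IPO filings are long multimodal documents, filed when a company goes public, that combine narrative text and images; they frequently exceed 500,000 tokens and lack consistent structure, which makes existing language and multimodal models hard to apply. The key to the solution is the IPO-Toolkit, an open-source framework that parses IPO filings into section-structured text and extracts embedded images, supporting reproducible large-scale analysis. Built on it, the IPO-Dataset covers more than 109,000 IPO filings and amendments from 1994 to 2026 and over 76,000 images, and it supports structured evaluation tasks on financial charts, including chart quality and misleadingness assessment. Experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks.

Link: https://arxiv.org/abs/2605.28714
Authors: Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava
Affiliations: Georgia Institute of Technology; Sai University; Duke University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages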

Abstract:An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm’s business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

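[NLP-17] Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Quick read: The paper addresses the use of large language models (LLMs) for automatic evaluation of multilingual text, especially the performance bottleneck for low-resource languages and for settings without in-domain data. It systematically compares several strategies, covering instruction translation, monolingual versus multilingual supervision, and model size, with experiments on English, Spanish, and Basque (high-, mid-, and low-resource languages). The key findings are: when in-domain data is available, fine-tuned smaller models can match proprietary models; without in-domain data, zero-shot evaluation with larger models is more effective; and fine-tuning on out-of-domain data can hurt performance. These results give practical guidance for building efficient, reliable multilingual evaluation pipelines.

Link: https://arxiv.org/abs/2605.28710
Authors: Irune Zubiaga, Aitor Soroa, Rodrigo Agerri
Affiliations: HiTZ Center - Ixa, University of the Basque Country EHU
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: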

Abstract:Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

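[NLP-18] The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic ACL EMNLP2026

Quick read: The paper asks whether the performance drops reported by the GSM-Symbolic benchmark, where large language models (LLMs) are tested on template-generated variants of GSM8K problems, really show that the models lack genuine reasoning ability. The key steps are: first, 20 open-weight models are re-evaluated with Generalised Linear Mixed Models with per-question random effects, and only half show statistically significant performance changes under the original prompt format; second, the GSM-Symbolic dataset is found to contain a previously unacknowledged shift toward larger integers relative to GSM-Base (K-S statistic = 0.12, p < 0.001), and controlling for this large-number effect accounts for significance in roughly half of the remaining cases; finally, models with significant deltas show distinct, model-specific failure profiles (fragile variable binding, arithmetic limitations, dual-task interference), which indicates that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.

Link: https://arxiv.org/abs/2605.28700
Authors: Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz Rodríguez
Affiliations: Instituto Superior Técnico INESC-ID, Universidade de Lisboa, Portugal; Dept. of Computer Science and AI DaSCI Institute, Universidad de Granada, Spain
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 38 pages, 11 figures. Submitted to ACL ARR / EMNLP 2026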

Abstract:The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors’ claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
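
The K-S statistic quoted above is the largest vertical distance between two empirical distribution functions. A minimal sketch of the two-sample statistic follows; how the integers are extracted from the problem texts is not described in the abstract, so the inputs are hypothetical, and the p-value computation is left out.

```python
def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    i = j = 0
    largest_gap = 0.0
    while i < n and j < m:
        # Advance both ECDFs past the next distinct value, handling ties together.
        x = min(a_sorted[i], b_sorted[j])
        while i < n and a_sorted[i] <= x:
            i += 1
        while j < m and b_sorted[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / n - j / m))
    # Once one sample is exhausted the gap can only shrink, so the loop may stop.
    return largest_gap
```

In practice a library routine such as SciPy's `ks_2samp` would also return a p-value; the hand-written version is shown only to make the statistic itself concrete.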

[NLP-19] Sense Representations Are Inducible Interfaces

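Quick read: The paper asks how to give a pretrained language model (LM) explicit sense representations, supporting word-sense disambiguation, lexical steering, and cross-lingual alignment, without retraining it. Existing approaches require sense structure to be built in during pretraining, which limits their generality. The key to the solution is ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. While preserving base LM quality, it supports three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering (across 5,161 CoInCo cases, a simple non-oracle proxy recovers about 90% of positive shifts), and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS thus makes sense representations an inducible interface for ordinary pretrained LMs.

Link: https://arxiv.org/abs/2605.28669
Authors: Jan Christian Blaise Cruz, Alham Fikri Aji
Affiliations: MBZUAI; SEACrowd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: this https URL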

Abstract:Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.

[NLP-20] Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

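Quick read: The paper addresses the scarcity of HHH (Helpful, Harmless, Honest)-violating output examples needed to train safety detection models. It investigates whether Activation Steering (AS) can generate training data of sufficient quality for downstream classifiers, through intrinsic and extrinsic evaluation. Intrinsically, it introduces sample- and set-level diversity as a new quality axis and finds that increasing steering strength reduces response diversity. Extrinsically, AS-generated data yields a better classifier than prompting-generated data on 3 of 4 concepts, but only 41 of 136 AS configurations outperform prompting, which indicates that downstream utility depends on jointly satisfying success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, giving practitioners a practical target for tuning AS hyperparameters.

Link: https://arxiv.org/abs/2605.28664
Authors: Vijeta Deshpande, Tootiya Giyahchi, Veena Padmanabhan, Leman Akoglu, Anna Rumshisky
Affiliations: UMass Lowell; Amazon; CMU; Stanford University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: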

Abstract:Safety detection models require examples of HHH (Helpful, Harmless, Honest)-violating outputs for robust generalization; however, such examples are scarce. Activation Steering (AS) has emerged as a data-efficient method for generating target-concept-aligned responses. We investigate whether AS can generate high-quality training datasets for downstream classifiers, a question that remains untested. We present a two-fold study with intrinsic and extrinsic evaluation across 4 concepts × 2 models × 4 steering methods. Intrinsically, beyond the field-standard rubric of steering success (concept alignment) and coherence, we introduce sample- and set-level diversity as a quality axis previously absent from the literature, and find that increasing steering strength reduces response diversity. Extrinsically, we replace HHH-violating examples in the available training data with steered generations and fine-tune detection classifiers. AS-generated data results in a better classifier than the prompting-generated data on 3 of 4 concepts. However, only 41 of 136 AS configurations outperform prompting, indicating that downstream utility lies in a narrow regime that jointly satisfies success, coherence, and diversity. The harmonic mean of these three axes correlates with downstream AUROC more consistently across concepts than success and coherence alone, providing a practical heuristic target for practitioners tuning AS hyperparameters. Together, our results highlight the potential of AS in synthetic data generation for improving safety detection and identify diversity as a critical, previously overlooked axis for tuning AS.
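
The tuning heuristic in the last part of the abstract, the harmonic mean of success, coherence, and diversity, can be sketched as below. The abstract does not say how each axis is scaled, so the assumption that all three are non-negative scores on a comparable scale (for example 0 to 1) and the name `harmonic_mean_score` are illustrative.

```python
def harmonic_mean_score(success: float, coherence: float, diversity: float) -> float:
    """Harmonic mean of the three quality axes.

    Assumption: each axis is a non-negative score on a comparable scale.
    The harmonic mean is dominated by the weakest axis, so a configuration
    must do reasonably well on all three to score well.
    """
    scores = (success, coherence, diversity)
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)
```

The design point is that a configuration with high success and coherence but collapsed diversity still scores near zero, which matches the narrow-regime observation in the abstract.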

[NLP-21] Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes Not Scalpels for Raw Task Vector Model Editing

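Quick read: The paper addresses how to improve large language models (LLMs) in a specific domain (here mathematical reasoning) without the computational cost and catastrophic forgetting of full fine-tuning, that is, how to edit a model precisely and efficiently. Its key move is to reposition Sparse Autoencoders (SAEs). The common approach uses the SAE as a scalpel and projects task vectors onto SAE feature subspaces, but a geometric misalignment between activation-space SAE directions and weight-space task vectors means that roughly 97% of the modification energy is discarded, with no statistically significant improvement. The authors instead propose "SAE as a Stethoscope, Not a Scalpel": SAEs are used for layer-level diagnosis (selecting layers by a specificity score), and unfiltered raw task vectors are injected only into those layers. On the Minerva Math benchmark this raises Number Theory accuracy from 29.6% to 39.4%, with 5 of 7 math subjects significantly improved and none significantly degraded; the method is deterministic and adds no inference cost.

Link: https://arxiv.org/abs/2605.28649
Authors: Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das
Affiliations: Incept Labs; Titan Holdings
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: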

Abstract:LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
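
The information-bottleneck claim can be made concrete with a small sketch of how much of a vector survives projection onto a subspace. Two assumptions are made here that the abstract does not confirm: "modification energy" is taken to be the squared L2 norm, and the subspace is given by an orthonormal basis (real SAE decoder directions are generally not orthonormal and would need orthogonalizing first). The function name is hypothetical.

```python
def retained_energy_fraction(
    task_vector: list[float],
    orthonormal_basis: list[list[float]],
) -> float:
    """Fraction of squared L2 norm kept after projecting onto the basis span.

    Assumes the basis vectors are orthonormal, so the projected energy is
    simply the sum of squared coefficients along each direction.
    """
    total_energy = sum(x * x for x in task_vector)
    if total_energy == 0:
        raise ValueError("task_vector must be non-zero")
    coefficients = [
        sum(b * x for b, x in zip(direction, task_vector))
        for direction in orthonormal_basis
    ]
    return sum(c * c for c in coefficients) / total_energy
```

One minus this fraction is the share of the edit that the projection throws away, which is the kind of quantity the roughly 97% figure in the abstract refers to under these assumptions.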

[NLP-22] MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution EMNLP2026

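Quick read: The paper addresses privacy leakage from GUI agents that rely on screenshots to operate across applications, since those screenshots can contain private messages, medical records, payment credentials, and workplace-specific workflows. Static PII detectors cannot adapt privacy decisions to task, recipient, application state, and user role, and cloud-side vision-language model (VLM) reasoning may upload the raw screen before deciding what to protect. The key to the solution is MaskClaw, an edge-side privacy arbitrator that extracts visual evidence locally, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before a raw screenshot leaves the trusted environment. The authors also build the P-GUI-Evo benchmark from real UI patterns. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots, whereas MaskClaw avoids these failure modes and supports the iterative evolution of privacy skills.

Link: https://arxiv.org/abs/2605.28646
Authors: Yanqiu Zhao, Dongying Zheng, Kaibo Huang, Yukun Wei, Zhongliang Yang, Linna Zhou
Affiliations: Beijing University of Posts and Telecommunications
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Preprint. Submitted to EMNLP 2026. 21 pages, including appendices; 5 figures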

Abstract:GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at this https URL.

[NLP-23] GraphSteal: Structural Knowledge Stealing from Graph RAG via Traversal Reconstruction

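Quick read: Integrating knowledge graphs into Retrieval-Augmented Generation (RAG) systems improves the reasoning of large language models (LLMs), but it also introduces a new risk of structural privacy leakage: through black-box interaction, an attacker can reconstruct the hidden knowledge graph and obtain sensitive entities, relations, and multi-hop dependencies. The key to the approach is a structure-oriented reconstruction framework built on two complementary strategies. Depth-Wise Heuristic Search extracts fine-grained node attributes by recursively expanding entity-centered evidence, and Breadth-Wise Diffusion Search infers graph topology by propagating across relation-induced neighborhoods. Experiments in generic and healthcare scenarios show that the method recovers over 90% of the original knowledge graph from representative Graph RAG systems, and that existing guardrails offer limited defense against the attack, which highlights the inherent difficulty of protecting structural privacy in Graph RAG pipelines.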

Link: https://arxiv.org/abs/2605.28645
Authors: Jinze Gu, Qinghua Mao, Xi Lin, Jun Wu
Affiliations: Shanghai Jiao Tong University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract: Retrieval-Augmented Generation (RAG) enhances LLMs by grounding generation in query-relevant external evidence. Beyond unstructured text corpora, Graph RAG integrates knowledge graphs into the retrieval pipeline, enabling LLMs to access entities, relations, and multi-hop dependencies encoded in structured knowledge. However, the same structured knowledge that empowers Graph RAG also creates a new privacy attack surface. We demonstrate that Graph RAG systems can be turned into structural oracles: through adaptive black-box interactions, an adversary can elicit sufficient relational evidence to reconstruct substantial portions of the hidden knowledge graph. We propose a structure-oriented reconstruction framework that recovers targeted graphs from both local and global perspectives. Specifically, Depth-Wise Heuristic Search extracts fine-grained node attributes by recursively expanding entity-centered evidence, while Breadth-Wise Diffusion Search infers graph topology by propagating across relation-induced neighborhoods. Experiments on generic and healthcare scenarios demonstrate that our method can recover over 90% of the original knowledge graph from representative Graph RAG systems, revealing sensitive entities, relations, and structural dependencies with high fidelity. Existing guardrails provide limited defense against our attack, highlighting the inherent difficulty of safeguarding structural privacy in Graph RAG pipelines.

[NLP-24] GraphLit: Learning Text-Enriched Dynamic Character Network Representations for Literary Study

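Quick read: This paper addresses a gap in existing graph representations of literary texts: they focus mainly on interactions between characters and overlook the textual context in which those interactions take place. The key to the solution is Dynamic Heterogeneous Character Networks (DHCNs), which organize long novels into temporally localized heterogeneous graphs that align characters with their textual contexts, together with GraphLit, a self-supervised learning framework that learns rich literary representations through a masked graph autoencoder objective. Experiments show that GraphLit outperforms text-only and graph-only baselines across 12 character-related tasks, especially on tasks that require contextual understanding. The authors further demonstrate the use of DHCNs and GraphLit for literary analysis, for example by studying the link between narrative non-linearity and dynamic social features.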

Link: https://arxiv.org/abs/2605.28643
Authors: Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara, Mirella Lapata
Affiliations: Deezer Research; Loria; IDIAP; School of Informatics, University of Edinburgh
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Methods to represent literary texts as graphs or sequences of graphs mainly focus on representing character interactions, and often overlook another crucial aspect: the textual context in which characters interact. We introduce Dynamic Heterogeneous Character Networks (DHCNs), which organize long novels into temporally localized heterogeneous graphs that align characters with their textual contexts. We extract around 20,000 DHCNs from Project Gutenberg, and propose GraphLit, a self-supervised learning framework that learns rich literary representations through a masked graph autoencoder objective. Across a wide range of 12 character-related tasks, GraphLit improves over text-only and graph-only baselines, particularly on tasks requiring contextual understanding. Finally, we demonstrate the applicability of DHCNs and GraphLit for literary analysis by studying the link between narrative non-linearity and dynamic social features.

[NLP-25] The Attentional White Bear Effect in Transformer Language Models EMNLP2026

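Quick read: This paper asks whether instruction-based suppression actually removes a language model's internal representation of prohibited content or merely suppresses its expression. The approach combines representational probing, attention analysis, and behavioral semantic leakage experiments across multiple Transformer models to test whether suppressed concepts remain recoverable from hidden representations, whether they affect attention routing, and whether they shape downstream generations. The study finds that even when lexical leakage is successfully avoided, prohibited concepts can still be recovered from hidden representations with high accuracy, continue to influence attention allocation, and measurably change the generated content, revealing a fundamental gap between behavioral and representational alignment.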

Link: https://arxiv.org/abs/2605.28639
Authors: Rebecca Ramnauth, Brian Scassellati
Affiliations: Yale University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Currently under review at EMNLP 2026

Abstract:Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. Our results expose a fundamental gap between behavioral and representational alignment.

[NLP-26] Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents

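Quick read: This paper addresses two problems in mobile agents driven by Multimodal Large Language Models (MLLMs) when they execute user instructions: over-execution, where the agent pushes ahead with a task it cannot complete, and over-soliciting, where an interactive agent relies too heavily on human intervention. The key to the solution is a universal confidence integration framework that enables confidence-driven, proactive, and robust interaction in two stages. In the interaction capability empowerment stage, the agent learns through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, the agent combines semantic similarity retrieval with direct preference optimization to make its confidence estimates more accurate. Experiments show that the resulting Mobile-Aptus achieves the best performance on four popular mobile agent benchmarks (OS-Kairos, AITZ, Meta-GUI, and AndroidControl), with an average improvement of over 17% in task success rate in offline tests and a 26% improvement in real-world dynamic environments, while requiring only 0.64 intervention steps per instruction.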

Link: https://arxiv.org/abs/2605.28629
Authors: Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju, Aston Zhang, Gongshen Liu, Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University; Meta
Subjects: Computation and Language (cs.CL)
Comments: Accepted by TASLP

Abstract: Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show that Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at this https URL.

[NLP-27] Measuring Form and Function in Language Models ACL

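Quick read: This paper asks how to quantitatively evaluate language models against child language acquisition, focusing on two properties of English determiners that children acquire early and accurately: their formal syntactic structure and their functional discourse use. The key to the solution is Contextual Alternative Choice (CAC), a new prompting method that provides targeted tests of a model's syntactic and discourse knowledge and enables direct comparison of language models with children and with independently established statistical benchmarks. The study finds that no current model trained on a comparable amount of data meets the child benchmarks on both form and function at once, although some very large models do, and it offers a new technical route for assessing the cognitive status of language models.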

Link: https://arxiv.org/abs/2605.28616
Authors: Héctor Javier Vázquez Martínez, Charles Yang
Affiliations: University of Pennsylvania
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under review at ACL Rolling Review May 2026 cycle

Abstract: We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks as human children do, but some very large models do. We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.

[NLP-28] Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

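Quick read: This paper addresses two core problems that autonomous agents face when executing tasks in complex workflows. The first is the difficult transition from parsing structured metadata to general environmental perception. The second is that existing methods treat task sequences as discrete, linear episodes, so they fail to capture the underlying transition topology between tasks, which limits adaptability in novel or non-stationary settings. The key to the solution is a new multimodal multi-agent framework with a two-phase pipeline. In an offline discovery phase, it adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents apply Adaptive Retrieval-Augmented Generation (RAG) over the fixed, pre-built graph, combined with a closed-loop collaborative verification protocol for dynamic self-correction and navigation. The graph-based approach improves task decomposition and adaptive navigation, and validation in a real-world setting shows that it maintains high reliability and semantic awareness even with limited training data.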

Link: https://arxiv.org/abs/2605.28607
Authors: Susanna Cifani, Mario Luca Bernardi, Marta Cimitile
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

Abstract:Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

[NLP-29] Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

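Quick read: The reasoning ability of large language models (LLMs) on tasks related to Boolean satisfiability (SAT) remains unclear, and conventional metrics (accuracy, precision, recall, and F1) can give a misleading picture of it. The key to the solution is a new evaluation method based on a paired-formula protocol. It introduces the Accurate Differentiation Rate (ADR), which requires a model to classify both members of a minimally different satisfiable/unsatisfiable pair correctly, and thereby measures logical reasoning more reliably. The paper also tests cross-representation consistency by converting CNF-form SAT problems into Vertex Cover and discrete 3D packing instances, and finds that most models make the same decision across representations on more than 80% of instances, which suggests representation-invariant decision rules. The method separates models that reason from models that rely on heuristics and provides a more robust and representative benchmark for LLM logical reasoning.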

Link: https://arxiv.org/abs/2605.28602
Authors: Leizhen Zhang, Shuhan Chen, Sheng Chen
Affiliations: University of Louisiana at Lafayette; East China Normal University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: Accepted at the ACM International Conference on the Foundations of Software Engineering (FSE 2026)

Abstract:Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics. 
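The contrast between per-instance accuracy and ADR can be illustrated with a small sketch. It assumes only what the abstract states, namely that ADR credits a pair only when both its satisfiable and unsatisfiable members are classified correctly; the function names and the toy predictions are hypothetical and are not taken from the paper's artifact.

```python
from typing import List, Tuple

# Each pair holds a model's predictions for (satisfiable member, unsatisfiable member).
# True means the model answered "satisfiable". The data layout is an assumption.
Pair = Tuple[bool, bool]


def accuracy(pairs: List[Pair]) -> float:
    """Per-instance accuracy: the SAT member should be True, the UNSAT member False."""
    correct = sum(int(sat_pred) + int(not unsat_pred) for sat_pred, unsat_pred in pairs)
    return correct / (2 * len(pairs))


def adr(pairs: List[Pair]) -> float:
    """Pair-level score: a pair counts only if both members are classified correctly."""
    both_correct = sum(1 for sat_pred, unsat_pred in pairs if sat_pred and not unsat_pred)
    return both_correct / len(pairs)


# A hypothetical model that always answers "satisfiable".
always_sat = [(True, True)] * 4
print(accuracy(always_sat))  # 0.5
print(adr(always_sat))       # 0.0
```

A model that over-predicts "satisfiable" still reaches 50% accuracy on balanced pairs (and more on satisfiable-heavy data), while its pair-level score drops to zero, which is the kind of gap the paired protocol is designed to expose.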

[NLP-30] Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

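Quick read: Social agents built on large language models (LLMs) are used to simulate online social behavior, but their realism is hard to validate, especially for short, reactive discourse such as comments on news. Existing work relies mostly on general-purpose benchmarks and lacks fine-grained analysis of specific interaction patterns. The key to the solution is a controlled experimental framework. Using the Spanish online news dataset Hatemedia, the authors pair 5,631 news items with 58,555 real user comments, generate matched synthetic comments with five mainstream LLMs under a shared setting, and systematically compare the distributions of real and synthetic comments along three dimensions: hate speech, sentiment, and semantic alignment. The results show that off-the-shelf LLMs do not simulate real audience reactions well: they underproduce hate speech, show sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves some measures, but unevenly. Qwen3 is the most balanced, while Mistral7B is closest to humans in sentiment and semantics but overshoots the prevalence of hate speech, which shows that plausible generated content is not the same as distributionally realistic content.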

Link: https://arxiv.org/abs/2605.28598
Authors: Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente
Affiliations: University of Murcia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

[NLP-31] Models That Know How Evaluations Are Designed Score Safer

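Quick read: The validity of current AI safety evaluations may be undermined when models behave differently in controlled test settings and in deployment, in particular when a model changes its behavior because it recognizes an evaluation context, which can distort evaluation results. The key contribution is to propose and test a new concept, evaluation meta-knowledge: from information in the training data about the structural traits of evaluations (such as verifiable structures or moral dilemmas), a model learns to recognize evaluation-like contexts and adjusts its behavior without relying on explicit memorization or verbalized evaluation awareness. The authors fine-tune models on synthetic documents and test them on six safety benchmarks. The fine-tuned model is significantly safer, and the behavioral shift persists after responses with explicit verbalization of evaluation awareness are excluded. This shows that evaluation meta-knowledge is a new confounder, independent of previously known ones, with important implications for the design and interpretation of AI safety evaluations.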

Link: https://arxiv.org/abs/2605.28591
Authors: Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi
Affiliations: ELLIS Institute Tübingen; Max Planck Institute for Intelligent Systems; Tübingen AI Center
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at this https URL.

[NLP-32] Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

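Quick read: This paper asks how a reinforcement learning (RL) framework can improve language models on partially verifiable tasks. Such tasks cannot be judged against a single correct answer, for example multi-requirement prompts or tasks with no unique reference answer, and rewards based on hard pass/fail labels give a weak training signal for them.

The key to the solution is the Soft-RLVR framework, which (1) decomposes each prompt into a checklist of atomic requirements, (2) uses a large language model (LLM) as a verifier to score each item individually, and (3) builds a soft reward from these fine-grained scores to train the policy model. This turns sparse pass/fail supervision into a dense partial-credit signal. The paper also formalizes the tradeoff between the noise reduction obtained by averaging item-level judgments and the risk that partial credit rewards incomplete responses, and identifies the conditions under which checklist-based verification gives a more reliable signal than holistic verification. The authors further introduce Soft-SVeRL, a self-verifying variant in which the policy also acts as the verifier, and find that it is prone to reward inflation and needs explicit stabilization to prevent training collapse. In a rule-based instruction-following setting, Soft-RLVR improves IFEval by up to 11.1 points, and both verifier quality and checklist design significantly affect the final RL performance.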

Link: https://arxiv.org/abs/2605.28561
Authors: Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
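The tradeoff described in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's formalization: it assumes the soft reward is the plain mean of per-item scores, that a hard reward passes only when every item passes, and that verifier errors on different items are independent flips. All names and numbers are hypothetical.

```python
import random
import statistics
from typing import List


def soft_reward(item_scores: List[float]) -> float:
    """Mean of per-item verifier scores in [0, 1]; partial credit is allowed."""
    return sum(item_scores) / len(item_scores)


def hard_reward(item_scores: List[float]) -> float:
    """Pass/fail baseline: 1.0 only if every checklist item is judged satisfied."""
    return 1.0 if all(score >= 0.5 for score in item_scores) else 0.0


# A response that meets 3 of 4 hypothetical requirements.
scores = [1.0, 1.0, 0.0, 1.0]
print(soft_reward(scores))  # 0.75
print(hard_reward(scores))  # 0.0


def reward_spread(n_items: int, flip_prob: float, trials: int = 2000, seed: int = 0) -> float:
    """Std. dev. of the soft reward when each item judgment is flipped independently."""
    rng = random.Random(seed)
    truth = [1.0] * n_items  # a response that truly satisfies every item
    rewards = []
    for _ in range(trials):
        judged = [1.0 - s if rng.random() < flip_prob else s for s in truth]
        rewards.append(soft_reward(judged))
    return statistics.pstdev(rewards)


# Averaging over more items shrinks the spread caused by verifier noise.
print(reward_spread(n_items=1, flip_prob=0.2))
print(reward_spread(n_items=8, flip_prob=0.2))
```

The first half shows the partial-credit side of the tradeoff (an incomplete response still earns 0.75), and the second half shows the noise-reduction side under the independence assumption, which may not hold for a real LLM verifier.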

[NLP-33] Cultural Binding Heads in Language Models

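Quick read: This paper addresses the lack of difference awareness in large language models (LLMs) across cultural contexts: models tend to treat all cultural groups equally even when the context calls for differentiation. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark, the authors identify 2-3 mid-layer attention heads per model that causally contribute to cultural binding, the process of associating cultural items with the appropriate identity. Knocking out the identity-to-item edges on these heads lowers binding strength by 9-23%, and the heads transfer between instruct and base models, which suggests that cultural binding forms during pre-training. $\alpha$-scaling gives graded, dose-response control: moderate amplification at $\alpha = 2$-$3$ raises cultural differentiation accuracy by 1-3 percentage points while leaving neutral reasoning largely intact. A knowledge probing task further shows that models know 3-5 times more than they act upon, indicating that the bottleneck is routing rather than missing knowledge.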

Link: https://arxiv.org/abs/2605.28543
Authors: Avrile Floro, Luca Benedetto
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. An $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.

[NLP-34] GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

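Quick read: This paper addresses the poor real-world task completion of Graphical User Interface (GUI) agents built on multimodal large language models, which stems from a lack of world knowledge about GUI operations. Existing solutions usually rely on expensive multi-agent scaffolding or conventional post-training paradigms such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which absorb knowledge only implicitly through action annotations or reward signals and lead to inefficient trajectory memorization rather than genuine understanding. The key to the solution is GUI-CIDER, a mid-training method that explicitly embeds GUI world knowledge in the model through Causal Internalization and Density-aware Exemplar Reselection. It works in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structure and penalizing semantic redundancy; and (3) mid-training, which uses the refined data to embed the acquired knowledge. Experiments on two GUI knowledge benchmarks and three task completion benchmarks show that GUI-CIDER improves both the agent's understanding of GUI operations and its task success rate.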

Link: https://arxiv.org/abs/2605.28534
Authors: Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University; Meituan; Zhejiang University; The Chinese University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rate. The code is available at this https URL.

[NLP-35] Entropy-aware Masking for Masked Language Modeling

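Quick read: This paper addresses the inefficient learning signal produced by random masking in conventional Masked Language Modeling (MLM). The key to the solution is a masking strategy based on the entropy distribution: the entropy of the model's token predictions is used to pick more informative, more uncertain tokens for masking, which makes training more effective. The paper also proposes a self-masking approach that does not depend on an external reference model and further improves training efficiency. Experiments show an average 5% improvement over the baseline on the GLUE benchmark, with the best results obtained when entropy masking is combined with knowledge distillation.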

Link: https://arxiv.org/abs/2605.28526
Authors: Gokul Srinivasagan, Kai Hartung, Munir Georges
Affiliations: AImotion Bavaria; Technische Hochschule Ingolstadt
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at starsem 2026 Conference

Abstract:Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model’s entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
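As a rough illustration of entropy-guided masking, the sketch below ranks positions by the Shannon entropy of a model's predictive distribution and masks the most uncertain ones. The abstract does not say whether tokens are chosen by top-k selection, sampling, or another rule, so the top-k choice here is an assumption, as are the function names and the toy distributions.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one predictive distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def select_mask_positions(token_probs: List[List[float]], mask_ratio: float = 0.15) -> List[int]:
    """Indices of the highest-entropy positions, covering about mask_ratio of the sequence."""
    n_mask = max(1, round(len(token_probs) * mask_ratio))
    ranked = sorted(range(len(token_probs)),
                    key=lambda i: entropy(token_probs[i]),
                    reverse=True)
    return sorted(ranked[:n_mask])


# Toy per-position distributions over a 4-word vocabulary (hypothetical values).
token_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.90, 0.05, 0.03, 0.02],
]
print(select_mask_positions(token_probs, mask_ratio=0.4))  # [1, 3]
```

In a self-masking setup the distributions would come from the model being trained rather than from an external reference model; the selection step itself would look the same under these assumptions.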

[NLP-36] ClinicalEncoder26AM: A Multilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

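Quick read: This paper targets the performance bottleneck of multilingual named entity recognition (NER) in clinical and biomedical text, in particular how to improve recall and cross-lingual consistency for entities such as symptoms, disorders, and procedures with limited annotated data. The key to the solution is ClinicalEncoder26AM, a multilingual Diagnosable ColBERT model that aligns token-level representations at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. Post-training builds on BGE-M3 and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions in a multi-adapter distillation at both the named-entity and sentence levels, together with a ColBERT-style retrieval objective. In the MultiClinNER shared task, the approach achieves state-of-the-art multilingual entity recall and ranks in the top 5 on character-weighted F1, and training curves show that it is markedly more data-efficient than the base M3 model, which supports the value of the clinical post-training.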

Link: https://arxiv.org/abs/2605.28521
Authors: François Remy
Affiliations: Parallia AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract: ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3, and combines synthetic clinical notes, patient–doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while ranking in the top 5 overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded at this https URL.

[NLP-37] On Compositional Learning Behaviours in Formal Mathematics

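Quick read: This paper asks what it takes to build self-evolving scientific agents that can conquer the hard tail of formal mathematics, that is, its most difficult proofs. The central question is whether such agents have Compositional Learning Behaviours (CLBs): the capacity to ground and recombine novel symbolic structures in context, beyond simply recombining pre-learned atomic symbols. The key to the approach is S2B-LM, an adapted version of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit, rather than merely probe, latent CLB competency. The experiments show that CLB competency is necessary for breaking into Olympiad-level proving (miniF2F >75%, p=0.004) but not sufficient, which identifies both a capability bottleneck and a critical path for current models on the hard tail of formal mathematical verification.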

Link: https://arxiv.org/abs/2605.28512
Authors: Kevin Yandoka Denamganaï
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Work in progress, under review

Abstract: Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs): the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F >75%) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.
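To get a feel for the size of such a p-value, here is a back-of-the-envelope calculation. It is not the paper's test procedure: it assumes a simple null in which the models that reach the Olympiad-level tier are a uniformly random subset of the ten provers, and the abstract does not state how many models reach that tier, so the count `n_breakthrough` is a hypothetical input.

```python
from math import comb


def prob_all_in_top(n_models: int, n_top: int, n_breakthrough: int) -> float:
    """P(every breakthrough model lies in the top-n_top CLB group) under random assignment."""
    return comb(n_top, n_breakthrough) / comb(n_models, n_breakthrough)


# Ten provers, five highest CLB scorers, and a varying hypothetical breakthrough count.
for k in range(1, 6):
    print(k, round(prob_all_in_top(10, 5, k), 4))

# With k = 5 the probability is 1/252, which rounds to 0.004.
print(round(prob_all_in_top(10, 5, 5), 3))  # 0.004
```

Under this simplified null, only a breakthrough count of five gives a probability near 0.004; that agreement with the reported value is suggestive at most, since the paper's exact permutation test may use a different statistic.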

[NLP-38] Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

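Quick read: This paper addresses the problem that Large Language Models (LLMs) often generate functionally incorrect code, and in particular how to detect such errors. Uncertainty Quantification (UQ) methods are used to detect hallucinations in natural language generation, but how well they carry over to code generation is unclear. The key to the solution is a new family of methods based on functional equivalence: semantic equivalence judgments that depend on Natural Language Inference (NLI) are replaced with LLM-based functional equivalence assessment, which measures the uncertainty of code outputs more accurately. Functional entropy, a code-specific analog of semantic entropy, achieves the top AUROC in 11 of 15 model-benchmark combinations and the best calibration in most settings, outperforming both the NLI-based counterparts and the other methods evaluated.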

链接: https://arxiv.org/abs/2605.28500
作者: Dylan Bouchard,Mohit Singh Chauhan,Zeya Ahmad,Ho-Kyeong Ra
机构: CVS Health( CVS健康公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
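
The idea of an entropy over functional clusters can be sketched as follows. This is a minimal illustration, not the paper's method: the paper uses an LLM-based functional equivalence assessment, whereas this sketch substitutes a simple stand-in (two samples are treated as equivalent when they behave identically on a few probe inputs). The names `behaviour_signature` and `functional_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def behaviour_signature(fn: Callable, probes: Sequence) -> tuple:
    """Run fn on each probe input; exceptions become part of the signature."""
    outputs = []
    for x in probes:
        try:
            outputs.append(repr(fn(x)))
        except Exception as exc:  # a crash is also observable behaviour
            outputs.append(f"error:{type(exc).__name__}")
    return tuple(outputs)


def functional_entropy(samples: Sequence[Callable], probes: Sequence) -> float:
    """Shannon entropy (in nats) over clusters of behaviourally identical samples."""
    counts = Counter(behaviour_signature(fn, probes) for fn in samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Three sampled "solutions": two are functionally the same, one differs.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
entropy = functional_entropy(samples, probes=[0, 1, 2, 3])
```

Textually different but functionally identical samples (`x * 2` and `x + x`) fall into one cluster, so the entropy reflects disagreement in behaviour rather than in surface form.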

[NLP-39] A new semantically annotated corpus with syntactic-semantic and cross-lingual senses

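Quick read: This paper targets word sense disambiguation (WSD) for French polysemous verbs and builds a richly annotated corpus to support research and model training. The corpus consists of instances of 20 French polysemous verbs, and each verb instance carries three sense labels: (1) the actual translation of the verb in the English version of that instance in a parallel corpus; (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables); and (3) a fine-grained sense label formed by concatenating the first two. The result is a multi-level annotation scheme, from coarser to finer senses, intended to support more precise and interpretable WSD.

Link: https://arxiv.org/abs/2605.28494
Authors: Myriam Rakho,Eric Laporte,Matthieu Constant
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: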

Abstract:We describe a new sense-tagged corpus for word sense disambiguation. The corpus consists of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the English version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables), and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.

[NLP-40] Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish

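Quick read: This paper addresses the state explosion that arises when finite-state transducers (FSTs) are composed for context-dependent morphophonological rules (consonant gradation, vowel harmony, possessive suffix assimilation), and the fact that neural models provide no formal account of the rules themselves. The key to the approach is a comonad-based algebraic framework. Its central construction, the Writer comonad (DeletionSet x Zipper), lets each rule be modelled as a function from a focused local context to a single output segment (the type of a local rule in cellular automata) and composed as a coKleisli arrow. Length-changing rules regain strict coKleisli compositionality, and deletions accumulate as a monoid action instead of requiring intermediate materialization. Thirteen coKleisli arrows express the behaviours that Omorfi encodes via 874 continuation classes (a 67:1 reduction at the rule-representation level), and the same arrows are reused for both analysis and generation. On UD Finnish-TDT, the system reaches 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), supporting the framework as a practical morphological engine.

Link: https://arxiv.org/abs/2605.28484
Authors: Yongseok Jang
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL)
Comments: 13 pages. Accepted at the Society for Computation in Linguistics (SCiL) 2026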

Abstract:Composing finite-state transducers (FSTs) for context-dependent morphophonological rules – consonant gradation, vowel harmony, possessive suffix assimilation – leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment – the type of a local rule familiar from cellular automata – and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology – a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.
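
The "local rule lifted by extend" pattern can be sketched in a few lines. This is only an illustration under simplifying assumptions: it covers a length-preserving rule (a simplified Finnish vowel harmony on hypothetical archiphonemes `A`, `O`, `U`), returns a plain list rather than a zipper of results, and does not model the paper's Writer comonad or deletions. `Zipper`, `extend`, and `harmony` are hypothetical names.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable, List, Tuple

NFC = lambda s: unicodedata.normalize("NFC", s)
FRONT = set(NFC("äöy"))
BACK = set("aou")
# Archiphoneme -> (back realisation, front realisation)
ARCHI = {"A": ("a", NFC("ä")), "O": ("o", NFC("ö")), "U": ("u", "y")}


@dataclass(frozen=True)
class Zipper:
    left: Tuple[str, ...]   # segments before the focus
    focus: str              # the segment being rewritten
    right: Tuple[str, ...]  # segments after the focus


def extend(rule: Callable[[Zipper], str], word: str) -> List[str]:
    """Apply a local rule at every focus position of the word."""
    segs = tuple(NFC(word))
    return [rule(Zipper(segs[:i], segs[i], segs[i + 1:])) for i in range(len(segs))]


def harmony(z: Zipper) -> str:
    """Resolve an archiphoneme from the nearest harmonic vowel to its left."""
    if z.focus not in ARCHI:
        return z.focus
    back, front = ARCHI[z.focus]
    for seg in reversed(z.left):
        if seg in BACK:
            return back
        if seg in FRONT:
            return front
    return front  # assumption: stems with only neutral vowels take front variants


house = "".join(extend(harmony, "talossA"))
village = "".join(extend(harmony, NFC("kylässA")))
```

Each rule sees the whole focused context but emits a single segment, so rules of this type can be chained by applying `extend` repeatedly.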

[NLP-41] Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

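Quick read: Existing evaluations of large language models (LLMs) treat them as single-turn text generators and so fail to capture the divergent thinking, a core dimension of creativity, that an agent shows during iterative interaction. The authors introduce MUTATE, a benchmark that evaluates agentic divergent thinking at two levels: path-level, where an agent discovers multiple alternative paths to the same goal, and action-level, where individual actions require non-typical, mechanism-shifting object uses. Unlike success-only evaluations, MUTATE scores both completed paths and off-path attempts, preserving divergent reasoning that conventional success rates discard. Experiments with frontier LLMs reveal a structural blind spot in existing frameworks: under immediate convergence pressure, they fall into immediate action fixation and fail to improve action-level divergence. The authors therefore propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA improves both divergence levels and generalizes to an external creativity environment; its gains stem from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.

Link: https://arxiv.org/abs/2605.28465
Authors: Jihyeong Park,Ingeol Baek,Jeonghyun Park,Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 16 figures, 19 tables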

Abstract:Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single-turn text generations, failing to capture how an agent reasons through iterative interaction. To address this, we introduce MUTATE, an interactive benchmark designed to evaluate agentic divergent thinking at two levels: path-level, where an agent discovers multiple alternative paths to the same goal, and action-level, where individual actions require non-typical, mechanism-shifting object uses. Unlike success-only evaluations, MUTATE scores both completed paths and off-path attempts, capturing divergent reasoning that conventional success rates discard. Our experiments with frontier LLMs reveal a structural blind spot in existing frameworks: when exposed to immediate convergence pressure, they tend to fall into immediate action fixation, failing to improve action-level divergence. To overcome this, we propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods across both divergence levels and generalizes effectively to an external creativity environment. We also confirm its success stems from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.

[NLP-42] The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

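Quick read: This paper addresses a blind spot in how Legal AI is evaluated in the criminal domain. Legal Judgment Prediction (LJP) only sees cases that have passed prosecutorial review and been formally indicted, so it overlooks cases that were not prosecuted because of insufficient evidence, no criminal liability, or guilt exempted from punishment, and gives an incomplete picture of criminal liability assessment. The authors propose Prosecution Decision Prediction (PDP), the first Legal AI task built around prosecutorial review. It classifies each case into prosecution or one of three non-prosecution decisions and reflects capabilities in evidence evaluation, legal subsumption, and value-based discretion. The key to the approach is PDP-Bench, a benchmark of 4,630 real Chinese prosecutorial decisions spanning 190 charges. Experiments show that state-of-the-art LLMs perform substantially worse on PDP than on LJP, that mainstream enhancement routes fail to close the gap, and that simple outcome rewards fail to produce generalizable PDP discrimination.

Link: https://arxiv.org/abs/2605.28464
Authors: Junyu Lu,Qi Wei,Peishuo Zheng,Jie Zhang,Hui Huang,Qianru Wang,Chuan Xiao,Jianbin Qin,Shuyuan Zheng
Affiliations: Beijing Institute of Technology, Zhuhai; Osaka University; Xi’an University of Technology; Institute of Science Tokyo; City University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures, 22 tables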

Abstract:Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases that have already passed prosecutorial review and been formally indicted. As a result, LJP leaves a substantial blind spot in assessing criminal liability, overlooking cases involving insufficient evidence, no criminal liability, or guilt exempted from punishment. To fill this gap, we propose Prosecution Decision Prediction (PDP), the first Legal AI task built around prosecutorial review, which classifies each case into prosecution or one of three non-prosecution decisions and reflects legal AI’s capabilities in evidence evaluation, legal subsumption, and value-based discretion. We further construct PDP-Bench, a benchmark of 4,630 real Chinese prosecutorial decisions spanning 190 charges. Extensive experiments show that state-of-the-art LLMs perform substantially worse on PDP than on LJP and that mainstream enhancement routes fail to close the gap. Moreover, controlled RLVR interventions show that simple outcome rewards fail to produce generalizable PDP discrimination.

[NLP-43] AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

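Quick read: This paper addresses a gradient asymmetry in Direct Preference Optimization (DPO): during training the loss suppresses dispreferred responses much faster than it promotes preferred ones, so the model learns to avoid bad answers rather than to generate good ones. The key to the approach is AdaDPO, a self-adaptive DPO variant that introduces per-preference-pair, stop-gradient-based coefficients derived from the policy model's generation probabilities (with the reference model's probabilities as an optional component) to enforce equal gradient magnitudes for preferred and dispreferred responses. The practical implementation balances per-token gradients and applies a numerical clipping bound for stability while retaining DPO's original hyperparameter structure. With Llama-3-8B-Instruct trained on UltraFeedback, AdaDPO consistently outperforms DPO on AlpacaEval 2, notably in length-controlled win rate (LC), and mitigates length bias. Because it changes only the loss, AdaDPO can be integrated into existing preference-based alignment pipelines without changing data collection or model architecture, and the same self-adaptive principle extends to pairwise contrastive losses including SimPO, R-DPO, IPO, CPO, and ORPO.

Link: https://arxiv.org/abs/2605.28440
Authors: Shaolong Chen,Madalina Ciobanu,Qingqing Mao,Ritankar Das
Affiliations: Incept Labs; Titan Holdings
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 5 figures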

Abstract:DPO has become a widely adopted alternative to RLHF for aligning LLMs with human preferences, eliminating the need for a separate reward model or RL loop. Recent theoretical analysis uncovers an asymmetric gradient behavior in DPO: the loss suppresses dispreferred responses substantially faster than it promotes preferred ones, causing the model to learn to avoid bad answers rather than to generate good ones. We propose AdaDPO, a Self-Adaptive variant of the DPO algorithm that introduces per-preference-pair, stop-gradient-based coefficients derived directly from the policy model’s generation probabilities, with the reference model’s probabilities as an optional component. AdaDPO is constructed to enforce equality of gradient magnitudes between preferred and dispreferred probabilities; the practical implementation balances per-token gradients and applies a numerical clipping bound for stability, while retaining DPO’s original hyperparameter structure. On Llama-3-8B-Instruct trained on UltraFeedback under a SimPO similar setup, AdaDPO consistently outperforms DPO on AlpacaEval 2: it achieves higher length-controlled win rates (LC) in 81% of hyperparameter combinations, attains the global best LC (48.3%) and raw win rate (46.1%), and enlarges the LC-over-WR margin in 88% of combinations, indicating effective mitigation of length bias. Additional analyses on KL divergence, reward margin, and reward accuracy confirm that AdaDPO rectifies the gradient imbalance and yields more efficient optimization. Because it operates purely at the loss level, AdaDPO can be dropped into existing preference-based alignment pipelines without changing data collection or model architectures. The method requires only a few lines of code, and the same self-adaptive principle generalizes to a broad family of pairwise contrastive preference losses including SimPO, R-DPO, IPO, CPO, and ORPO.
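
One way to see an asymmetry of this kind is to write the standard DPO loss for a single pair in terms of sequence probabilities and differentiate it with respect to each probability. The sketch below does only that; it does not reproduce AdaDPO's coefficients, which the abstract does not specify. The function names and the numeric values are illustrative assumptions.

```python
import math


def _margin(pw: float, pl: float, ref_w: float, ref_l: float, beta: float) -> float:
    """beta * (log-ratio of the preferred minus log-ratio of the dispreferred)."""
    return beta * ((math.log(pw) - math.log(ref_w)) - (math.log(pl) - math.log(ref_l)))


def dpo_loss(pw, pl, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(margin)."""
    u = _margin(pw, pl, ref_w, ref_l, beta)
    return -math.log(1.0 / (1.0 + math.exp(-u)))


def grad_magnitudes(pw, pl, ref_w, ref_l, beta=0.1):
    """Analytic |dL/dpw| and |dL/dpl|; both share the factor beta * (1 - sigmoid(u))."""
    u = _margin(pw, pl, ref_w, ref_l, beta)
    shared = beta * (1.0 - 1.0 / (1.0 + math.exp(-u)))
    return shared / pw, shared / pl


# Illustrative probabilities: preferred 0.2, dispreferred 0.05, reference 0.1 each.
g_pref, g_disp = grad_magnitudes(0.2, 0.05, 0.1, 0.1)
ratio = g_disp / g_pref  # equals pw / pl
```

Because the two gradients differ only by the factors 1/pw and 1/pl, the ratio of their magnitudes is pw/pl, so the less likely the dispreferred response already is, the harder the loss pushes on it relative to the preferred one.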

[NLP-44] Breaking the Script Barrier: Enabling Automatic Alignment for PoS-based ASR Error Analysis in Non-Latin Scripts

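Quick read: This paper addresses the lack of fine-grained linguistic error analysis in the evaluation of automatic speech recognition (ASR) systems, and in particular the unreliability of existing alignment tools for languages written in non-Latin scripts. The key to the approach is a robust, automated, language-agnostic alignment mechanism that applies across ASR architectures and across Latin and non-Latin writing systems, covering Abugida, Alphabetic, and Abjad scripts. It yields consistent alignment of hypotheses, references, and evaluation sequences, which forms the basis for downstream linguistic analysis such as Part-of-Speech (PoS)-wise error analysis. The authors then use standard PoS taggers for scalable, reproducible error analysis and show that this error information can be used during ASR training to improve metrics such as word error rate (WER).

Link: https://arxiv.org/abs/2605.28438
Authors: Prasenjit K Mudi,Dahlia Devapriya,Sheetal Kalyani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: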

Abstract:Automatic Speech Recognition (ASR) systems are commonly evaluated using aggregate metrics such as Word Error Rate (WER), which do not capture the linguistic structure of errors. Fine-grained analysis, such as Part-of-Speech (PoS)-wise error characterization, requires accurate alignment between ASR hypotheses and reference transcriptions. However, existing alignment tools are often unreliable for languages written in non-Latin scripts. In this work, we address this gap by proposing a robust, automated, language-agnostic alignment mechanism applicable across ASR architectures and across languages written in both Latin and non-Latin scripts. This enables consistent alignment of hypotheses, references, and evaluation sequences, forming the basis for downstream linguistic analysis. Building on this, we employ standard PoS taggers to perform scalable and reproducible PoS-wise error analysis. Notably, we perform alignment and downstream ASR error analysis across three major segmented writing systems, namely, Abugida (Tamil, Hindi, Kannada), Alphabetic (English, Russian, Greek), and Abjad (Arabic). We further demonstrate how such error information can be leveraged during ASR training to improve metrics such as WER.
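
The abstract does not describe the alignment mechanism itself. As a generic point of reference, the sketch below shows a plain word-level edit-distance alignment with Unicode NFC normalization, which is the usual starting point for WER-style analysis; it is an assumption-laden baseline, not the authors' method, and `align` and `wer` are hypothetical names.

```python
import unicodedata


def align(ref: str, hyp: str):
    """Return a list of (op, ref_word, hyp_word) with op in ok/sub/del/ins."""
    r = unicodedata.normalize("NFC", ref).split()
    h = unicodedata.normalize("NFC", hyp).split()
    n, m = len(r), len(h)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and r[i - 1] == h[j - 1]) else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            ops.append(("ok" if cost == 0 else "sub", r[i - 1], h[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", r[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, h[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ops = align(ref, hyp)
    errors = sum(1 for op, _, _ in ops if op != "ok")
    n_ref = sum(1 for _, ref_word, _ in ops if ref_word is not None)
    return errors / n_ref
```

Once each reference word is paired with an operation, a PoS tag attached to the reference word can be used to bucket errors by category. Whitespace tokenization is itself an assumption that holds only for segmented scripts.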

[NLP-45] Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

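Quick read: This paper addresses the difficulty of adapting the role pool in role-based LLM multi-agent systems. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation that induces role drift and breaks structural contracts such as capability coverage, message compatibility, validation, final-answer aggregation, and output protocols. The key to the approach is contract-preserving role evolution, which requires every committed edit to preserve five structural contracts. It is instantiated in SERO, whose components are credit-guided retrieval, a credit-ranked communication directed acyclic graph (DAG) with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller that commits LLM-proposed edits only when they preserve the contracts and improve the task score. Experiments on real-world reasoning benchmarks across three LLM backbones support the value of the approach.

Link: https://arxiv.org/abs/2605.28433
Authors: Ling-Yue Ge,Lan-Zhe Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 33 pages, 23 figures, 12 tables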

Abstract:Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final-answer aggregation, and parser-compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract-preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self-Evolving Role Orchestration framework that evolves a typed role-card pool through credit-guided retrieval, a credit-ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller whose LLM-proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution.

[NLP-46] Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Quick read: This paper asks how skill-based reinforcement learning (RL) should balance general skills against task-specific skills, avoiding both the prohibitive context overhead of full externalization and the overfitting and knowledge conflicts of full internalization. The key to the approach is Skill0.5, an agentic RL framework in which a dynamic, difficulty-aware router streams tasks into distinct mastery tiers and applies a tailored optimization strategy to each: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, and it uses diagnostic probing on easy tasks to penalize shortcuts and enforce use of specific skills. On ALFWorld and WebShop, Skill0.5 outperforms memory-based and skill-based RL baselines in both in-distribution and out-of-distribution scenarios.

Link: https://arxiv.org/abs/2605.28424
Authors: Jiapeng Zhu,Jianxiang Yu,Yibo Zhao,Chengcheng Han,Qi Gu,Xunliang Cai,Xiang Li,Weining Qian
Affiliations: East China Normal University; Meituan Longcat Team
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.

[NLP-47] FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

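Quick read: This paper addresses two problems: large language models are unreliable at judging the correctness of their own mathematical solutions, and existing self-verification approaches treat solution generation and verification as separate tasks, which substantially increases training time. The key to the approach is FABSVer, which fuses the two tasks into a single generation pass, reducing training overhead while jointly optimizing both capabilities. The authors also identify, theoretically and empirically, a convergence bottleneck in which the reward plateaus because the policy is constrained by a fixed reference model, and they introduce Dynamic Reference Model Update (DRMU) to raise the reward ceiling and sustain reward growth. On math benchmarks, FABSVer achieves better self-verification and reasoning performance across three model scales while requiring only 51%–71% of the training time of existing methods.

Link: https://arxiv.org/abs/2605.28389
Authors: Haihui Pan,Junwei Bao,Hongfei Jiang,Yang Song
Affiliations: Zuoyebang Education Technology
Categories: Computation and Language (cs.CL)
Comments: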

Abstract:While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%–71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.

[NLP-48] PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature ACL25

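Quick read: Prion diseases are rare neurodegenerative disorders that are difficult to diagnose early because of nonspecific clinical presentations, and to the authors' knowledge there has been no publicly available prion-disease-focused dataset covering a broad range of clinically relevant entities. This paper introduces PrionNER, a manually annotated named entity recognition (NER) dataset built from PubMed abstracts. It comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types (diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence). Its distinguishing features are the clinically oriented label scheme and strong annotation consistency (inter-annotator exact-match F1 of 81.78). The dataset offers a challenging benchmark for rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions.

Link: https://arxiv.org/abs/2605.28375
Authors: An Dao,Nhan Ly,Thao Tran,Yuji Matsumoto,Akiko Aizawa
Affiliations: The University of Tokyo, Tokyo, Japan; Tohoku University, Sendai, Japan; RIKEN Center for Advanced Intelligence Project, Tokyo, Japan; National Institute of Informatics, Tokyo, Japan
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)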

Abstract:Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at this https URL.
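
For readers unfamiliar with exact-match F1 as an agreement measure, the sketch below computes it between two annotators' span sets, where a span counts as matched only if start, end, and label are all identical. The abstract does not spell out the paper's exact protocol, so this is a generic illustration; the example spans and labels are made up.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, entity label)


def exact_match_f1(spans_a: Iterable[Span], spans_b: Iterable[Span]) -> float:
    """F1 between two annotation sets; a match requires identical boundaries and label."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of one sentence by two annotators.
annotator_1 = [(0, 5, "Disease"), (10, 18, "Symptom"), (20, 25, "Anatomy")]
annotator_2 = [(0, 5, "Disease"), (10, 17, "Symptom"), (20, 25, "Anatomy")]
agreement = exact_match_f1(annotator_1, annotator_2)
```

A one-character boundary disagreement on the second span is enough to make it a miss, which is why exact-match agreement is a strict measure. The figure reported in the abstract is on a 0 to 100 scale.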

[NLP-49] Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

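Quick read: When Lean is used to judge natural-language mathematical answers, its signal is partial and unreliable: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact rather than a wrong answer. The paper proposes COVCAL, a selector over Lean-trace diagnostics that either certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility hinges on autoformalization coverage: with a 7B autoformalizer, which yields proofs for only 28% of problems, the signal is too sparse and the Bonferroni rule abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and becomes feasible on 17 of 20 partitions, accepting approximately 48% of problems at 0.98 accepted accuracy. The contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.

Link: https://arxiv.org/abs/2605.28365
Authors: Pauline Bourigault,Xiaotong Ji,Matthieu Zimmer,Rasul Tutunov,Haitham Bou Ammar
Affiliations: Imperial College London; Huawei Noah’s Ark Lab; UCL Centre for AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: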

Abstract:Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is the proof-winning answer is correct 96% of the time at high proved coverage but 20% at low, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
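
The accept-or-abstain logic can be sketched with a textbook construction: for each candidate threshold on a confidence score, bound the error rate among accepted answers with a Hoeffding bound, apply a Bonferroni correction across thresholds, and abstain when no threshold meets the target risk. This is an assumed simplification, not COVCAL itself; the abstract does not give the actual bound or the diagnostics used, and `select_threshold`, `alpha`, and `delta` are hypothetical names.

```python
import math
from typing import Optional, Sequence


def select_threshold(
    scores: Sequence[float],
    correct: Sequence[int],
    thresholds: Sequence[float],
    alpha: float = 0.05,
    delta: float = 0.1,
) -> Optional[float]:
    """Return the lowest threshold whose upper-bounded error on accepted
    calibration answers is <= alpha, or None to abstain entirely."""
    k = len(thresholds)
    feasible = []
    for t in thresholds:
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        n = len(accepted)
        if n == 0:
            continue
        empirical_error = 1.0 - sum(accepted) / n
        # Hoeffding bound at level delta / k (Bonferroni over k thresholds).
        upper = empirical_error + math.sqrt(math.log(k / delta) / (2 * n))
        if upper <= alpha:
            feasible.append(t)
    return min(feasible) if feasible else None


# Dense calibration data: accepting only high-score answers is certifiable.
dense = select_threshold([0.9] * 400 + [0.1] * 100,
                         [1] * 400 + [1] * 50 + [0] * 50,
                         thresholds=[0.0, 0.5], alpha=0.1)
# Sparse calibration data: the bound is too loose, so the rule abstains.
sparse = select_threshold([0.9] * 10, [1] * 10, thresholds=[0.0, 0.5], alpha=0.1)
```

The sparse case mirrors, qualitatively, the abstract's observation that low coverage makes the conservative rule abstain: with few accepted examples the confidence term dominates even when every accepted answer is correct.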

[NLP-50] PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text EMNLP2026

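Quick read: This paper addresses the difficulty of evaluating causal relation extraction (CRE) in biomedical text. Existing resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues, so they are of limited use for testing whether models recover causal claims as actually expressed in the literature. The key to the approach is PubMedCausal, a span-level annotated corpus built from PubMed abstracts, containing 30,000 paragraph-level rows, of which 3,945 are causal rows, and 6,491 adjudicated cause-effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, which supports evaluation of both causal detection and full-span extraction. Biomedical encoders are strongest for detection (PubMedBERT, F1 = 0.7391), and the best generative baseline for span-level extraction is DeepSeek-R1-32B with few-shot prompting (Cosine Pair F1 = 0.6765). Overall, the task remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity.

Link: https://arxiv.org/abs/2605.28363
Authors: Ifeoluwa Kunle-John,Josiah Paul,Oluwatosin Agbaakin,Peter Aina,Ikenna Odezuligbo,Sydney Anuyah
Affiliations: Edyah Limited; University of Ibadan; Indiana University; Creighton University
Categories: Computation and Language (cs.CL)
Comments: Submitted to EMNLP 2026, 8 Pages, 23 page appendix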

Abstract:Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause-effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F1 score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F1 of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and Data can be found here: this https URL

[NLP-51] When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

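Quick read: Vision-language models (VLMs) are mostly evaluated on whether they identify the right visual content, with little attention to whether they express it in a discourse-appropriate form. This paper tests whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering, that is, whether they respect information structure (IS). The key to the approach is the use of Hungarian, in which Topic and Focus map onto dedicated syntactic positions, so IS choices are observable in text. Comparing six VLMs with human participants shows that the models produce IS-relevant constructions but over-regularise: humans vary their realisation strategies under the interacting pressures of discourse status, grammatical role, and definiteness, whereas VLMs collapse onto narrow response templates, resembling mode collapse. The findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

Link: https://arxiv.org/abs/2605.28346
Authors: Marcell Fekete,Johannes Bjerva,Tamás Káldi
Affiliations: Aalborg University; ELTE Research Centre for Linguistics; ELTE Bárczi Gusztáv Faculty of Special Needs Education
Categories: Computation and Language (cs.CL)
Comments: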

Abstract:Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

[NLP-52] HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

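Quick read: General-purpose machine translation benchmarks such as FLORES-200 have saturated on Chinese-English: modern large language models (LLMs) cluster within a narrow band of high scores, which makes systems hard to separate, especially in knowledge-intensive domains such as finance, healthcare, law, and science and technology. The paper introduces HardMTBench, a difficulty-aware diagnostic benchmark. It (1) covers 12 domains with 10,000 hand-curated source sentences packaged as 20,000 directional test items; (2) is built with a three-stage pipeline that assembles a domain-balanced candidate pool and applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load, and reference correctness; and (3) assembles the final test set under a hardness fusion rule with per-domain quotas. The result widens the cross-system GEMBA range by roughly a factor of two over FLORES-200 and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten.

Link: https://arxiv.org/abs/2605.28315
Authors: Zheng Li,Mao Zheng,Mingyang Song,Tianxiang Fei
Affiliations: Tencent
Categories: Computation and Language (cs.CL)
Comments: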

Abstract:General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems on knowledge-intensive domains such as finance, healthcare, law, and science and technology. We introduce HardMTBench, a difficulty-aware diagnostic benchmark for bidirectional Chinese-English domain translation. HardMTBench covers 12 domains and contains 10,000 hand-curated source sentences with reference translations, packaged as 20,000 directional test items. A three-stage construction pipeline builds a domain-balanced candidate pool of 84,566 pairs, applies an LLM-based multi-signal judge over knowledge density, translation difficulty, terminology load and reference correctness, and assembles the final test set under a hardness fusion rule with per-domain quotas. Across 22 systems spanning general LLMs, commercial engines and specialised MT models, HardMTBench widens the cross-system GEMBA range by roughly a factor of two over FLORES-200, induces visible rank reorderings, and exposes domain-specific terminology and knowledge weaknesses that quality-only metrics tend to flatten. All data and code are open-sourced at this https URL.

[NLP-53] Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach

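Quick read: This paper asks how well large language models (LLMs) can judge argument quality, measured by agreement with human experts on the logical, rhetorical, and dialectic dimensions. The approach evaluates models under zero-shot, few-shot, and chain-of-thought prompting on pairwise comparisons, then fits a Bradley-Terry model to those comparisons to infer latent strength scores and derive a ranking of arguments. Agreement with human experts is only moderate (Llama-70B is strongest, with Cohen's κ = 0.493), but predictions are stable across trial runs (fewer than 7.75% of cases yield different labels), and the remaining variability is handled with majority voting and few-shot prompting. The models appear to have a partial but complementary understanding of the underlying quality dimensions.

Link: https://arxiv.org/abs/2605.28313
Authors: Nicolás Benjamín Ocampo,Agnes Paullate Nyiranziza,Davide Ceolin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: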

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic) and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen’s κ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
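
How pairwise judgments turn into latent strength scores can be sketched with the classic minorization-maximization update for the Bradley-Terry model. This is a generic fit, not the authors' code; the abstract does not say which estimator they used, and `fit_bradley_terry` and the toy comparisons are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def fit_bradley_terry(
    comparisons: List[Tuple[Hashable, Hashable]], n_iter: int = 200
) -> Dict[Hashable, float]:
    """comparisons is a list of (winner, loser); returns strengths summing to 1.

    Assumes every item wins at least once; an item with zero wins is driven
    to strength 0 by this update.
    """
    items = sorted({x for pair in comparisons for x in pair})
    wins = defaultdict(int)
    games = defaultdict(int)  # games[(i, j)] = number of comparisons between i and j
    for winner, loser in comparisons:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1
    p = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(games[(i, j)] / (p[i] + p[j]) for j in items if j != i)
            updated[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(updated.values())
        p = {i: v / total for i, v in updated.items()}
    return p


# Toy data: argument A is usually preferred to B, and B to C.
pairs = ([("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 3 + [("C", "B")]
         + [("A", "C")] * 3 + [("C", "A")])
strengths = fit_bradley_terry(pairs)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Under the model, the probability that item i beats item j is p_i / (p_i + p_j), so sorting by the fitted strengths gives the ranking.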

[NLP-54] HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment

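[Quick read]: This paper addresses an evaluation bias in Entity Alignment (EA) for knowledge graph (KG) fusion: existing benchmarks let models rely on entity-name overlap rather than relational structure, so they cannot test whether a model can tell apart same-name entities that refer to different things. The key contribution is a same-name hard-negative augmentation strategy that mines same-name but distinct entity pairs from KG name-collision groups, yielding quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train). On top of this, the paper proposes HELEA, a two-stage framework: the first stage trains an entity-encoder retrieval module on the hard-negative-augmented corpora with 1-hop KG context, and the second stage applies LLM-based reranking that requires no additional training. Experiments show that name-dependent baselines collapse to near-random performance on the new benchmarks, while HELEA reaches F1 = 0.967 on DW-HN29K and keeps Hit@1 = 0.993 on the standard DW-15K.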

Link: https://arxiv.org/abs/2605.28308
Authors: Yoonjin Jang, Junwoo Kim, Youngjoong Ko
Affiliations: SungKyunKwan University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 3 figures, 9 tables. Code and benchmarks available at this https URL

Abstract:Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that simultaneously yields quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train), by mining same-name but distinct entity pairs from KG name-collision groups. We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, while HELEA achieves F1 0.967 on DW-HN29K while maintaining Hit@1 0.993 on standard DW-15K.

[NLP-55] Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

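[Quick read]: This paper addresses the difficulty of adapting Mixture-of-Experts (MoE) models to non-English downstream tasks. Existing fine-tuning methods treat an MoE model as a monolithic learner and ignore the heterogeneous routing structure that forms during pretraining. The authors find that the middle layers form a language-universal alignment zone in which routing divergence strongly predicts per-language task performance gaps. The key to the solution is RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework: it first sorts task examples into four classes (cc/ci/ic/ii) according to correctness in English and in the target language, then identifies task-relevant experts in the middle layers, and finally adds a routing alignment loss that pushes target-language routing on ci-type examples toward the English task-expert activation pattern. Experiments on three MoE models, three tasks, and six target languages show that RA-MoE outperforms standard SFT and strong baselines such as Routing Steering and RISE, and that the proportion of ci-type examples is a reliable predictor of the alignment benefit.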

Link: https://arxiv.org/abs/2605.28306
Authors: Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song
Affiliations: City University of Hong Kong; Carnegie Mellon University; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.
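
The four-way taxonomy and the ci proportion mentioned in the abstract can be sketched as follows. This is a hedged illustration only: it assumes the first letter of each label refers to correctness in English and the second to correctness in the target language (the order in which the abstract names them), and the function names `categorize` and `ci_proportion` are hypothetical.

```python
from collections import Counter


def categorize(correct_en, correct_tgt):
    """Map a parallel example to cc/ci/ic/ii.

    Assumption: first letter = English outcome, second = target-language
    outcome, with 'c' for correct and 'i' for incorrect.
    """
    return ("c" if correct_en else "i") + ("c" if correct_tgt else "i")


def ci_proportion(results):
    """Share of examples solved in English but missed in the target language."""
    if not results:
        return 0.0
    counts = Counter(categorize(en, tgt) for en, tgt in results)
    return counts["ci"] / len(results)


# Made-up outcomes for five parallel examples: (English correct, target correct).
results = [(True, True), (True, False), (False, True), (False, False), (True, False)]
print(ci_proportion(results))
```

Under this reading, the ci class contains the examples where English routing already leads to a correct answer, which is where the abstract says the routing alignment loss is applied.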

[NLP-56] Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

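[Quick read]: Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, marked by anthropomorphic tokens such as "wait", "hmm", and "alternatively". Whether these markers reflect a genuine reflection mechanism is unclear, and redundant, repetitive markers carry a risk of overthinking. The key to the study is to suppress the markers through prompt-level and token-level interventions and to measure the effect on task performance across four benchmarks and two model scales. The results show that the markers are not uniformly necessary for reasoning performance: suppressing them preserves or even improves performance in several settings, especially under larger sampling budgets, and models can still perform marker-free verification. This suggests the markers are surface cues rather than reliable proxies for reflection itself, and motivates future work on reasoning mechanisms beyond explicit marker patterns.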

Link: https://arxiv.org/abs/2605.28305
Authors: Yahan Yu, Noa Nakanishi, Fei Cheng
Affiliations: Kyoto University, Japan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 12 figures

Abstract:Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves the risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and role in the reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.

[NLP-57] Where Rollouts Begin: Low-Load High-Leverage First-Token Diversification for RLVR

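[Quick read]: This paper addresses insufficient rollout diversity in Reinforcement Learning with Verifiable Rewards (RLVR) for training reasoning models: existing methods struggle to explore sufficiently different reasoning paths, which limits policy optimization. The key idea is to identify the first token after the reasoning marker as a high-leverage position for increasing diversity. The policy's distribution at this position is sharply peaked yet decoupled from correctness, so sampling uniformly from its top-N candidates broadens the regions a rollout group covers without altering the correctness signal. The authors propose REFT (Rollout Exploration with First-Token Diversification), a lightweight addition that leaves every other RLVR component unchanged and improves Pass@1, Pass@8, and Pass@64 across models of different sizes (0.5B-7B) and difficulty regimes.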

Link: https://arxiv.org/abs/2605.28295
Authors: Soeun Kim, Albert No
Affiliations: Yonsei University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy’s first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy’s own top-N candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
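
The first-token step described in the abstract can be sketched as below. This is a hypothetical illustration, not the REFT code: `allocate_first_tokens`, the toy probabilities, and the round-robin assignment (one simple way to spread a rollout group evenly over the top-N candidates) are all assumptions.

```python
def allocate_first_tokens(first_token_probs, top_n, group_size):
    """Assign a forced first token to each rollout in a group.

    first_token_probs: dict mapping candidate token -> policy probability
    at the first position after the reasoning marker.
    Returns one token per rollout, cycling over the top-N candidates so
    each candidate receives an (almost) equal share of rollouts,
    regardless of how peaked the policy distribution is.
    """
    top = sorted(first_token_probs, key=first_token_probs.get, reverse=True)[:top_n]
    return [top[k % len(top)] for k in range(group_size)]


# Made-up, sharply peaked first-token distribution.
probs = {"The": 0.90, "Let": 0.05, "We": 0.03, "First": 0.02}
print(allocate_first_tokens(probs, top_n=2, group_size=4))
# -> ['The', 'Let', 'The', 'Let']
```

Each rollout would then be continued by ordinary sampling from the policy after its assigned first token, which matches the abstract's statement that every other component stays unchanged.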

[NLP-58] CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

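[Quick read]: This paper addresses two shortcomings of implicit Chain-of-Thought (CoT) methods: they lack alignment with explicit rationales and cannot adapt to the complexity of each example. The key to the solution is CIRF (Chain-of-thoughts Into Reusable Functional units), a framework that models reasoning as a dynamic sequence of discrete functional tokens, each corresponding to a semantically coherent reasoning unit in explicit CoT traces. The model is fine-tuned to autoregressively generate these functional tokens and their optional results before emitting the final answer, which aligns latent reasoning with a sequence of functional units and supports parallel training, explicit rationale alignment, and adaptive reasoning. Experiments on mathematical, symbolic, and commonsense reasoning benchmarks show a more favorable accuracy-latency trade-off than state-of-the-art implicit CoT methods, and further analyses indicate that the learned functional tokens are distinct and interpretable and yield consistent performance gains.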

Link: https://arxiv.org/abs/2605.28292
Authors: Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park
Affiliations: Boston University; Hankuk University of Foreign Studies
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 7 figures

Abstract:Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.

[NLP-59] PrunePath: Towards Highly Structured Sparse Language Models

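[Quick read]: Feed-forward networks (FFNs) account for most of the parameters and computation in modern language models, yet existing pruning methods struggle to turn sparsity into hardware-friendly inference gains. The key to the solution is PrunePath, a budget-adaptive structured sparsification framework. Built on MoEfication, it replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates the important experts under a cumulative-mass threshold. This imposes a token-level probability budget, giving adaptive expert counts at inference time and a sparsity knob that works from a single checkpoint. The authors also implement Triton kernels for KV-cache decoding to turn the structured sparsity into practical memory savings and measurable decoding speedups, supporting PrunePath as a route to highly sparse, deployment-friendly large language models.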

Link: https://arxiv.org/abs/2605.28283
Authors: Zhexuan Gu, Zixun Fu, Yancheng Yuan
Affiliations: The Hong Kong Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity–performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
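
The cumulative-mass rule described in the abstract resembles top-p selection applied to experts. The sketch below is an assumption-laden illustration, not PrunePath's code: `select_experts`, the toy router logits, and the exact stopping rule (activate experts in descending probability until the accumulated mass reaches the threshold) are hypothetical.

```python
import math


def select_experts(router_logits, mass_threshold):
    """Pick experts for one token under a cumulative probability budget.

    Softmax-normalizes the router logits, then activates experts in
    descending probability until their total mass reaches
    `mass_threshold`. Returns the indices of the activated experts.
    """
    m = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold:
            break
    return chosen


logits = [2.0, 0.0, 1.0, -1.0]  # made-up router scores for four experts
print(select_experts(logits, 0.5))  # -> [0]
print(select_experts(logits, 0.8))  # -> [0, 2]
```

Raising or lowering `mass_threshold` at inference time changes how many experts each token activates without retraining, which is the kind of single-checkpoint sparsity knob the abstract describes; tokens with a peaked routing distribution use fewer experts than tokens with a flat one.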

[NLP-60] When Seekers Are Hard to Help: Evaluating Emotional Support Dialogue Systems in Worst-Case Interactions

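[Quick read]: This paper targets a flaw in how Emotional Support Dialogue Systems (ESDSes) are evaluated: evaluations mostly rely on cooperative seekers simulated by large language models (LLMs), who behave ideally, express emotions clearly, and are easy to interact with, so results are overly optimistic and do not show how systems cope with difficult situations. The key to the solution is a new evaluation framework based on worst-case interactions, consisting of an LLM-based difficult-seeker simulator and four worst-case-oriented metrics (Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support). Through an expert simulation study and large-scale system testing, the authors find that nearly all existing ESDSes suffer substantial performance drops under worst-case interactions; the framework exposes these weaknesses and can also generate useful training data that improves the robustness of smaller models.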

Link: https://arxiv.org/abs/2605.28228
Authors: Jiajie Yang, Yangchun Li, Guanyi Chen, Rui Fan, Xin Bai, Tingting He
Affiliations: Central China Normal University; Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning; National Language Resources Monitoring and Research Center for Network Media
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Emotional Support Dialogue Systems (ESDSes) are increasingly evaluated and trained with LLM-simulated seekers. However, such simulated seekers often behave as cooperative, average-case users who disclose clearly, respond constructively, and accept support within a few turns. This can lead to overly optimistic evaluation and obscure whether ESDSes can handle difficult help-seeking interactions. In this work, we study ESDS evaluation under worst-case interactions, where seekers are hard to help due to low engagement, resistance, limited self-disclosure, emotional volatility, or rigid negative interpretations. We first conduct an expert simulation study with eight experienced counselling professionals, who simulate difficult seekers, interact with existing Chinese ESDSes, provide scale ratings, and participate in semi-structured interviews. Based on this study, we derive worst-case seeker behaviours and identify key limitations of current systems. We then propose a worst-case evaluation framework consisting of an LLM-based worst-case seeker simulator and four worst-case-oriented metrics: Deep Emotional Understanding, Guided Exploration, Balanced Emotional Support, and Authentic and Grounded Support. Evaluating 17 systems, we find that nearly all models suffer substantial performance drops under worst-case interactions. Large general-purpose LLMs are generally more robust than specialised ESDSes, but even the strongest models struggle to sustain engagement and improve seekers’ emotional states. Finally, we show that worst-case simulation can also generate useful training data, improving the robustness of smaller models.

[NLP-61] Why We Need Speech to Evaluate Speech Translation

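[Quick read]: Speech translation models can now preserve speech-specific information (such as speaker gender, prosody, and emphasis), but existing evaluation metrics cannot measure these properties. The paper investigates quality estimation models aimed at speech-specific properties, namely (1) SpeechCOMET, a family of models with speech encoders, and (2) a state-of-the-art SpeechLLM used as a judge. Both match or exceed text-based COMET on standard quality estimation, yet neither consistently assesses speech-specific phenomena, for three reasons: current encoders do not reliably preserve speech-specific features, models tend to ignore the speech source signal, and the training data contains too few relevant examples. The paper argues that progress requires dedicated speech-specific training data and models that genuinely condition on speech.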

Link: https://arxiv.org/abs/2605.28227
Authors: Maike Züfle, Danni Liu, Vilém Zouhar, Jan Niehues
Affiliations: Karlsruhe Institute of Technology; ETH Zurich
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Speech translation models are increasingly capable of preserving speech-specific information (e.g., speaker gender, prosody, and emphasis), yet evaluation metrics remain blind to such phenomena. We meta-evaluate both text- and speech-based quality estimation metrics on two contrastive datasets targeting gender agreement and prosody, and find that both fall short, even when given direct access to the speech signal. We then train SpeechCOMET, a family of quality estimation models with speech encoders, and evaluate a state-of-the-art SpeechLLM as a judge. Both match or exceed text-based COMET on standard quality estimation, but neither consistently assesses speech-specific phenomena. We identify three causes: (1) speech-specific features are not reliably preserved in current encoders, (2) models tend to ignore the speech source signal, and (3) quality estimation training data contains too few relevant examples. We release all models and code, and argue that progress requires dedicated speech-specific training data and models that genuinely condition on speech.

[NLP-62] Supervised Semantic Differential for Cross-Cultural Concept Analysis: A Case Study of Human Affect

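[Quick read]: This paper addresses the problem that, in cross-cultural comparison of psychological meaning, semantic dimensions may be organized differently across languages, which word-level translation methods cannot capture. The key to the solution is a cross-lingual extension of the Supervised Semantic Differential (SSD): it estimates supervised semantic gradients in aligned multilingual word-embedding spaces, tests gradient alignment and difference with permutation tests and bootstrap confidence intervals, and interprets residual differences by clustering around the difference gradient, thereby separating semantic patterns shared across languages from culture-specific structural differences. The method is demonstrated on Polish, English, and French affective norm lexicons. The affective dimensions (Valence, Arousal, Dominance) were significantly recoverable across languages, with systematic differences: Valence was mostly shared, whereas Arousal and Dominance showed contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Some clusters reflected corpus-specific artifacts, so the results call for cautious interpretation.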

Link: https://arxiv.org/abs/2605.28225
Authors: Jan Sikora, Paweł Lenartowicz, Hubert Plisiecki
Affiliations: University of Warsaw; Society for Open Science; Centre for Brain Research, Jagiellonian University; IDEAS Research Institute
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, excluding the appendices. Code to reproduce our results is available at this https URL

Abstract:Cross-cultural comparison of psychological meaning requires methods that go beyond word-level translation and examine how semantic dimensions are organized across languages. We introduce a cross-lingual extension of the Supervised Semantic Differential (SSD), which estimates supervised semantic gradients in embedding space and compares them across aligned multilingual word embeddings. The method tests gradient alignment and difference using permutation procedures and bootstrap intervals, and interprets residual differences through clustering around the difference gradient. We demonstrate the approach on Polish, English, and French affective norm lexicons, modeling Valence, Arousal, and Dominance where available. Affective dimensions were significantly recoverable across languages and model settings. Cross-lingual comparisons showed broad alignment together with structured residual differences: Valence appeared mostly shared, whereas Arousal and Dominance produced more interpretable contrasts involving bodily threat, aesthetic stimulation, internal emotionality, macro-level authority, and everyday control. Several clusters also reflected corpus-specific artifacts, underscoring the need for cautious interpretation. Cross-lingual SSD offers an explainable framework for testing semantic alignment, identifying divergence, and generating hypotheses about cross-cultural differences in psychological meaning.

[NLP-63] IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

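[Quick read]: Machine translation in practice requires not only semantic equivalence but also adherence to several kinds of constraints (JSON/HTML structure, glossaries, context-based disambiguation, and register matching). Conventional metrics such as BLEU and xCOMET do not measure constraint adherence, and existing instruction-following benchmarks ignore the cross-lingual nature of translation. The key to the solution is IFMTBench, a benchmark for multilingual translation instruction following that covers seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns. It combines deterministic checkers with a rubric-based LLM judge under a multiplicative rule that resists reward hacking, giving a fuller picture of model behavior under complex constraints.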

Link: https://arxiv.org/abs/2605.28218
Authors: Mingrui Sun, Mao Zheng, Zheng Li, Mingyang Song
Affiliations: Tencent
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 6 figures, conference

Abstract:Modern translation workflows demand more than semantic equivalence. Users routinely require models to preserve JSON or HTML schemas, honor curated glossaries, disambiguate with provided context, and match prescribed registers, often several at once. Conventional metrics such as BLEU and xCOMET capture semantic fidelity but provide little signal on constraint adherence, while general instruction following benchmarks ignore the cross-lingual nature of translation. We introduce IFMTBench, a benchmark for multilingual translation instruction following covering seven languages, with 4,506 single-constraint and 2,838 multi-constraint items spanning six constraint dimensions and five compositional patterns with instructions issued in all seven languages. Constraints are split into a gating subset verified by deterministic checkers and a continuous subset scored by a rubric-based LLM judge, combined under a multiplicative rule that resists reward hacking. Evaluating 15 models reveals systematic gaps that prior protocols miss: Instruction following scales with size more sharply than translation quality, glossary and structured-format constraints dominate the difficulty gradient, and general instruction following rankings correlate only weakly with translation behavior. Our benchmark is available at this https URL.
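
One way to read the multiplicative rule is that the deterministic gating checks act as a 0/1 factor on the judge's continuous score, so a fluent output that breaks a hard constraint scores zero. The sketch below rests on that assumption; the checker functions, `combined_score`, and the example outputs are hypothetical and are not taken from the benchmark's code.

```python
import json


def is_valid_json(text):
    """Gating check: the translated output must still parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def glossary_respected(text, required_terms):
    """Gating check: every required glossary term appears in the output."""
    return all(term in text for term in required_terms)


def combined_score(gate_results, judge_score):
    """Multiply the continuous judge score by a 0/1 gate.

    Assumption: the gate is 1 only if all deterministic checks pass.
    """
    gate = 1.0 if all(gate_results) else 0.0
    return gate * judge_score


good = '{"greeting": "Bonjour le monde"}'
broken = '{"greeting": "Bonjour le monde"'  # missing closing brace

print(combined_score([is_valid_json(good), glossary_respected(good, ["Bonjour"])], 0.8))
print(combined_score([is_valid_json(broken), glossary_respected(broken, ["Bonjour"])], 0.95))
```

Compared with averaging the two signals, multiplication means a high judge score cannot compensate for a failed structural check, which is a plausible reason the abstract describes the rule as resistant to reward hacking.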

[NLP-64] When Helpful Context Leaks: Privacy Risks in Domain-Adapted ASR

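[Quick read]: When SpeechLLMs are customized for a professional domain through prompting or fine-tuning, there is an overlooked privacy risk: a model adapted to recognize domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data even when a different word is spoken, thereby leaking sensitive information. The key to the study is a systematic measurement of leakage rates under both customization mechanisms, together with a prompt-level mitigation strategy. The authors find that fine-tuning without context prompts offers the best balance between accuracy and leakage risk.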

Link: https://arxiv.org/abs/2605.28211
Authors: Maike Züfle, Jan Niehues
Affiliations: Karlsruhe Institute of Technology, Germany
Categories: Computation and Language (cs.CL)
Comments:

Abstract:SpeechLLMs are increasingly deployed in professional settings where domain customisation is standard practice: users supply context in prompts with sensitive information, fine-tune on proprietary recordings, or both. We identify and systematically investigate an overlooked privacy risk of such customisation: a model adapted to recognise domain-specific terminology can be nudged into transcribing a phonetically similar word from its context or training data, even when a different word is spoken, thereby leaking private information. To evaluate this risk, we construct a controlled dataset and measure leakage rates across two customisation mechanisms, prompting and fine-tuning. Both mechanisms cause measurable leakage, compounding when combined. We evaluate a prompt-level mitigation strategy and analyse the accuracy-leakage trade-off across customisation approaches, finding that fine-tuning without context prompts offers the best balance. We release our code and dataset publicly.

[NLP-65] Pruning and Distilling Mixture-of-Experts into Dense Language Models

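[Quick read]: Mixture-of-Experts (MoE) language models perform well, but all expert parameters must stay resident in memory, which limits deployment in memory-constrained environments; existing compression methods only reduce the number of experts and keep this fundamental limitation of the MoE structure. The key to the solution is the first systematic framework for converting a trained MoE model into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense feed-forward network (FFN) and refined by knowledge distillation from the original MoE teacher. Experiments show that the choice of scoring method has the largest impact, and the authors' diversity-aware scoring outperforms prior methods on several base models. At matched parameter count, the approach outperforms dense-to-dense pruning by 6.3 percentage points in average downstream accuracy, with 1.6x faster training.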

Link: https://arxiv.org/abs/2605.28207
Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho
Affiliations: KRAFTON; KAIST
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
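
The concatenation step relies on a simple identity: stacking the input projections of several expert FFNs and placing their output projections side by side gives one wider dense FFN whose output equals the unweighted sum of the experts' outputs. The sketch below demonstrates only that identity with a plain ReLU FFN and made-up weights; the helper names are hypothetical, and the paper's scoring, grouping, magnitude scaling, and distillation are not modeled.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def expert_ffn(w_in, w_out, x):
    """One expert: w_in is (hidden x d_model), w_out is (d_model x hidden)."""
    return matvec(w_out, relu(matvec(w_in, x)))


def concat_experts(experts):
    """Merge experts into one dense FFN with a wider hidden layer."""
    # Stack input projections row-wise: hidden sizes add up.
    w_in = [row for e_in, _ in experts for row in e_in]
    # Concatenate output projections column-wise, one row per model dimension.
    d_model = len(experts[0][1])
    w_out = [[val for _, e_out in experts for val in e_out[r]] for r in range(d_model)]
    return w_in, w_out


# Two toy experts with d_model = 2 and hidden size 2 each (made-up weights).
e1 = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
e2 = ([[0.0, -1.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
x = [1.0, -2.0]

summed = [a + b for a, b in zip(expert_ffn(*e1, x), expert_ffn(*e2, x))]
dense_in, dense_out = concat_experts([e1, e2])
print(summed, expert_ffn(dense_in, dense_out, x))
```

Because the router's per-token gate weights are absent from the dense network, the merged FFN does not reproduce the MoE layer exactly, which is consistent with the abstract's use of magnitude scaling and knowledge distillation to refine the result.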

[NLP-66] The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

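[Quick read]: Current evaluation of embedding models characterizes robustness in a static, one-dimensional way. Benchmarks such as MTEB report a single aggregate score and ignore the fact that models respond differently to different kinds of input variation, which hides potential failures. The key to the solution is a dynamic evaluation framework, the Harder Text Embedding Benchmark (HTEB), which uses a large language model (LLM) to stochastically transform inputs at evaluation time and tests robustness along three practically interpretable axes (Lexical/Stylistic, Length, and Language). Experiments show that models have specific, partly decoupled robustness profiles across the axes, that scale raises absolute scores but does not close the gap between original and transformed evaluations, and that English datasets are more sensitive to HTEB transformations than multilingual ones. This indicates that HTEB exposes strengths and weaknesses along deployment-relevant axes and supports multidimensional, dynamic robustness evaluation.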

Link: https://arxiv.org/abs/2605.28190
Authors: Manuel Frank, Haithem Afli
Affiliations: Munster Technological University
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 11 figures

Abstract:Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

[NLP-67] Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment

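[Quick read]: This paper addresses inconsistent decisions by Large Language Models (LLMs) in high-stakes settings when the facts are unchanged but the framing differs. The key to the solution is Valign, a representation-level intervention that anchors decisions to a stable value prior, steers hidden states toward the model's value-consistent direction, and projects out directions sensitive to temporal slice and narrative vividness from the hidden states. This reduces framing-induced decision flips and improves decision stability under factually equivalent inputs.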

Link: https://arxiv.org/abs/2605.28188
Authors: Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee
Affiliations: Chung-Ang University
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 7 figures, 31 tables

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model’s value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model’s hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
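
Two of the operations named in the abstract, steering a hidden state along a direction and projecting a direction out of it, have standard linear-algebra forms. The sketch below shows those generic forms on toy vectors; it is not Valign's implementation, and how the paper obtains the value-consistent and framing-sensitive directions is not covered here. The function names and vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(h, d):
    """Remove the component of hidden state h along direction d.

    Computes h - (h.d / d.d) * d, so the result is orthogonal to d.
    """
    coef = dot(h, d) / dot(d, d)
    return [x - coef * y for x, y in zip(h, d)]


def steer(h, v, alpha):
    """Shift hidden state h by alpha along direction v."""
    return [x + alpha * y for x, y in zip(h, v)]


h = [3.0, 4.0, 0.0]                # toy hidden state
framing_dir = [1.0, 0.0, 0.0]      # hypothetical framing-sensitive direction
value_dir = [0.0, 0.0, 1.0]        # hypothetical value-consistent direction

cleaned = project_out(h, framing_dir)
print(cleaned)                      # -> [0.0, 4.0, 0.0]
print(steer(cleaned, value_dir, 0.5))
```

After projection, any change to the input that only moves the hidden state along `framing_dir` no longer affects the downstream computation, which is the intuition behind removing framing-sensitive directions.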

[NLP-68] BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

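[Quick read]: This paper addresses the evaluation of Large Language Models (LLMs) on subsumption-based legal reasoning in German law. Prior work lacks a dedicated benchmark for German legal text, and methods for assessing the quality of LLM-generated answers are not systematic. The key to the solution is the BenGER (Benchmark for German Law) dataset, which contains 596 exam-style free-text legal case tasks and 531 short doctrinal reasoning tasks, together with a rubric-aligned LLM-as-a-Judge framework that is cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). The results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r = 0.96 in both cases), that closed flagship models lead on all corpora, and that human–AI co-creation substantially outperforms unaided human work.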

Link: https://arxiv.org/abs/2605.28183
Authors: Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair
Affiliations: Technical University of Munich (TUM); Ludwig Maximilian University of Munich (LMU); University of Konstanz; University of Saarbrücken
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Pre-Print

Abstract:We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems – closed flagship, efficiency-oriented, and open-weight – across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human–AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human–AI co-creation substantially outperforms unaided human work.

[NLP-69] When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models

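[Quick read]: In fully non-autoregressive (fully non-AR) text generation, confidence-based position selection can be misled by positions that receive high confidence, leading to incomplete generation or to premature decoding of anchor-adjacent tokens. The key to the solution is Suffix-Anchored Confidence Modulation, a training-free method with two parts: it inserts a short suffix anchor to encourage response completion, and it modulates confidence near the anchor according to decoding progress to suppress premature decoding of anchor-adjacent tokens. The method keeps the parallelism of fully non-AR generation while improving confidence-based decoding, and outperforms explicit EOT suppression on text-only reasoning, vision-language reasoning, and code-generation benchmarks.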

Link: https://arxiv.org/abs/2605.28181
Authors: Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee
Affiliations: Seoul National University; Daegu Gyeongbuk Institute of Science and Technology
Categories: Computation and Language (cs.CL)
Comments: Preprint

Abstract:Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.

[NLP-70] SuperValid: Capability-Aligned OOD Validation for Generalizable Downstream Scaling

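[Quick read]: This paper addresses the limited generalization of scaling laws for large language model (LLM) training when they are used to predict downstream task performance. Prior approaches either focus on benchmark-level performance, which introduces scenario-specific noise, or rely on independent and identically distributed (IID) validation loss, which fails to track capability improvements when the training distribution changes. The key to the solution is SuperValid, a capability-level framework that distills core concepts from the benchmarks within a capability domain and expands them into diverse, knowledge-rich out-of-distribution (OOD) validation data, producing a validation loss that correlates strongly and stably with downstream performance. Experiments show that this correlation holds across models of different architectures, scales, and training data distributions, and that the loss can be computed during training without benchmark evaluation, supporting model selection, early stopping, and scaling decisions.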

Link: https://arxiv.org/abs/2605.28179
Authors: Quanen Sun, Changxin Tian, Ke Shi, Cai Chen, Cunyin Peng, Jia Liu, Kunlong Chen, Zhiqiang Zhang
Affiliations: Ant Group
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Scaling laws guide large language model training by relating compute to cross-entropy loss, and recent work further extends them to predict downstream benchmark performance. However, prior approaches face generalization limitations from two aspects: focusing on benchmark-level performance introduces scenario-specific artifacts, while relying on IID validation loss fails to track capability improvements when training distributions vary. In this work, we argue that downstream scaling should be studied at the capability level, which captures shared skill factors across related tasks while abstracting away benchmark-specific noise. We propose SuperValid, a framework that synthesizes OOD (out-of-distribution), capability-aligned validation data by distilling core concepts from benchmarks within a capability domain and expanding them into diverse, knowledge-rich texts. Extensive experiments spanning 17 benchmarks grouped into 6 capability domains show that SuperValid loss exhibits strong and stable correlation with downstream performance across models of different architectures, scales, and training data distributions. As a training-free metric computable during training without benchmark evaluation, SuperValid enables effective model selection, early stopping, and scaling decisions.

[NLP-71] DEPART: DEcomposing PARiTy across Multilingual LLMs

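Quick read: This paper addresses the fact that evaluations of multilingual large language models (mLLMs) exhibit systematic disparities that go unexplained: existing leaderboards report only per-language accuracy without revealing the root causes of performance gaps, leaving practitioners with no effective way to improve. The key to the solution is a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, it analyzes language features (script, family, typological distance) and the model's internal representational similarity to English, finding that these factors explain 79% of cross-language variance on understanding tasks and 92% on reasoning tasks. Second, it decomposes the full model × benchmark × language cube, revealing that natural language understanding (NLU) and reasoning have fundamentally different variance structures: the former is dominated by model identity (66.7%), the latter by the benchmark × model interaction (46.3%). This recasts multilingual evaluation from passive performance mapping into a diagnosable, actionable framework, offering a clear path for identifying and mitigating disparities across languages.

Link: https://arxiv.org/abs/2605.28163
Authors: Manan Uppadhyay,Prashant Kodali,Pranjal Chitale,Reshma Ramaprasad,Himanshu Beniwal,Sunayana Sitaram
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: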

Abstract:Multilingual Large Language Model (mLLM) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal–Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and 92% on reasoning, with a model’s internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model $\times$ benchmark $\times$ language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding (66.7% of variance), whereas the benchmark $\times$ model interaction dominates reasoning (46.3%). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

[NLP-72] Self-Consistency via Marginal Sharpening

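Quick read: This paper addresses a mis-specified target in current inference-time sampling methods for eliciting reasoning from language models. Existing power-sampling methods improve performance by sharpening the distribution over full generated outputs, but this entangles the reasoning trace with the final answer and fails to capture whether an answer is supported by many plausible reasoning paths. The key to the solution is to shift the target from the full-output distribution to the answer marginal, focusing on the plausibility of the answer itself rather than the probability of the entire generated sequence. The authors propose an efficient approximate algorithm that relies only on autoregressive parallel sampling to approximate the sharpened answer marginal directly, outperforming standard power sampling on mathematics and coding benchmarks while running orders of magnitude faster. This makes self-consistency an objective that can be targeted at inference time rather than a post-hoc voting mechanism.

Link: https://arxiv.org/abs/2605.28142
Authors: Aleksei Arzhantsev,Otmane Sakhi,Nicolas Chopin
Affiliations: Criteo AI Lab; ENSAE, Institut Polytechnique de Paris
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: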

Abstract:Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
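To make the notion of a "sharpened answer marginal" concrete, here is a minimal Python sketch. It is not the paper's sampling algorithm, which the abstract does not specify; it only illustrates the target quantity, assuming we already have final answers extracted from independently sampled reasoning traces. The function name and the exponent `alpha` are illustrative choices.

```python
from collections import Counter


def sharpened_answer_marginal(answers, alpha=2.0):
    """Estimate p(answer) from sampled final answers, then sharpen it as p**alpha / Z.

    `answers` holds one final answer per sampled reasoning trace (hypothetical input).
    """
    counts = Counter(answers)
    total = sum(counts.values())
    # Empirical answer marginal: fraction of traces that end in each answer.
    marginal = {a: c / total for a, c in counts.items()}
    # Power-sharpen the marginal and renormalise so it sums to one.
    weights = {a: p ** alpha for a, p in marginal.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


samples = ["42", "42", "41", "42", "17", "41"]
dist = sharpened_answer_marginal(samples, alpha=2.0)
best = max(dist, key=dist.get)
```

With `alpha = 1` this is the plain empirical marginal; as `alpha` grows, the mass concentrates on the most frequent answer, which is the limit in which the target coincides with majority voting.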

[NLP-73] Better heads do not guarantee better binarized constituency parsing

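Quick read: This paper asks whether using dependency-induced headedness as a binarization control signal improves constituency parsing, particularly under punctuation-sensitive evaluation metrics. The key to the study is a comparison of learned headedness against rule-based headedness during binarization. The results show that although learned headedness outperforms the rule-based approach on intrinsic head prediction, it does not yield consistent parsing gains after debinarization, and it performs worse on macro-averaged punctuation-sensitive F₁. This indicates that linguistically grounded headedness is not necessarily the optimal control signal for a parser.

Link: https://arxiv.org/abs/2605.28131
Authors: Zeyao Qi,Yige Chen,Eitan Klinger,Vivaan Wadhwa,Jungyeul Park
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: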

Abstract:We revisit punctuation-aware tree binarization for constituency parsing and ask whether dependency-induced headedness improves binary parser supervision. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, punctuation-conditioned evaluation shows that learned headedness underperforms rule-based binarization in macro-average punctuation-sensitive $F_1$, despite a small overall gain on CTB. Similar instability appears under cross-treebank transfer. These results suggest that linguistically grounded headedness is not necessarily parser-optimal when used as a binarization control signal. The paper presents a negative result: better head prediction does not imply better punctuation-sensitive constituency parsing.

[NLP-74] Chinese Word Boundary Recovery through Character Alignment Projection

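Quick read: This paper addresses the fragility of Chinese word segmentation on non-standard text, especially language-learner input, where character-level divergences break word boundaries and conventional segmenters struggle to identify them accurately. The key to the solution is to model word boundary recovery as an alignment-based projection task: the noisy source sentence is first aligned with a cleaner target sentence at the character level, and the correct target-side word boundaries are then projected back onto the source, correcting over-segmentation errors on the source side. Beyond improving segmentation robustness, the paper introduces two evaluation resources (a manually checked learner corpus based on MuCGEC and a synthetic benchmark derived from the Chinese Penn Treebank), uses them to show that word boundary recovery is distinct from ordinary segmentation, and shows that alignment projection is a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.

Link: https://arxiv.org/abs/2605.28128
Authors: Lusha Wang,Yuchen Li,Su Yuan,Jungyeul Park
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: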

Abstract:Chinese word segmentation is especially fragile in non-standard text, where language learner errors and other character-level divergences disrupt the word boundaries assumed by downstream annotation and evaluation. This paper formulates Chinese word boundary recovery as an alignment-based projection task. Given a noisy source sentence and a cleaner target counterpart, we first align the two strings at the character level and then project target-side word boundaries back onto the source. Beyond the recovery method itself, we introduce two evaluation resources: a manually checked learner Chinese benchmark based on MuCGEC and a controlled synthetic benchmark derived from the Chinese Penn Treebank. Experiments show that direct segmentation remains vulnerable to compound fragmentation in learner input, whereas the proposed two step projection method corrects many over-segmentation errors by using the corrected target to recover source-side word spans. The results show that word boundary recovery is distinct from ordinary segmentation and that alignment projection provides a principled mechanism for stabilizing Chinese annotation and evaluation under noisy input.
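The align-then-project idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library `difflib.SequenceMatcher` stands in for whatever character aligner the authors use, the function name `project_boundaries` is hypothetical, and the example sentence is made up rather than taken from the paper's benchmarks.

```python
from difflib import SequenceMatcher


def project_boundaries(source, target_words):
    """Project word boundaries from a segmented target onto a noisy source string."""
    target = "".join(target_words)
    # Step 1: character-level alignment. Map aligned target indices to source indices.
    t2s = {}
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Treat equal spans and same-length substitutions as one-to-one alignments.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            for k in range(j2 - j1):
                t2s[j1 + k] = i1 + k
    # Step 2: project the end of every target word onto the source.
    cuts = {len(source)}
    pos = 0
    for word in target_words:
        pos += len(word)
        last = pos - 1  # index of the word's final character in the target
        if last in t2s:
            cuts.add(t2s[last] + 1)
    # Cut the source string at the projected boundaries.
    words, start = [], 0
    for cut in sorted(cuts):
        if cut > start:
            words.append(source[start:cut])
            start = cut
    return words


# Learner wrote 平果 for 苹果; the corrected target is segmented as 我 / 喜欢 / 吃 / 苹果.
recovered = project_boundaries("我喜欢吃平果", ["我", "喜欢", "吃", "苹果"])
```

Here the misspelled compound stays a single source-side word because its boundary comes from the corrected target rather than from segmenting the noisy string directly. Insertions and deletions that leave a target word-final character unaligned simply contribute no boundary in this sketch.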

[NLP-75] Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models

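Quick read: This paper addresses hallucination in large vision-language models (LVLMs), and specifically the problem that prompt-based verification has unclear effects in practice and may even introduce new errors. The key to the solution is revealing the risk-dependent behavior of verification prompts: they effectively correct errors on hard inputs, but on easy inputs they bring limited benefit and can introduce new errors. This stems from a conservative output shift: the prompts move attention from visual tokens to instruction tokens and produce a distinct middle-layer entropy pattern, which points to instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Based on this finding, the authors propose Risk-aware Selective Prompting (RSP), a training-free method that uses pre-generation uncertainty signals to trigger verification prompts selectively, mitigating the degradation caused by always-on verification while preserving baseline performance; the most effective selection signal differs across model architectures.

Link: https://arxiv.org/abs/2605.28123
Authors: Yuang Huang,Yafeng Zhang,Yu Zilan
Affiliations: Shanghai Jiao Tong University; iFLYTEK; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: 7 pages, 1 figure, submitted to ACL ARR 2026 May (EMNLP)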

Abstract:Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit – and can harm – easier ones. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.

[NLP-76] SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

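Quick read: This paper addresses overeager behavior in coding agents: while executing a benign task that completes successfully, an agent may quietly take actions outside the authorized scope (such as deleting files or leaking credentials), and existing benchmarks fail to detect it. Current evaluations have three shortcomings: task-completion benchmarks credit every finished run, jailbreak-style benchmarks probe only adversarial prompts, and the one prior overeager benchmark uses a fixed prompt set, leaving some agent-model pairs under-measured. The key to the solution is SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope fragments and trap fragments, scores runs with a judge-free oracle that flags trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each agent-model pair's run budget toward the scenarios most likely to trigger overeager behavior. On the OverEager benchmark, built over 24 overeager archetypes, 19.51% of 10,000 benign runs trigger overeager behavior, with rates across agent-model pairs spanning 11.9x. The agent framework matters far more than the model (accounting for 56% and 21% of the variation, respectively), so evaluations of a single framework or a single model systematically underestimate the overall risk.

Link: https://arxiv.org/abs/2605.28122
Authors: Yubin Qu,Yi Liu,Gelei Deng,Yanjun Zhang,Yuekang Li,Ying Zhang,Leo Yu Zhang
Affiliations: Griffith University; Quantstamp; Nanyang Technological University; UNSW Sydney; Wake Forest University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: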

Abstract:A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair’s run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model’s 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
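The budget-steering step can be illustrated with a generic Beta-Bernoulli Thompson sampler. This is a sketch under assumptions, not SNARE's implementation: scenarios are treated as bandit arms, the oracle verdict is simulated from made-up trigger probabilities, and the name `thompson_allocate` is hypothetical.

```python
import random


def thompson_allocate(trigger_probs, budget, seed=0):
    """Spend `budget` runs across scenarios, favouring those that trigger most often.

    `trigger_probs` are simulated ground-truth rates (illustrative only); in a real
    pipeline each outcome would come from executing the agent and scoring the run.
    """
    rng = random.Random(seed)
    n = len(trigger_probs)
    successes = [0] * n
    failures = [0] * n
    for _ in range(budget):
        # Draw a plausible trigger rate for each scenario from its Beta posterior.
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(n)]
        arm = max(range(n), key=draws.__getitem__)
        # Simulated run outcome standing in for the oracle's verdict.
        if rng.random() < trigger_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    # Number of runs spent on each scenario.
    return [s + f for s, f in zip(successes, failures)]


runs = thompson_allocate([0.05, 0.10, 0.60], budget=300)
```

Because each arm starts from a uniform Beta(1, 1) prior, early runs are spread across scenarios; as evidence accumulates, most of the budget flows to the scenario with the highest observed trigger rate.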

[NLP-77] MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

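Quick read: This paper addresses the lack of robustness of mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) against adversarial prompt injection: when an attacker embeds malicious text in user-generated content regions (such as chat messages or comments), the agent cannot tell trusted interface elements from malicious content and takes wrong actions. The key to the solution is MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a three-stage pipeline that generates realistic adversarial samples without modifying the agent, the application, or the operating system. A Localizer first identifies user-controllable regions; a Generator then synthesizes context-aware payloads and renders them in the application's native style; finally a Curator moderates realism and balances the sample distribution. Its core design choice is to separate the stages that control reach, realism, and distributional balance, which keeps injected screenshots visually indistinguishable from genuine content while still achieving attack success rates of 23%-30%. The experiments also show that visual-quality filtering alone cannot reliably defend against this attack.

Link: https://arxiv.org/abs/2605.28116
Authors: Ruoqi Guo,Yi Liu,Gelei Deng,Yiheng Xiong,Yuekang Li,Ying Zhang,Leo Yu Zhang,Lida Zhao,Ji Jie,Yuxiao Lu
Affiliations: Griffith University; Quantstamp; Nanyang Technological University; Singapore Management University; University of New South Wales; Wake Forest University; Independent Researcher
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: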

Abstract:Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application’s native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

[NLP-78] Ask Now, Use Later: Benchmarking the Proactivity Gap in Long-Lived LLM Agents

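Quick read: This paper addresses the proactivity gap in long-lived large language model (LLM) agents: because they do not proactively ask across sessions, agents never obtain users' unspoken preferences and therefore cannot act on them. The key to the solution is a new task formulation, Ask-to-Remember (ATR), in which the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session will. To quantify this ability, the authors build ATRBench, the first ATR benchmark to the best of their knowledge, which fixes each user's preferences as hidden ground truth so that success requires asking rather than recall. Experiments show that eight frontier LLM agents fall at least 62 points below an oracle that is given the relevant preference, and that prompting closes little of the gap. Diagnostics further identify preference acquisition as the bottleneck, making ATRBench a diagnostic testbed for improving long-term proactivity and personalization in agents.

Link: https://arxiv.org/abs/2605.28108
Authors: Bin Wu,Guanyun Zou,Bingbing Wang,Huan Zhao,Chuan Shi
Affiliations: Beijing University of Posts and Telecommunications; Nanjing University of Aeronautics and Astronautics; Harbin Institute of Technology (Shenzhen); Noumena AI
Categories: Computation and Language (cs.CL)
Comments: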

Abstract:A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user’s preferences and constraints across sessions, not just the current request. Yet today’s agents keep what a user volunteers but rarely ask for what stays unspoken, leaving a proactivity gap in long-lived LLM agents: an agent cannot act on a preference it never obtained. As users delegate more of their affairs to agents, the impact of this gap grows. We isolate one concrete, controllable slice of this gap as Ask-to-Remember (ATR): the agent decides whether to ask now for a reusable user preference that the current task does not need but a later session with the same user will. ATR is hard even to evaluate: the right question is underdetermined and its payoff deferred to tasks that may never arise. ATRBench, to the best of our knowledge the first ATR benchmark, makes it measurable by fixing each user’s preferences as hidden ground truth, so success demands asking, not recall. Across eight frontier LLM agents, defaults fall at least 62 points below an oracle handed the relevant preference, and prompting closes little of it. Diagnostics identify acquisition as the bottleneck. ATRBench surfaces this proactivity gap in current agents and offers a diagnostic testbed for closing it.

[NLP-79] ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

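Quick read: This paper addresses the insufficient performance of current retrieval-augmented generation (RAG) methods on multi-hop question answering (QA), in particular their difficulty integrating evidence across documents in complex reasoning scenarios. The key to the solution is ConRAG, a consensus-driven multi-view RAG framework that systematically optimizes both the query side and the corpus side and fuses three views of evidence (relation, entity, and text signals) to improve retrieval accuracy and reasoning. Experiments show that ConRAG clearly outperforms all baselines on three multi-hop QA benchmarks, for example with average gains of up to +26.9% over vanilla RAG, and enables Gemma-4-31B to set a new state of the art on the challenging MuSiQue benchmark.

Link: https://arxiv.org/abs/2605.28093
Authors: Yikai Zhu,Kunfeng Chen,Qihuang Zhong,Juhua Liu,Bo Du
Affiliations: Wuhan University
Categories: Computation and Language (cs.CL)
Comments: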

Abstract:Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.

[NLP-80] SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

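Quick read: This paper addresses the limited understanding of laughter as a complex social signal in real-world scenarios: prior work is mostly confined to isolated laughter analysis tasks and lacks systematic modeling of the intent behind laughter in multimodal contexts. The key to the solution is two components. The first is laughter-specific Self-Instruct, which automatically synthesizes diverse laughter-centric instructions to improve generalization across tasks and domains. The second is the Mixture-of-Laugh-Experts (MoLE) framework, whose expert routing mechanism dynamically selects specialized experts suited to each laughter-related task, improving task performance and efficiency. Experiments show that the combination substantially outperforms existing multimodal LLM baselines, advancing robust real-world laughter understanding.

Link: https://arxiv.org/abs/2605.28084
Authors: Lee Jung-Mok,Kim Sung-Bin,Joohyun Chang,Lee Hyun,Tae-Hyun Oh
Affiliations: KAIST; POSTECH
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: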

Abstract:Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: this https URL.

[NLP-81] ATLAS: All-round Testing of Long-context Abilities across Scales

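Quick read: This paper addresses two problems in the evaluation of long-context language models: performance collapse as context length grows is not fully exposed, and strong retrieval does not necessarily transfer to downstream tasks. The key to the solution is the ATLAS framework, built on three methodological principles: (i) a layered taxonomy that separates foundational operations from application workloads so that failures can be attributed; (ii) length-aware AUC scoring, which integrates score-length curves over a fixed 8K-1M grid and replaces single-point metrics with full degradation profiles; and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles and propagates uncertainty end to end through the nonlinear aggregation. The framework is instantiated across eight capability dimensions with nine auditable components and 6,438 instances and is used to evaluate 26 models. Rankings change substantially across lengths, which supports reporting long-context quality by capability and length rather than by a single headline score.

Link: https://arxiv.org/abs/2605.28079
Authors: Deli Huang,Cunguang Wang,Hongyin Tang,Zhe Tang,Linsen Guo,Dongyu Ru,Ruoshi Yuan,Ziyue Zhu,Xiaoyu Li,Ziwen Wang,Chen Zhang,Anchun Gui,Wen Zan,Jiaqi Zhang,Xuezhi Cao,Jingang Wang,Xunliang Cai,Yixin Cao
Affiliations: Meituan; Fudan University
Categories: Computation and Language (cs.CL)
Comments: 29 pages, 13 figures. Preprint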

Abstract:Long-context language models now advertise context windows up to millions of tokens, yet evaluations typically report a single length or a narrow task family, masking two failure modes: performance can collapse as length grows, and strong retrieval need not transfer to downstream use. We present ATLAS, a benchmarking framework that redefines long-context evaluation as length-dependent capability profiling. ATLAS contributes three methodological principles: (i) a layered taxonomy separating foundational operations from application workloads so failures can be attributed, (ii) length-aware AUC scoring that integrates score-length curves over a fixed 8K-1M grid, replacing single-point metrics with full degradation profiles, and (iii) ATLAScore, a harmonic-mean aggregate over taxonomy categories that penalizes imbalanced profiles, with end-to-end uncertainty propagation from subset scores through the nonlinear final aggregate. We instantiate the framework across eight capability dimensions with nine auditable components and 6,438 instances, and evaluate 26 models. Gemini-3.1-Pro-Preview leads at 128K, Claude-Opus-4.6 leads at 1M. Rankings reshuffle substantially between ATLAScore@8K-128K and ATLAScore@8K-1M: 7 models move by at least two ranks, and the two taxonomy layers share only 61% of cross-model variance, with individual rank gaps up to 12 positions. These results support reporting long-context quality by capability and length, not by a single headline score.
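A small Python sketch may help show how length-aware AUC scoring and a harmonic-mean aggregate interact. The details are assumptions, since the abstract does not give formulas: the curve is integrated with the trapezoidal rule on a log2(length) axis and normalised by the axis range, and the grid points and scores are invented for illustration.

```python
import math


def length_auc(lengths, scores):
    """Normalised trapezoidal area under a score curve on a log2(length) axis (assumed)."""
    xs = [math.log2(n) for n in lengths]
    area = sum(
        (xs[i + 1] - xs[i]) * (scores[i] + scores[i + 1]) / 2
        for i in range(len(xs) - 1)
    )
    return area / (xs[-1] - xs[0])


def harmonic_mean(values):
    """Harmonic mean of strictly positive scores; low outliers pull it down sharply."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical context-length grid and per-category accuracy curves.
grid = [8_000, 32_000, 128_000, 1_000_000]
retrieval = length_auc(grid, [0.95, 0.90, 0.80, 0.50])
reasoning = length_auc(grid, [0.70, 0.60, 0.40, 0.10])
overall = harmonic_mean([retrieval, reasoning])
```

The harmonic mean always lies between the weakest category score and the arithmetic mean, so a model that is strong on retrieval but weak on reasoning is ranked closer to its weak side, which is the imbalance penalty the abstract describes.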

[NLP-82] StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

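Quick read: This paper addresses the core problem in personalized story rewriting of adapting to reader preferences while preserving plot consistency and narrative coherence. Conventional style transfer methods only change surface-level style and neglect context-aware enrichment, which limits gains in reader satisfaction. The key to the solution is STORYLENSWRITER, a two-stage model that first applies supervised fine-tuning and then GRPO-based reinforcement learning for context-aware narrative enrichment, improving alignment with reader preferences. The work builds on the authors' large-scale benchmark STORYLENSBENCH and an accompanying evaluation framework, and experiments show that the model outperforms existing baselines on fidelity, coherence, and reader satisfaction.

Link: https://arxiv.org/abs/2605.28073
Authors: Hanwen Cui,Yuting Mei,Yuhang Fu,Dingyi Yang,Qin Jin
Affiliations: Beijing University of Posts and Telecommunications; AIM3 Lab, Renmin University of China; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, 15 tables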

Abstract:Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

[NLP-83] PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

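Quick read: This paper addresses the low computational efficiency and poor cross-architecture transferability of adaptation methods such as LoRA when large language models (LLMs) are used for text embedding; existing methods must be retrained from scratch, at high cost, whenever a new backbone appears. The key to the solution is PromptEmbedder, a dual-LLM framework that decouples embedding knowledge from specific backbone weights: a Prompting LLM generates task-specific soft prompts through a differentiable process, and these are fed to a frozen Embedding LLM, preserving full gradient flow during contrastive training. Because task-specific knowledge is localized in the Prompting LLM, adapting to a new architecture requires retraining only a lightweight linear alignment matrix. On the MTEB benchmark, performance is comparable to LoRA fine-tuning while GPU memory drops by 40% and training is 3.7x faster, establishing an efficient, architecture-agnostic paradigm for LLM-based representation learning.

Link: https://arxiv.org/abs/2605.28066
Authors: Yu-Che Tsai,Kuan-Yu Chen,Yuan-Hao Chen,Yu-Han Chang,Ching-Yu Tsai,Yu-Hsiang Chuang,Shou-De Lin
Affiliations: National Taiwan University; National Taiwan University AI Center of Research Excellence
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: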

Abstract:Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

[NLP-84] Challenges in Explaining Pretrained Clinical Text Classifiers

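Quick read: This paper addresses how to explain the predictions of neural models on long, unstructured medical texts in clinical natural language processing (clinical NLP). Existing post-hoc explanation methods such as LIME and SHAP have limitations when applied to clinical narratives and struggle to provide reliable, clinically meaningful explanations. The key contribution is identifying the shortcomings of current token-level and perturbation-based explanation techniques and arguing for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise, so as to improve the transparency and trustworthiness of model decisions.

Link: https://arxiv.org/abs/2605.28060
Authors: Kristian Miok,Matej Klemen,Blaz Škrlj,Marko Robnik Šikonja
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures. Accepted at the First Workshop on Responsible Healthcare using Machine Learning (RHCML 2025), co-located with ECML PKDD 2025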

Abstract:Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.

[NLP-85] Prompting Is All You Need: Multi-view Prompting Large Language Models for Aspect-Based Sentiment Analysis

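Quick read: This paper addresses two problems in aspect-based sentiment analysis (ABSA): although few-shot prompting greatly reduces the need for annotated data and beats zero-shot baselines, a performance gap to fine-tuned models remains, and the high computational cost of large language model (LLM) inference limits practical deployment. The key to the solution is LLM-based Multi-View Prompting (LLM-MvP), which brings the multi-view principle of considering multiple element orderings into LLM prompting and combines schema-constrained decoding with a context-free grammar and prefix batching, achieving performance competitive with or superior to fine-tuned models while substantially reducing computational overhead. Experiments on five benchmark datasets show that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering an efficient and practical solution for ABSA.

Link: https://arxiv.org/abs/2605.28058
Authors: Nils Constantin Hellwig,Niklas Donhauser,Jakob Fehle,Udo Kruschwitz,Christian Wolff
Affiliations: University of Regensburg
Categories: Computation and Language (cs.CL)
Comments: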

Abstract:Recent work explored the capabilities of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA) through few-shot prompting, requiring substantially fewer annotated examples while achieving notable improvements over zero-shot baselines. However, a performance gap remained compared to models fine-tuned on hundreds of examples, and the computational costs of LLM inference present practical barriers to deployment. We introduce LLM-based Multi-View Prompting (LLM-MvP), which adapts the multi-view principle of considering multiple element orderings to LLM prompting. By combining schema-constrained decoding with a context-free grammar and prefix batching, LLM-MvP achieves performance competitive or superior to fine-tuned approaches while substantially reducing computational overhead. Extensive experiments across five benchmark datasets demonstrate that LLM-MvP closes the gap between few-shot prompting and fine-tuned models, offering a practical and efficient solution for ABSA.

[NLP-86] Knowledge Dependency Estimation for Reliable Question Answering

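Quick read: This paper addresses the lack of interpretability of large language models (LLMs) in question answering (QA): how to identify accurately which knowledge units a prediction depends on, especially when the candidate knowledge sources (context, retrieval results, decomposition, intermediate reasoning) are noisy and redundant. The key to the solution is Knot, a structured, rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and produces rank-aware, fine-grained unit scores, so that redundant, substitutable, and complementary knowledge units can be told apart without exhaustive test-time perturbation. The resulting scores support reliability assessment of QA predictions and early risk screening.

Link: https://arxiv.org/abs/2605.28047
Authors: Chaodong Tong,Qi Zhang,Nannan Sun,Lei Jiang,Yanbing Liu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; China Industrial Control Systems Cyber Emergency Response Team
Categories: Computation and Language (cs.CL)
Comments: 12 tables, 9 figures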

Abstract:Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.

[NLP-87] MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

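Quick read: This paper targets three problems in existing agent memory systems: passive invocation, reasoning-retrieval decoupling, and a structural mismatch between retrieved fragments and the agent's navigational needs. The key to the solution is MemCog, a memory system built on the Memory-as-Cognition paradigm. It organizes knowledge in a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step, reasoning-driven traversal, and uses a Proactive Reasoning Protocol that lets the agent start memory exploration on its own from the conversational context. Experiments show that MemCog reaches state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) and substantially outperforms baselines on ProactiveMemBench, the first benchmark for evaluating proactive memory triggering.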

Link: https://arxiv.org/abs/2605.28046
Authors: Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que
Affiliations: WeChat, Tencent Inc., Beijing, China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent’s navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

[NLP-88] Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

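Quick read: This paper addresses the overparameterization of large language models (LLMs) for machine translation: state-of-the-art LLMs translate well, but they are trained as generalists for many tasks, which makes them large and costly in memory and compute. The key to the solution is to exploit the modularity of mixture-of-experts (MoE) models: identify experts irrelevant to translation and prune them without training. Specifically, the authors use expert specialization and the separability of multilingual capabilities to locate redundant experts. Without retraining, half of all experts can be pruned with negligible degradation and 70% with only minor losses; with a very short supervised fine-tuning (SFT) stage, 75% can be pruned while recovering baseline performance, and in some settings nearly 90% while keeping reasonable translation quality. Translation therefore needs only a fraction of the model's parameters.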

Link: https://arxiv.org/abs/2605.28042
Authors: Liu O. Martin, Lucas Bandarkar, Nanyun Peng
Affiliations: University of California, Los Angeles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.
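
The abstract does not state how translation-irrelevant experts are identified. As a minimal sketch of training-free expert pruning, the code below assumes one plausible proxy: count how often the router selects each expert on translation inputs and keep only the most-used fraction. The routing trace, the function name, and the frequency criterion are all illustrative assumptions rather than the paper's method.

```python
from collections import Counter

def select_experts_to_keep(routing_trace, num_experts, keep_fraction):
    """Rank experts by router selection count and keep the top fraction.

    routing_trace: expert ids chosen on translation inputs
    (hypothetical calibration data, not from the paper).
    """
    counts = Counter(routing_trace)
    num_keep = max(1, round(num_experts * keep_fraction))
    # Most-used experts first; ties broken by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_keep])

trace = [0, 3, 3, 5, 3, 0, 5, 5, 5, 1]
kept = select_experts_to_keep(trace, num_experts=8, keep_fraction=0.25)
```

Because each expert is a separate module, dropping the unselected ones only requires removing their weights and masking the router's outputs for them, which is why no gradient step is needed for the pruning itself.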

[NLP-89] Personality Role and Expressive Style in Large Language Models: An Interactionist Analysis

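Quick read: This paper addresses a prompt-expression mismatch in LLM dialogue agents: specifying Big Five personality traits (BFTs) in a prompt does not ensure that the generated utterances express the intended traits. The key to the solution is an interactionist perspective, which treats personality expression as a context-dependent outcome shaped jointly by three prompt factors: personality traits, dialogue role, and expressive style. Using a factorial design, the study generates 1,080 dialogues in each of English and Japanese and quantifies the expressed traits with an LLM-as-a-judge framework. The effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Effective personality control therefore has to account for the interplay between trait prompts and situational variables rather than rely on trait prompting alone.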

Link: https://arxiv.org/abs/2605.28037
Authors: Moe Nagao, Koichiro Terao, Mikio Nakano, Naoto Iwahashi
Affiliations: Okayama Prefectural University; AI Humans Lab; C4A Research Institute
Categories: Computation and Language (cs.CL)
Comments: 26 pages

Abstract:Prompt-based personality control is a key technique for designing large language model (LLM) dialogue agents that behave consistently across social contexts. However, specifying Big Five personality traits (BFTs) in a prompt does not ensure that the intended traits are expressed in generated utterances. This paper investigates this mismatch from an interactionist perspective, viewing personality expression as a context-dependent outcome shaped by the interplay between trait specification and situational factors. We analyze how perceived BFT expression in LLM-generated dialogue is influenced by three prompt factors: personality traits, dialogue roles, and expressive styles. Using a factorial design that combines six personality conditions, three roles, and three expressive-style conditions, we generate 1,080 LLM-agent dialogues in each of English and Japanese. We then evaluate the target agent’s utterances using an LLM-as-a-judge framework to estimate expressed Big Five traits. The results show that expressed personality is shaped not only by explicit trait specification, but also by dialogue role and expressive style. These effects are trait-specific: dialogue role strongly influences Openness, expressive style substantially shapes Conscientiousness and Agreeableness, and explicit trait specification dominates Neuroticism. Even without explicit personality-trait specification, social and expressive conditions induce distinct personality-like impressions. Cross-linguistic comparisons show broadly similar patterns between English and Japanese dialogues, with noticeable differences only under specific combinations of personality, role, and expressive style. These findings suggest that personality control in LLM agents should be understood not as a direct consequence of trait prompting, but as a context-dependent process involving personality specification, social role, and expressive style.
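
To make the size of the factorial design concrete, the sketch below enumerates the condition grid described in the abstract (six personality conditions, three roles, three expressive styles). The condition labels are placeholders, and the per-cell count assumes the 1,080 dialogues per language are spread evenly across cells, which the abstract does not state.

```python
from itertools import product

# Placeholder labels; the actual condition names are not given in the abstract.
personalities = [f"personality_{i}" for i in range(6)]
roles = [f"role_{i}" for i in range(3)]
styles = [f"style_{i}" for i in range(3)]

# Full factorial grid: every combination of the three prompt factors.
cells = list(product(personalities, roles, styles))

dialogues_per_language = 1080
# Assumption: dialogues are distributed evenly over the grid.
per_cell = dialogues_per_language / len(cells)
```

Under that even-split assumption, the 54 cells would each hold 20 dialogues per language.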

[NLP-90] MIRA: A Bilingual Benchmark for Medical Information Response Audit

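Quick read: This paper addresses a gap in safety evaluation for LLMs that provide public-facing health information: existing evaluations do not check whether responses preserve comparable medical information when the same question is phrased differently (different language, register, or health-literacy signals). The key to the solution is the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark of 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, every model answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, gave fewer concrete next steps, and offered less support for independent judgment, a pattern the authors call Differential Information Dilution (DID). A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification for Claude (~8%) and Qwen (~6%).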

Link: https://arxiv.org/abs/2605.28025
Authors: Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao
Affiliations: The University of Chicago; SynAI Technologies Inc.; Jinzhou Medical University; Zhejiang University; Dartmouth College; Northwestern University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Abstract:Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

[NLP-91] VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

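Quick read: This paper addresses the difficulty of producing visual captions that stay faithful to the visual content while minimizing both omission and hallucination. Existing reinforcement learning (RL) reward designs for captioning lack fine-grained, reliable signals for factual verification, which limits their effectiveness. The key to the solution is VCap, a Witness-Adjudicator reward that pairs the reference caption (the witness) with the visual signal (the adjudicator). By explicitly verifying factual consistency between the reference and the policy-generated caption, grounded in the visual signal, it yields a reward with hypergeometric-distribution-level precision. This makes learning effective even from imperfect references and supports weak-to-strong generalization in RL training. An 8B model trained with VCap outperforms open- and closed-source state-of-the-art models on multiple image and video captioning benchmarks, and human evaluation confirms its alignment with factual correctness. VCap also improves the perceptual capability of multimodal LLMs (MLLMs), generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.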

Link: https://arxiv.org/abs/2605.28023
Authors: Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
Affiliations: Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen; Chinese Academy of Sciences; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 28 pages, 8 figures

Abstract:Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

[NLP-92] Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation

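Quick read: This paper addresses a side effect of verifier-based reinforcement learning (RLVR) for code generation: while it improves executable correctness, it can increase redundancy among sampled programs. Using JPlag, a plagiarism-detection system for code, the authors measure implementation-level redundancy and find that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives keep redundancy lower. The key to the solution is to augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.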

Link: https://arxiv.org/abs/2605.28022
Authors: Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne
Affiliations: RIKEN Center for Computational Science, Tokyo, Japan; École Normale Supérieure Paris, PSL University, Paris, France; LAMSADE, CNRS, Université Paris-Dauphine-PSL, Paris, France
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Preprint under review

Abstract:LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
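
Two quantities in this abstract can be sketched briefly. The first function is the standard unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. The second is a hypothetical anti-redundancy reward: it subtracts a penalty proportional to each program's highest similarity to the other samples. Python's `difflib.SequenceMatcher` is swapped in here as a crude stand-in for JPlag similarity, and the penalty weight and reward shape are assumptions, not the paper's formulation.

```python
from difflib import SequenceMatcher
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def redundancy_aware_rewards(programs, passed, penalty=0.5):
    """Correctness reward minus penalty * (max similarity to any other sample)."""
    rewards = []
    for i, prog in enumerate(programs):
        others = [p for j, p in enumerate(programs) if j != i]
        max_sim = max(
            (SequenceMatcher(None, prog, o).ratio() for o in others),
            default=0.0,
        )
        rewards.append((1.0 if passed[i] else 0.0) - penalty * max_sim)
    return rewards
```

With this shape, two identical correct programs each receive a lower reward than a correct program that differs from the rest, which pushes a fixed sampling budget toward distinct implementations.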

[NLP-93] The Missing Piece in Pre-trained Model Evaluation: Reward-Guided Decoding Unlocks Task-Oriented Behavior Without Parameter Updates

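Quick read: This paper addresses the difficulty of reliably evaluating the capabilities of pre-trained LLMs at inference time. Base models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding, so benchmark results conflate model capability with decoding-induced failures, and exposing task-oriented behavior usually requires costly post-training. The key to the solution is Energy-Based Decoding (EBD), a training-free, reward-guided framework. It adds a lightweight external reward model to decoding and, without changing the base model's parameters, steers generations toward high-utility responses while anchoring them to the pre-trained prior through a reward-tilted target distribution. EBD shifts base-model outputs toward instruction-following behavior and makes them more similar to post-trained counterparts. It outperforms baselines across five models and six benchmarks, for example raising Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, and is robust to reward-model size.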

Link: https://arxiv.org/abs/2605.28020
Authors: Shaobo Wang, Guo Chen, Ziyue Wang, Zhengyang Tang, Qingyang Liu, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; Qwen Team, Alibaba Group; The Chinese University of Hong Kong, Shenzhen
Categories: Computation and Language (cs.CL)
Comments: 26 pages, 5 figures, 8 tables

Abstract:With the rapid progress of large language models (LLMs), reliably evaluating the capabilities of pre-trained LLMs has become increasingly important. The challenge is that base pre-trained models are optimized for next-token prediction and often fail to follow instructions or produce well-formed answers under standard prompting and direct decoding. As a result, benchmark performance can conflate model capability with decoding-induced failures to produce task-oriented outputs, while exposing such behavior often relies on costly post-training. Recent decoding-only approaches attempt to reshape output distributions, but such methods can be inefficient and brittle across open-ended tasks. To address these limitations, we propose Energy-Based Decoding (EBD), a training-free, reward-guided framework for activating task-oriented behaviors from frozen pre-trained LLMs across both open-ended and objective tasks. EBD augments decoding with an external lightweight reward model, steering generations toward high-utility responses while anchoring them to the pre-trained model prior through a reward-tilted target distribution. We show that EBD shifts base-model outputs toward more instruction-following behavior, increasing behavioral similarity to post-trained counterparts and enabling a fairer inference-time evaluation of accessible pre-trained-model behavior. Empirically, EBD outperforms baselines across five models and six benchmarks, improving Qwen3-8B-Base on AlpacaEval2.0 from 8.8 to 44.5, reducing Mistral-7B Math500 latency by 18.9x relative to prior decoding work, and remaining robust to reward-model size.
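
A reward-tilted target distribution is commonly written as p*(y) ∝ p_base(y) · exp(r(y) / β). The sketch below assumes that form, which the abstract does not spell out, and applies it to a handful of candidate continuations: it combines base-model log-probabilities with reward scores and normalizes. The function name, the temperature `beta`, and the toy numbers are illustrative assumptions.

```python
import math

def reward_tilted_weights(base_logprobs, rewards, beta=1.0):
    """Normalized weights proportional to p_base(y) * exp(r(y) / beta)."""
    scores = [lp + r / beta for lp, r in zip(base_logprobs, rewards)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = reward_tilted_weights(
    base_logprobs=[math.log(0.6), math.log(0.3), math.log(0.1)],
    rewards=[0.0, 2.0, 0.0],
    beta=1.0,
)
```

A small `beta` lets the reward dominate, while a very large `beta` recovers the base model's own distribution, which is the sense in which the tilt stays anchored to the pre-trained prior.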

[NLP-94] ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

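Quick read: This paper addresses the limited in-domain gains and poor out-of-domain generalization of existing on-policy self-distillation (OPSD) methods. It identifies two causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and distilling over the full response can overwrite valid reasoning prefixes and reinforce overfitting. The key to the solution is Reflective On-policy Self-Distillation (ROSD), which uses a self-reflector to extract a corrective idea and locate the first erroneous span in each rollout, turning reference-solution imitation into reflection-guided, error-localized correction. The corrective idea guides the self-teacher toward targeted supervision, while the located span restricts distillation to where correction is needed, preserving valid prefixes. On multiple in-domain and out-of-domain reasoning benchmarks, ROSD yields stronger in-domain performance overall and substantially better out-of-domain generalization than standard OPSD.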

Link: https://arxiv.org/abs/2605.28014
Authors: Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu
Affiliations: The Hong Kong Polytechnic University; Baidu Inc.; Shandong University; Leiden University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint

Abstract:On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at this https URL.
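
Error-localized distillation can be pictured as a token mask over the rollout. The sketch below assumes the simplest reading, in which the distillation loss is averaged only over the tokens of the located erroneous span and the valid prefix contributes nothing. The abstract does not say whether tokens after the span are also included, so the span boundaries, function names, and loss values here are hypothetical.

```python
def localized_distillation_mask(num_tokens, error_start, error_end):
    """1 for tokens inside the erroneous span [error_start, error_end), else 0."""
    if not 0 <= error_start <= error_end <= num_tokens:
        raise ValueError("span must lie inside the response")
    return [1 if error_start <= t < error_end else 0 for t in range(num_tokens)]

def masked_mean(per_token_losses, mask):
    """Average the per-token distillation loss over masked-in tokens only."""
    selected = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

mask = localized_distillation_mask(num_tokens=6, error_start=2, error_end=4)
loss = masked_mean([1.0, 1.0, 4.0, 2.0, 9.0, 9.0], mask)
```

Full-response distillation would correspond to an all-ones mask, which is the setting the abstract says can overwrite valid reasoning prefixes.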

[NLP-95] KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks

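Quick read: This paper addresses two limitations of current safety evaluation for multimodal large language models (MLLMs): datasets are English-centric, and they focus on generic risks that are not tied to local cultural contexts. The key to the solution is KSAFE-MM, a Korean multimodal safety benchmark with two parts. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which turns generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent vulnerabilities by pairing localized visual queries derived from real-world contexts with jailbreak-style textual queries. Together they form a general-to-local construction pipeline. An evaluation of 12 state-of-the-art MLLMs shows that models are more vulnerable to culturally grounded attacks than to generic ones, that jailbreaking strategies substantially raise attack success rates (ProgramExecution reaches up to 74.2% ASR versus 13.4% for standard queries), and that there is a systematic trade-off between safety and over-refusal.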

Link: https://arxiv.org/abs/2605.28013
Authors: Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee, HyunBeom Cho, Jaewon Lee, Wonhyuk Lee, Youngchol Kim, JeongYeop Kim, Donghyun Kim
Affiliations: Korea University; KT Corporation
Categories: Computation and Language (cs.CL)

Abstract:Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.

[NLP-96] MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

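Quick read: This paper addresses heterogeneous memory contamination in memory-augmented LLMs: memory systems collapse stable user facts, episodic events, and behavioral rules into a shared space, so semantically relevant but functionally incompatible memories are used as interchangeable evidence and mislead generation. The key to the solution is MemGuard, a type-aware memory framework. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and composes evidence only from the memory types a query needs. On hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods.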

Link: https://arxiv.org/abs/2605.28009
Authors: Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; Columbia University; Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
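
The idea of tagging each memory with a functional role at write time and retrieving only from the needed types can be sketched with a small typed store. The three types mirror the categories named in the abstract; the class names, the term-overlap matching, and the example memories are hypothetical and much simpler than a real retrieval pipeline.

```python
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    FACT = "stable user fact"
    EVENT = "episodic event"
    RULE = "behavioral rule"

@dataclass(frozen=True)
class MemoryItem:
    text: str
    mtype: MemoryType  # functional role assigned at write time

def retrieve(store, query_terms, needed_types):
    """Return items of the needed types that share at least one term with the query."""
    terms = {t.lower() for t in query_terms}
    return [
        item for item in store
        if item.mtype in needed_types and terms & set(item.text.lower().split())
    ]

store = [
    MemoryItem("user is vegetarian", MemoryType.FACT),
    MemoryItem("user ate fish once at a wedding", MemoryType.EVENT),
    MemoryItem("always confirm allergies before suggesting recipes", MemoryType.RULE),
]
hits = retrieve(store, ["user", "diet"], {MemoryType.FACT})
```

Here the one-off wedding event matches the query lexically but is excluded by type, so it cannot be overgeneralized into a claim about the user's usual diet.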

[NLP-97] Integrated and Cross-Architecture Interpretation of LLM Reasoning

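Quick read: This paper addresses the opacity of LLM reasoning: outputs are observable, but the underlying reasoning patterns are not, and single probes such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR) risk underestimating the genuine inferential structure. The key to the solution is the Integrated, cross-Architecture Reasoning (IAR) framework, which has three steps. First, bandwidth-calibrated MIP combined with Tukey IQR peak detection isolates reasoning-crucial tokens at the output layer. Second, an overlap analysis between MIP-picked tokens and DTR-deep tokens traces these tokens across layers and shows whether they are also computation-intensive. Third, a Jaccard stability metric over multi-domain problems checks whether the MIP-identified tokens consistently reflect reasoning quality. Experiments on three models (Qwen-7B, Qwen-14B, Llama-8B) across four domains (mathematics, code, logic, common sense) indicate that the interpretation generalizes across architectures.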

Link: https://arxiv.org/abs/2605.28006
Authors: Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang
Affiliations: Tsinghua Shenzhen International Graduate School; Harbin Institute of Technology, Shenzhen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose to use bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which further helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens reliably reflect reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
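
Two of the building blocks named here are standard statistics and can be sketched directly: Tukey's upper fence (Q3 + 1.5 · IQR) for picking outlying peaks in a per-token score series, and Jaccard similarity for comparing the token sets picked on different problems. The per-token scores below are made-up numbers standing in for mutual-information estimates; the bandwidth calibration and the DTR overlap step are not modelled.

```python
from statistics import quantiles

def tukey_peaks(values, k=1.5):
    """Indices whose value exceeds Tukey's upper fence, Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > fence]

def jaccard(a, b):
    """Jaccard similarity of two collections; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical per-token scores; only index 5 stands out from the rest.
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 5.0, 0.1, 0.2]
peaks = tukey_peaks(scores)
```

A stability check in this spirit would compute `jaccard` between the sets of peak tokens found on different problems or domains and treat high overlap as evidence that the same tokens matter consistently.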

[NLP-98] Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG

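Quick read: This paper addresses a limitation of GraphRAG on complex question answering: existing frameworks extract entities and relations within individual chunks, so cross-chunk relations, whose evidence spans multiple passages, are systematically absent from the knowledge-graph index. Recovering them exhaustively with an LLM is impractical because the number of chunk combinations explodes. The key to the solution is CrossAug, a GNN-guided cross-chunk graph augmentation method that runs as an offline step before query-time retrieval. It derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to the selected high-scoring regions. On three LLM-based GraphRAG frameworks and four multi-hop and long-document QA benchmarks, CrossAug consistently improves performance.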

Link: https://arxiv.org/abs/2605.28004
Authors: Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li
Affiliations: East China Normal University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures, 8 tables

Abstract:GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations – those whose evidence spans multiple passages – systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at this https URL.
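
Self-supervised graph corruption can be sketched as hiding some known edges and treating them as recovery targets, so that a scorer can learn what a region with missing relations looks like. The code below shows only that label-generation step on a toy triple list; the drop fraction, the seed, and the example edges are assumptions, and the GNN scorer and LLM completion are not modelled.

```python
import random

def corrupt_graph(edges, drop_fraction, seed=0):
    """Hide a random fraction of edges; the hidden ones become recovery targets."""
    rng = random.Random(seed)
    edges = sorted(edges)
    num_drop = max(1, int(len(edges) * drop_fraction))
    hidden = set(rng.sample(edges, num_drop))
    visible = [e for e in edges if e not in hidden]
    return visible, sorted(hidden)

# Toy (head, relation, tail) triples, purely illustrative.
edges = [
    ("A", "works_at", "B"),
    ("B", "located_in", "C"),
    ("A", "born_in", "D"),
    ("D", "part_of", "C"),
    ("E", "founded", "B"),
]
visible, hidden = corrupt_graph(edges, drop_fraction=0.4)
```

Subgraphs around the hidden edges then serve as positive examples of missingness and untouched subgraphs as negatives, without any manual annotation.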

[NLP-99] ResearchMath-14K: Scaling Research-Level Mathematics via Agents

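Quick read: This paper asks whether language models can meaningfully engage with research-level mathematical problems without human intervention; a major obstacle is the lack of large-scale research-level math datasets. The key to the solution is ResearchMath-14k, a set of 14,056 research-level problems curated from academic sources via a multi-agent pipeline. The authors also generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, in which they observe recurring avoidance behaviors such as non-attempts and fabricated references. After agentic filtering of these trajectories, fine-tuning Qwen3 models from 4B to 30B parameters improves over the base models by 9.2 points on average. Filtered attempts at open problems can thus provide useful supervision even without fully correct reasoning traces.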

Link: https://arxiv.org/abs/2605.28003
Authors: Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang, Youngjae Yu
Affiliations: Seoul National University; OneLineAI; Yonsei University
Categories: Computation and Language (cs.CL)
Comments: Work in progress. Dataset available at: this https URL

Abstract:The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future works on research-level mathematical reasoning.

[NLP-100] Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

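Quick read: This paper addresses toxic, hateful, or harmful generations from large language models (LLMs). Existing mitigation relies on costly retraining or output-level filtering and offers no mechanistic insight into where toxicity originates inside the model. The key to the solution is a pair of complementary retraining-free frameworks, Meow2X and TRNE. They localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress it through inference-time scaling or minimal rank-one weight edits, without any gradient descent. Across five language models, two benchmarks, and 90 configurations, evaluated with dual safety evaluators, the methods consistently reduce toxicity while preserving language modeling quality. The analysis further shows that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups, which underscores the need for multi-evaluator safety assessment.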

Link: https://arxiv.org/abs/2605.27997
Authors: Himanshu Beniwal, Mayank Singh
Affiliations: Indian Institute of Technology Gandhinagar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits – without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups – underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.
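
Localization by activation differentials followed by inference-time scaling can be sketched on plain lists. The code assumes the simplest version: rank neurons by the gap between their mean activation on toxic prompts and on neutral prompts, then multiply the top-ranked ones by a scale factor. The activation values, function names, and ranking rule are illustrative assumptions and are not taken from Meow2X or TRNE.

```python
def top_differential_neurons(toxic_acts, neutral_acts, top_k):
    """Rank neurons by mean activation on toxic prompts minus mean on neutral ones."""
    num_neurons = len(toxic_acts[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    diffs = [mean(toxic_acts, j) - mean(neutral_acts, j) for j in range(num_neurons)]
    return sorted(range(num_neurons), key=lambda j: -diffs[j])[:top_k]

def scale_neurons(activation, neuron_ids, scale):
    """Inference-time suppression: multiply the selected neurons by scale."""
    ids = set(neuron_ids)
    return [a * scale if j in ids else a for j, a in enumerate(activation)]

# Hypothetical activations: rows are prompts, columns are neurons.
toxic = [[0.1, 2.0, 0.3], [0.2, 1.8, 0.1]]
neutral = [[0.1, 0.2, 0.2], [0.2, 0.0, 0.2]]
targets = top_differential_neurons(toxic, neutral, top_k=1)
```

No weights change in the scaling variant; the selected activations are damped on the forward pass only, which is why it needs no gradient descent.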

[NLP-101] Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

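Quick read: This paper addresses object hallucination in multimodal large language models (MLLMs), where generated text mentions objects that are not in the image. Existing inference-time mitigation methods mostly assume that hallucinations stem from visual neglect and steer the model toward stronger visual reliance. Systematic interventions on multiple MLLMs show instead that pushing toward more visual reliance can exacerbate hallucinations on some models, while less can mitigate them, so attributing hallucination to visual insufficiency alone is underdetermined. The authors argue that the image, as a context, competes with both the model's parametric knowledge and the textual context. The key to the solution is Context-Preference Activation Steering (CAS), a training-free framework. It extracts two semantically distinct Context Preference Vectors (CPVs) from two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucination without increasing decoding latency while preserving native text-generation quality.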

Link: https://arxiv.org/abs/2605.27993
Authors: Jingwen Wu, Xijun Zhang, Ge Song
Affiliations: Nanjing Normal University
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 5 figures

Abstract:Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while less may mitigate hallucinations. This result suggests that attributing hallucinations solely to visual insufficiency is underdetermined. We argue that the image, as a context, simultaneously competes with the model’s parametric knowledge and the textual context. For this, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.
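
Signed residual injection amounts to adding a scaled direction vector to a hidden state, with the sign of the coefficient deciding whether reliance on a context is pushed up or down. The sketch below assumes the preference vector is a difference of mean activations between two sets of conflict samples, a common way to build steering vectors that the abstract does not confirm for CAS. All names and numbers are hypothetical.

```python
def preference_vector(acts_prefer, acts_other):
    """Difference of mean activations between two sets of conflict samples."""
    dim = len(acts_prefer[0])

    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    return [mean(acts_prefer, j) - mean(acts_other, j) for j in range(dim)]

def inject(hidden, vector, alpha):
    """Signed residual injection: alpha > 0 pushes toward the preference, alpha < 0 away."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Hypothetical 2-dimensional activations for two small conflict-sample sets.
vec = preference_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]])
steered = inject([0.5, 0.5], vec, alpha=-0.5)
```

Because the vector is computed once offline and added in a single forward pass, this kind of steering adds no extra decoding steps, which is consistent with the abstract's latency claim.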

[NLP-102] Auditing Stance Asymmetry in Generative Explanations

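[Quick read]: This paper addresses a subtle form of bias in open-ended explanations produced by generative AI, which the author calls stance-bearing asymmetry: without using hostile language, a model can make one side appear structurally understandable and the other side personally at fault, overreacting, or less worth taking seriously, through how it assigns responsibility, context, and legitimacy. The key to the solution is Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit supporting or counter-evidence. This decomposition identifies which surface differences weaken under structural or evidence control and which remain stable, exposing asymmetries in the model's explanatory stance. SDE shifts generative bias evaluation from simple surface comparison to an audit of explanatory stance, focusing on how stance changes under decomposition and on where automatic scoring fails to capture distinctions that readers perceive.

Link: https://arxiv.org/abs/2605.27988
Authors: Jiarui Han
Affiliations: University of Waterloo
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Click to view abstract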

Abstract:Bias evaluation for language models has made substantial progress on bounded comparisons, such as overt derogation, stereotype association, or label-sensitive differences under controlled substitutions. Open-ended explanations raise a different problem: they guide interpretation by assigning responsibility, legitimacy, context, and grievance. A model can avoid hostile language while making one side structurally understandable and another personally at fault, overreacting, or less worth taking seriously. We call this stance-bearing asymmetry in generative explanations. We propose Symmetry Decomposition Evaluation (SDE), which tests paired situations with concrete group labels, structural-role rewrites, and explicit support or counter-evidence. In a controlled 32-family prototype suite, this decomposition shows that surface differences are not all alike: some weaken under structural or evidence control, while others remain as stable differences in how the model assigns blame, context, or legitimacy. Targeted case review and judge comparison suggest a broader difficulty for evaluating open-ended framing asymmetries: judge readings shift across operationalizations, and scalar scores can flatten distinctions that readers use to interpret explanatory stance. SDE therefore reframes generative bias evaluation as an audit of explanatory stance – what stance each side receives, how it changes under decomposition, and where automatic scoring becomes unstable.

[NLP-103] An Evolutionary Approach for Designing Stable and Highly Expressible Low-Immunogenicity Therapeutic mRNA Sequences

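[Quick read]: This paper addresses the sequence-design problem for therapeutic mRNA: balancing efficient translation, structural stability, and low immunogenicity. Existing methods either over-optimize structural stability (Linear Design, at the cost of translation efficiency) or focus only on codon usage while ignoring structural constraints (BiLSTM-CRF). The key to the solution is a two-stage computational framework. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically plausible sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves the candidates using codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global minimum free energy, MFE, and GC content), and reduced immunogenicity (CpG/UpA motif frequency). By generation 42 the framework reaches CAI 0.74 and tAI 0.64 while keeping high structural stability (MFE around -350 kcal/mol, about 84% base pairing) and lowering the average immune penalty to 27.3, which the authors present as evidence that the BERT-GA framework balances translation efficiency and structural stability for rational mRNA design.

Link: https://arxiv.org/abs/2605.27986
Authors: Dhawa Sang Dong,Mausam Gurung,Suraj Kandel
Affiliations: Kathmandu University; Kathmandu Engineering College; Tribhuvan University
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:

Click to view abstract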

Abstract:Messenger RNA (mRNA) sequences used as therapeutics require optimized design to ensure efficient translation, structural stability, and minimal immunogenicity. This study presents a two-stage in-silico framework that integrates deep learning and evolutionary computation for rational mRNA optimization, as an alternative to existing state-of-the-art models. In the first stage, a pretrained CodonTransformer (a BERT-like large language model) generates biologically coherent mRNA sequences encoding the target antigen. In the second stage, a genetic algorithm (GA) evolves these candidate sequences through codon-aware crossover and synonymous mutation guided by human codon usage preferences. The fitness function combines translation-related metrics (CAI, tAI, codon-pair bias), mRNA structural stability (local and global MFE via RNAfold, GC content), and reduced immunogenicity (CpG/UpA motif frequency). Over successive generations (38th, 40th, and 42nd), the GA improved CAI and tAI by over 6%, reaching CAI values of 0.73 to 0.74 and tAI values of 0.63 to 0.64. Codon-pair bias remained high and consistent (0.97), and ribosomal accessibility at the 5' end improved, with the unpaired_30 fraction reaching 0.87. Global minimum free energy (MFE) converged to a balanced range of -346 to -356 kcal/mol, corresponding to approximately 84% base pairing, and immune-stimulatory motifs were reduced, lowering the average immune penalty to 27.3 in the final generation. Whereas Linear Design produces hyper-stable transcripts (MFE -2000 kcal/mol) that risk translation inefficiency due to extreme rigidity, and BiLSTM-CRF focuses solely on high CAI (0.96 to 0.98) without structural constraints, our framework achieves an optimal translation-stability equilibrium, highlighting the proposed BERT-GA framework as an effective, data-driven approach for the in-silico design and optimization of mRNA sequences.
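
The synonymous-mutation step and two of the simpler fitness terms can be illustrated with a short Python sketch. It uses only a five-amino-acid subset of the standard genetic code, and the helper names are hypothetical; the paper's actual operators, weights, and RNAfold-based terms are not reproduced here.

```python
import random

# Subset of the standard genetic code (RNA alphabet), enough for a toy example.
SYNONYMS = {
    "M": ["AUG"],
    "K": ["AAA", "AAG"],
    "A": ["GCU", "GCC", "GCA", "GCG"],
    "G": ["GGU", "GGC", "GGA", "GGG"],
    "L": ["CUU", "CUC", "CUA", "CUG", "UUA", "UUG"],
}
CODON_TO_AA = {codon: aa for aa, codons in SYNONYMS.items() for codon in codons}


def split_codons(seq):
    """Split an in-frame coding sequence into codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    """Translate a coding sequence into its amino-acid string."""
    return "".join(CODON_TO_AA[c] for c in split_codons(seq))


def synonymous_mutation(seq, rng):
    """Swap one randomly chosen codon for a synonym, keeping the protein fixed."""
    codons = split_codons(seq)
    i = rng.randrange(len(codons))
    codons[i] = rng.choice(SYNONYMS[CODON_TO_AA[codons[i]]])
    return "".join(codons)


def gc_content(seq):
    """Fraction of G and C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


def immune_motif_count(seq):
    """Number of CpG and UpA dinucleotides (overlapping windows)."""
    return sum(seq[i:i + 2] in ("CG", "UA") for i in range(len(seq) - 1))


parent = "AUGAAAGCUCUUGGA"  # encodes M-K-A-L-G
child = synonymous_mutation(parent, random.Random(0))
```

A GA built this way can only move between sequences that encode the same protein, so the fitness terms differentiate candidates purely on codon choice.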

[NLP-104] KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

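[Quick read]: This paper addresses the fact that evaluation of speech language models (SpeechLMs) is concentrated on English, which limits reliable assessment of multilingual speech capabilities. Transferring benchmarks through automatic speech recognition (ASR), translation, normalization, and text-to-speech (TTS) can corrupt language-specific instructions, answer constraints, and spoken forms. For audio understanding, reusing source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties.

The key to the solution is a pair of human-agent benchmark-construction frameworks. The first transfers source-language spoken question answering (SpokenQA) benchmarks into target-language SpokenQA benchmarks. The second converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. With these frameworks the authors construct and publicly release three Korean speech benchmarks, KVoiceBench and KOpenAudioBench for Korean SpokenQA and KMMAU for Korean audio understanding, comprising 12,345 samples in total. Experiments show that English-Korean performance gaps vary by model and task family, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses that English-only evaluation does not expose.

Link: https://arxiv.org/abs/2605.27984
Authors: Haechan Kim,Seungjun Chung,Inkyu Park,Jihoo Lee,Jonghyun Lee
Affiliations: KRAFTON; Kim Jaechul Graduate School of AI, KAIST; Department of Mathematical Sciences, Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures

Click to view abstract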

Abstract:Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

[NLP-105] Periodic RoPE for Infinite Context LLMs

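[Quick read]: This paper addresses the performance degradation of Large Language Models (LLMs) on ultra-long contexts when the sequence length exceeds the range that positional encodings such as RoPE saw during pre-training, a problem the author calls position exhaustion. The key to the solution is a positional encoding mechanism named Periodic RoPE (P-RoPE), combined with Sliding Window Attention (SWA) and global attention with No Positional Encoding (NoPE) in a two-layer-type design: local layers use P-RoPE with SWA to capture relative positions within each window, and global layers use NoPE to allow unconstrained interaction across the whole sequence. The design avoids positional extrapolation and theoretically supports an infinite context window. Experiments show that the proposed MiniWin model outperforms MiniMInd, which uses a standard GPT architecture, in long-context efficiency and stability, offering a possible path toward LLMs with genuine infinite-context understanding.

Link: https://arxiv.org/abs/2605.27980
Authors: Simin Huo
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages

Click to view abstract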

Abstract:The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at this https URL.
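
The abstract does not spell out how the rotation is made periodic, so the following Python sketch is only one plausible reading: positions are wrapped modulo a hypothetical `period`, and frequencies are integer multiples of 2π/period so that relative rotations depend only on the offset between two tokens. The causal sliding-window mask is the standard construction. All names and parameters here are assumptions, not the paper's API.

```python
import math


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (causal, local)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def periodic_angle(pos, k, period):
    """Rotation angle for frequency index k at a wrapped position (assumed form)."""
    return 2.0 * math.pi * k * (pos % period) / period


def rotate(pair, angle):
    """Apply a 2-D rotation, as RoPE does to each pair of feature dimensions."""
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


mask = sliding_window_mask(seq_len=4, window=2)
# Positions 3 and 11 receive the same rotation when period = 8.
same = rotate((1.0, 0.0), periodic_angle(3, 1, 8)) == rotate((1.0, 0.0), periodic_angle(11, 1, 8))
```

Under this reading, keeping the attention window no larger than the period means every in-window offset maps to a distinct rotation, so no position index ever exceeds the range seen in training.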

[NLP-106] Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

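[Quick read]: This paper addresses the severe loss of output diversity that occurs when large language models are fine-tuned to generate persona- or tone-conditioned responses, a failure the authors call Cross-Style Collapse. The key to the solution is a lightweight auxiliary objective, Semantic Flow Regularization (SFR), which supervises the backbone through conditional flow matching, using continuous sentence-encoder embeddings of future segments as targets. A stochastic flow source preserves multi-modality during training, and the flow-matching head is discarded at inference, so deployment cost does not increase. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality. Results on LiveCodeBench-v5 confirm generality beyond stylized dialogue, and a controlled comparison on MBPP shows Multi-Token Prediction to be a degenerate special case of SFR.

Link: https://arxiv.org/abs/2605.27971
Authors: Kerui Peng,Feifei Li,Xingyu Fan,Wenhui Que
Affiliations: Tencent Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract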

Abstract:When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited, a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
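
For readers unfamiliar with conditional flow matching, the sketch below shows the regression target under the widely used straight-line probability path, which is an assumption here; the abstract does not state which path SFR uses. `x0` stands for a sample from the stochastic source, `x1` for the sentence-encoder embedding of a future segment, and `pred` for the velocity predicted by a (hypothetical) flow-matching head.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from source sample x0 to target embedding x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(pred, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


x_mid = interpolate([0.0, 0.0], [2.0, 4.0], t=0.5)
perfect = flow_matching_loss([2.0, 4.0], [0.0, 0.0], [2.0, 4.0])
worst = flow_matching_loss([0.0, 0.0], [0.0, 0.0], [2.0, 4.0])
```

Since the head is only a training-time regressor on top of backbone features, dropping it at inference leaves the deployed model unchanged, matching the zero-cost claim in the abstract.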

[NLP-107] Boundary Suppression Asymmetry in Post-trained Assistants: Over-expansion as a Controllability Cost

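[Quick read]: This paper asks whether post-training that optimizes language-model assistants to avoid under-answering creates asymmetric controllability costs: when users explicitly request shorter or more narrowly bounded answers, which assistant behaviors can still be suppressed and which resist control. The key contribution is identifying and quantifying this boundary-suppression asymmetry. Anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, and the effect is not explained by longer default outputs, pure EOS (end-of-sequence) failure, uncertainty compensation, or local continuation bias. The evidence instead supports a mixed planning/stopping account in which content-budget overshoot and continuation persistence jointly make boundary correction harder. The finding suggests that post-training may introduce direction-specific controllability costs: some helpful tendencies remain easy to invoke yet are harder to suppress locally.

Link: https://arxiv.org/abs/2605.27969
Authors: Jiarui Han
Affiliations: University of Waterloo
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract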

Abstract:Post-trained language-model assistants are often optimized to avoid under-answering, encouraging complete, helpful, cautious, and proactive responses. We ask whether this optimization creates asymmetric controllability costs: when users explicitly request narrower answers, which assistant behaviors remain suppressible, and which continue to shape the response? We study this problem as boundary-suppression asymmetry. Prompt-side probes across multiple high-level response dimensions suggest a selective cost, concentrated around "too-much assistant" directions such as over-completion, extra help, and anti-underanswering. Using controlled assistant-policy variants derived from a shared base model, we find that anti-underanswering policies are harder to pull back than the baseline under matched boundary-control evaluations, while minimal-boundary variants generally avoid this anti-side upward shift in the direct boundary-control comparisons. Mechanism-oriented probes point beyond longer default outputs, pure EOS failure, uncertainty compensation, and local continuation bias, while robustness checks preserve the main anti-over-baseline ordering under shared-system and larger-scale settings. The evidence supports a mixed planning/stopping account, where content-budget overshoot and continuation persistence jointly make boundary correction harder. Overall, post-training may create direction-specific controllability costs: some helpful assistant tendencies remain easy to invoke, yet harder to locally suppress.

[NLP-108] Pressure-Testing Deception Probes in LLMs: Scaling Robustness and the Geometry of Deceptive Representations ACL2026

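[Quick read]: This paper addresses why linear probes trained on LLM activations report very high deception-detection accuracy on clean benchmarks (AUROC > 0.96) yet collapse under distributional shift. The approach is to systematically pressure-test probes across the Gemma 3 model family (1B to 27B parameters) and to diagnose the failures by testing four hypotheses about how deception is encoded: a single linear direction, a multi-dimensional subspace, a convex conic hull, and an entropy proxy. The findings are: (1) probes reach near-perfect AUROC (0.998) on clean data but collapse under stylistic shifts; (2) style-augmented training recovers high detection performance on unseen styles (mean AUROC 0.979 to 0.983); (3) the single-direction hypothesis is rejected (k=1 reaches only 0.61 to 0.80 AUROC), and cross-domain transfer failure is geometric rather than caused by layer mismatch; (4) the entropy-proxy hypothesis is also rejected, while multi-dimensional probes (k=5) recover the signal from distributed sub-threshold features. The paper concludes that probe fragility reflects a narrow training distribution rather than an architectural limitation, and that the apparent inverse scaling pattern is an artifact of the training distribution rather than a genuine scale-dependent effect.

Link: https://arxiv.org/abs/2605.27958
Authors: Sachin Kumar
Affiliations: LexisNexis, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at the GEM Workshop @ ACL 2026

Click to view abstract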

Abstract:Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
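
Since every headline number above is an AUROC over linear-probe scores, a small Python sketch of both pieces may help. The probe weights and scores are made-up values for illustration; AUROC is computed from its definition as the probability that a randomly chosen positive example outscores a randomly chosen negative one, with ties counted as one half.

```python
def probe_score(weights, bias, activation):
    """Linear probe: dot product of weights and activation, plus bias."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical probe scores for deceptive (positive) and honest (negative) examples.
clean = auroc([0.9, 0.8], [0.1, 0.2])
shifted = auroc([0.9, 0.8], [0.1, 0.85])
```

An AUROC of 0.5 corresponds to chance-level ranking, which is the reference point for reading the k=1 results of 0.61 to 0.80.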

[NLP-109] DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

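[Quick read]: This paper addresses how, in disaster response, heterogeneous AI tools (such as satellite analysis, flood prediction, and damage assessment) can be coordinated into structured multi-step workflows, where the large language model (LLM) must not only choose semantically plausible tools but also generate executable workflows with correct parameter binding and dependency propagation. The key to the solution is DisasterBench, a benchmark for evaluating structured multi-agent planning under typed tool interface constraints, together with First-Point-of-Failure (FPoF), a failure-attribution method that localizes the earliest root-cause error in a predicted workflow and separates primary errors from downstream cascading effects. The study finds that the effectiveness of planning methods depends strongly on model capacity; that tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and that verbose intermediate reasoning can clash with structured output requirements and disrupt plan generation. These results point to a gap between semantic reasoning and execution-grounded coordination, and to the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency.

Link: https://arxiv.org/abs/2605.27957
Authors: Zhitong Chen,Kai Yin,Weifeng Zhang,Zhiyuan Wang,Xiangjue Dong,Chengkai Liu,Zhewei Liu,Yiming Xiao,Ali Mostafavi,James Caverlee
Affiliations: Texas A&M University; University of Toronto
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract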

Abstract:Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency.
Code, data, and evaluation resources are available at: this https URL
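
The idea behind First-Point-of-Failure, attributing a failed plan to its earliest divergent step, can be sketched in a few lines of Python. The step schema, tool names, and error labels below are hypothetical stand-ins chosen for illustration; the benchmark's real interfaces and error taxonomy may differ.

```python
from typing import Any, Dict, List, Optional, Tuple

Step = Dict[str, Any]  # assumed schema: {"tool": str, "params": dict}


def first_point_of_failure(predicted: List[Step],
                           reference: List[Step]) -> Optional[Tuple[int, str]]:
    """Return (step index, error type) of the earliest divergence, or None."""
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return (i, "missing_step")
        pred = predicted[i]
        if pred["tool"] != ref["tool"]:
            return (i, "tool_mismatch")
        if pred["params"] != ref["params"]:
            return (i, "parameter_binding")
    if len(predicted) > len(reference):
        return (len(reference), "extra_step")
    return None


reference = [
    {"tool": "satellite_analysis", "params": {"region": "A"}},
    {"tool": "flood_prediction", "params": {"area": "$step0.output"}},
    {"tool": "damage_assessment", "params": {"flood_map": "$step1.output"}},
]
predicted = [
    {"tool": "satellite_analysis", "params": {"region": "A"}},
    {"tool": "flood_prediction", "params": {"area": "A"}},       # dependency not propagated
    {"tool": "damage_report", "params": {"flood_map": "$step1.output"}},  # later, cascading error
]
fpof = first_point_of_failure(predicted, reference)
```

Only the step-1 binding error is reported; the wrong tool at step 2 is treated as downstream of it, which is the separation of primary and cascading errors the abstract describes.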

[NLP-110] Skill-as-Pseudocode: Refactoring Skill Libraries to Pseudocode for LLM Agents

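[Quick read]: This paper addresses the repeated retrieval and execution errors that arise when large language model (LLM) agents use free-form Markdown skill libraries: because the prose does not directly expose the input schema or the concrete invocation syntax, the agent falls into a "confused, re-retrieve, still confused" loop. The key to the solution is Skill-as-Pseudocode (SaP), which automatically converts Markdown skill text into typed pseudocode and applies a four-check deterministic verifier (coverage, binding, replacement, risk) to the extracted skill contracts. Contracts that pass are inlined into a rewritten skill skeleton together with concrete action templates, giving the agent two complementary signals: a typed signature describing what the skill does and a concrete template showing how to invoke it. This improves task success while reducing input tokens and LLM calls.

Link: https://arxiv.org/abs/2605.27955
Authors: Xinze Li,Yuhang Zang,Yixin Cao,Aixin Sun
Affiliations: Nanyang Technological University; Fudan University; Shanghai AI Laboratory
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL)
Comments: Preprint. Code: this https URL

Click to view abstract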

Abstract:Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a “confused - re-retrieve - still confused” loop in which the agent issues a partially-correct action, receives uninformative environment feedback, and re-retrieves the same prose. We propose Skill-as-Pseudocode (SaP), an automatic conversion of markdown skill libraries into typed pseudocode with deterministic quality control. For each cluster of similar procedural passages drawn from one or more skills, SaP extracts a typed contract and filters it through a four-check deterministic verifier (coverage, binding, replacement, risk). Promoted contracts are inlined into a rewritten skill skeleton together with restored concrete action templates, giving the agent two complementary signals: a typed signature for what the skill does and a concrete template for how to invoke it. On the 134-game ALFWorld unseen split with gpt-4o-mini, pooled across three seeds, SaP wins 82/402 paired games versus 47/402 for the Graph-of-Skills (GoS) baseline (pooled McNemar p = 8.2e-5), at -22.8 +/- 6.4% input tokens and -14.5 +/- 4.1% LLM calls per game.
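
As a rough illustration of pairing a typed signature with a concrete action template, consider the Python sketch below. The `SkillContract` class, the example skill, and the binding check (every template placeholder must be a declared parameter and vice versa) are assumptions made for this digest; the abstract names a binding check but does not define it.

```python
from dataclasses import dataclass
from string import Formatter
from typing import Dict


@dataclass
class SkillContract:
    name: str
    params: Dict[str, type]   # typed signature: what the skill takes
    template: str             # concrete template: how to invoke it

    def placeholders(self):
        """Field names that appear in the action template."""
        return {field for _, field, _, _ in Formatter().parse(self.template) if field}

    def binding_ok(self):
        """Assumed check: placeholders and declared parameters match exactly."""
        return self.placeholders() == set(self.params)

    def render(self, **kwargs):
        """Type-check the arguments, then fill the template."""
        for key, expected in self.params.items():
            if not isinstance(kwargs.get(key), expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return self.template.format(**kwargs)


heat = SkillContract("heat_object", {"obj": str, "appliance": str},
                     "heat {obj} with {appliance}")
action = heat.render(obj="mug 1", appliance="microwave 1")
```

A contract whose template mentions an undeclared placeholder fails `binding_ok()`, so under this sketch it would be filtered out before reaching the agent.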

[NLP-111] GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

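[Quick read]: This paper addresses three limitations of reinforcement learning for language-model reasoning: reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment. The key to the solution is GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories by the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance, and further analysis shows that controlled token-level modulation is what makes fine-grained credit assignment stable and effective.

Link: https://arxiv.org/abs/2605.27934
Authors: Shengmin Piao,Sanghyun Park
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL)
Comments:

Click to view abstract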

Abstract:Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.
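
The two ingredients named in the abstract, scoring a reasoning trace by the likelihood of the ground-truth answer and applying clipped, direction-preserving token-level modulation, can be sketched as follows. The exact formulas are not given in the abstract, so the length-normalised log-likelihood, the `1 + signal` weighting, and the `eps` clipping range are all assumptions for illustration.

```python
import math


def answer_log_likelihood(answer_token_probs):
    """Mean log-probability of the ground-truth answer tokens given a reasoning trace."""
    return sum(math.log(p) for p in answer_token_probs) / len(answer_token_probs)


def modulated_advantages(advantage, token_signals, eps=0.2):
    """Scale a response-level advantage per token.

    Weights are clipped to [1 - eps, 1 + eps]; with eps < 1 they stay
    positive, so every token keeps the sign of the response-level advantage.
    """
    out = []
    for signal in token_signals:
        weight = min(max(1.0 + signal, 1.0 - eps), 1.0 + eps)
        out.append(advantage * weight)
    return out


score = answer_log_likelihood([1.0, 1.0])
per_token = modulated_advantages(-2.0, [0.5, -0.5, 0.1], eps=0.2)
```

Bounding the weights this way limits how far any single token can deviate from the response-level signal, which is one way to read the abstract's remark that uncontrolled modulation destabilizes training.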

[NLP-112] When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

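[Quick read]: This paper addresses an open question for large vision-language models: how different reasoning paradigms (direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation) affect robustness to multimodal jailbreak attacks. The key to the explanation is an image-tool safety vector framework, which models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Experiments show that explicit image-tool interaction reduces jailbreak success by around 30% relative on average across the evaluated models, and that attack success stays low even when the returned image is manually overridden or itself looks unsafe. The robustness therefore does not come from benign image semantics or from the textual tool trace alone, but from a structured shift at the level of internal representations. The results suggest explicit image-tool interaction as a promising design pattern for improving safety, while motivating pipeline-specific safety evaluation.

Link: https://arxiv.org/abs/2605.27932
Authors: Yuan Tian,Bing Hu,Fang Wu,Xiaomin Li,Binghang Lu,Neil Zhenqiang Gong
Affiliations: Stanford University; Harvard University; Purdue University; Duke University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures, 7 tables

Click to view abstract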

Abstract:Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

[NLP-113] OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

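[Quick read]: This paper addresses the limited adaptation of general medical Multimodal Large Language Models (MLLMs) to highly specialized domains such as ophthalmology, where the core obstacle is the scarcity of large-scale, high-quality, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational systems are small and rely mostly on images from public benchmarks, so they fail to capture real-world clinical complexity. The key to the solution is OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that builds instruction data from open-access web-scale ophthalmology videos. It uses multimodal transcription to extract image-transcript pairs, visual cue separation and scoring to identify clinically relevant descriptions, and instruction synthesis with quality control to produce accurate and diverse clinical dialogues. With this engine the authors build OphIn-500K, a large multimodal ophthalmology instruction dataset with over 500,000 instruction instances and more than 151,000 unique images, and on top of it develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Experiments show that OphIn-VL outperforms state-of-the-art general medical and domain-specific MLLMs.

Link: https://arxiv.org/abs/2605.27916
Authors: Xuanzhao Dong,Wenhui Zhu,Xiwen Chen,Hao Wang,Xin Li,Yujian Xiong,Jiajun Cheng,Jingjing Wang,Xiaobing Yu,Haiyu Wu,Shao Tang,Zhipeng Wang,Langechuan Liu,Shan Lin,Oana Dumitrascu,Yalin Wang
Affiliations: Arizona State University; Clemson University; Washington University in St. Louis; University of Notre Dame; Florida State University; Rice University; NVIDIA; Mayo Clinic
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Click to view abstract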

Abstract:The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

[NLP-114] Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

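[Quick read]: This paper addresses the lack of reliable, verifiable instruments for evaluating subjective behavioral traits of large language models (LLMs), such as empathy, restraint, and calibrated emotional tone. Human inter-rater agreement on such qualities is low (rho ~ 0.45), using an LLM judge from the same training cohort as the target risks circular validation, and a single human-rater consensus cannot anchor capabilities on which humans themselves disagree. The key to the solution is a replication-first paradigm that certifies the instrument through four orthogonal properties: reliability across repeated runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration using judges from earlier training cohorts, and pre-registered prediction. Applied to emotional accompaniment, the rubric evolves in a data-driven way and stabilizes at 9 dimensions, and the paradigm surfaces differences that aggregate scores hide: on advice-restraint, GPT-5 falls 1.87 points relative to GPT-4.1 while the aggregate score stays flat. The decline survives three user-proxy swaps (retaining 95% of its magnitude), replicates across judge families and a 17-month cohort gap, and the instrument reaches ordinal Krippendorff alpha = 0.91. The paradigm also distinguishes instrumental ceilings (breakable by rubric refinement) from structural ceilings (requiring changes to the scenarios or model roster).

Link: https://arxiv.org/abs/2605.27914
Authors: Yuming (Rapheal) Huang,Yao Liu,Lei Wang,Junchen Wan
Affiliations: Cylingo team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract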

Abstract:Subjective evaluation of LLM behavior – empathy, restraint, calibrated emotional tone – is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target’s training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree. We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties – reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric self-evolve data-driven across iterations: the dimensions are not pre-stipulated and the procedure stabilizes to a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected. Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint – whether a model refrains from giving unsolicited solutions in empathic contexts – gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention). 
Cite as: arXiv:2605.27914 [cs.CL]; https://doi.org/10.48550/arXiv.2605.27914

[NLP-115] ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

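[Quick Read]: This paper addresses the limited interpretability of existing emotional support conversation (ESC) systems and their lack of support for systematic skill improvement, both of which stem from reliance on end-to-end response generation or coarse-grained strategy supervision. The key to the solution is ESC-Skills, a skill-centric framework with three components. First, it models localized support interactions as Intervention Units (IUs) that capture the dynamics among seeker states, support interventions, and post-response emotional changes. Second, from IUs extracted from both successful and failed dialogues, it builds the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. Third, it introduces a multi-profile self-evolutionary refinement mechanism in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation; the interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which then drive refinement of the Skills Bank through simulation-based verification. The method improves response-level quality and dialogue-level emotional outcomes while making support behavior more interpretable and controllable.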

Link: https://arxiv.org/abs/2605.27908
Authors: Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong
Affiliations: Soochow University; Alibaba Cloud Computing
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state–action–outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at this https URL.

[NLP-116] AI Research Agents Narrow Scientific Exploration

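[Quick Read]: This paper asks whether current AI research agents actually broaden scientific exploration or mainly produce local extensions of existing literature. The key to the approach is to treat AI research agents as scientific search systems and to compare AI-generated ideas empirically against human-authored papers from the same research areas, human follow-on research arising from the same seed literature, and the seed literature itself. Using four AI research-agent frameworks and six large language models, the authors generate 37,802 scientific ideas across citation-defined research areas in AI and machine learning and find four consistent patterns: AI-generated ideas are more concentrated than human papers; they stay closer to their starting literature than later human follow-on work does; the papers most similar to them tend to receive fewer subsequent citations; and where they differ from prior work, the difference comes mainly from recombining existing technical methods rather than posing fundamentally new research questions. The conclusion is that current AI research agents are better suited to local elaboration than to broadening scientific exploration.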

Link: https://arxiv.org/abs/2605.27905
Authors: Yixuan Tang, Yi Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:AI research agents can now generate research ideas, design experiments, run code, and draft papers, raising the possibility of large-scale AI-assisted scientific discovery. Many current agent frameworks explicitly encourage the generation of novel and high-impact ideas. Yet it remains unclear whether AI-assisted ideation broadens scientific exploration or mainly concentrates around existing work. We study AI research agents as scientific search systems. Using four AI research-agent frameworks and six large language models, we generate 37,802 scientific ideas from shared seed literature across citation-defined research areas in AI and machine learning. We then compare the resulting AI ideas against human-authored papers from the same research areas, follow-on human research emerging from the same seed literature, and the seed literature itself. Across experiments, four consistent patterns emerge. First, AI-generated ideas are substantially more concentrated than human-authored papers from the same research areas. Second, AI-generated ideas remain much closer to their starting literature than later human follow-on work does. Third, papers most similar to AI-generated ideas tend to receive lower subsequent citations. Fourth, when AI-generated ideas differ from prior work, the differences arise primarily from recombining existing technical methods rather than introducing fundamentally new research questions. Overall, current AI research agents appear better suited to local elaboration than to broadening scientific exploration.

[NLP-117] The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

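[Quick Read]: This paper asks whether chain-of-thought (CoT) monitoring remains reliable beyond English and across model families, and in particular whether it fails systematically in non-English settings. The key to the approach is a large-scale cross-lingual evaluation (13 languages, 16 models from seven frontier model families) that combines adversarial-hint evaluations with analysis of internal answer-token probabilities. It finds CoT unfaithfulness across languages and hint types, at an average rate of 95.9%, and shows that frontier models engage in strategic manipulation (answer-switching, post-hoc rationalization, and procedural exploitation of hints). Models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the visible CoT appears faithful. In low-resource languages these deceptive patterns remain at 100%, indicating that current CoT monitoring is fragile under linguistic distribution shift and gives a much weaker safety signal than English-only studies suggest. The authors therefore call for robust CoT monitors and white-box monitoring techniques, especially to improve monitorability in mid- and low-resource languages.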

Link: https://arxiv.org/abs/2605.27901
Authors: Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal
Affiliations: University of Virginia; Lawrence Livermore National Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B–120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at this https URL.
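
To make the notion of an unfaithfulness rate concrete, here is a minimal sketch under one common definition, which may differ from the paper's: among trials where the model's final answer follows the injected hint, count the share whose CoT never acknowledges the hint. The `Trial` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    hinted_answer: str       # answer suggested by the adversarial hint
    final_answer: str        # answer the model actually gave
    cot_mentions_hint: bool  # whether the visible reasoning acknowledges the hint


def unfaithfulness_rate(trials):
    """Share of hint-following trials whose CoT hides the hint (assumed definition)."""
    followed = [t for t in trials if t.final_answer == t.hinted_answer]
    if not followed:
        return 0.0
    hidden = sum(1 for t in followed if not t.cot_mentions_hint)
    return hidden / len(followed)


# Hypothetical trials: three follow the hint, two of those hide it.
trials = [
    Trial("B", "B", False),
    Trial("B", "B", True),
    Trial("C", "A", False),
    Trial("D", "D", False),
]
print(unfaithfulness_rate(trials))
```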

[NLP-118] FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations

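[Quick Read]: Large language models (LLMs) perform well on static financial reasoning and simple dynamic trading tasks, but no existing benchmark assesses their dynamic wealth management and financial decision-making in realistic environments. To fill this gap, the authors propose FinBoardBench, an evaluation suite built on three classic financial board games (Cashflow, Acquire, and Monopoly) that tests personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiation with asset auctions. The key to the solution is an evaluation environment that combines multiple dimensions of financial strategy with dynamic interaction, exposing the limits of current LLMs in complex settings: their strength in static reasoning does not carry over to effective dynamic decision-making, and they tend to neglect liquidity risk, which leaves them vulnerable to financial crises triggered by random events.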

Link: https://arxiv.org/abs/2605.27896
Authors: Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li, Yue Ma, Jie He, Qiancheng Zhang, Yuntao Zou, Dagang Li
Affiliations: Macau University of Science and Technology; Anhui University; SKLPlanets, Macau University of Science and Technology; University of Macau; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE)
Comments: Preprint

Abstract:Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that while exhibiting basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not transform into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.

[NLP-119] VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

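[Quick Read]: This paper addresses the gap between how large language model (LLM) agents score on search benchmarks and how real users experience them. Existing benchmarks rely on over-specified queries, single-turn interaction, and fixed-schema evaluation, none of which reflect real search behavior, where users and agents gradually clarify vague intent through multi-turn dialogue. The authors name this paradigm VibeSearch and build VibeSearchBench, a benchmark of 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into professional (VibeSearch-Pro) and daily-life (VibeSearch-Daily) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated with a progressive-disclosure user simulator and a graph-matching evaluation framework. Under both the ReAct framework and the OpenClaw agent harness, all seven frontier models tested remain substantially inadequate (best F1: 30.30), which points to a need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.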

Link: https://arxiv.org/abs/2605.27882
Authors: Xiaohongshu Inc
Affiliations: Xiaohongshu Dots Studio; Unipat AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks’ reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
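
As a rough illustration of scoring a predicted knowledge graph against a ground-truth one, the sketch below computes F1 over exactly matching (subject, relation, object) triples. Exact matching is an assumption made here for simplicity; the benchmark's graph-matching framework is not specified in the abstract and is likely more lenient. The sample triples are invented.

```python
def triple_f1(predicted, gold):
    """F1 over exactly matching triples (a simplifying assumption)."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical task: the agent recovers one of three gold facts plus one wrong fact.
gold = [
    ("laptop", "budget", "under 1000 USD"),
    ("laptop", "use_case", "video editing"),
    ("laptop", "weight", "under 1.5 kg"),
]
predicted = [
    ("laptop", "use_case", "video editing"),
    ("laptop", "brand", "any"),
]
print(triple_f1(predicted, gold))
```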

[NLP-120] Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents?

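[Quick Read]: This paper addresses the lack of controlled comparisons among the rapidly growing training methods for search agents built on large language models (LLMs). Existing work differs in retrieval corpora, reward designs, and training protocols, so it is unclear which factors actually drive improvements. The key to the solution is a controlled empirical study that isolates three under-explored training dimensions: (1) it identifies a critical data-coverage issue in the widely used Wikipedia 2018 corpus and shows that correcting it alone yields larger gains than the differences between training algorithms; (2) it systematically compares outcome-based and process-based rewards across three base models, finding that the simplest outcome-based approach is competitive or superior in most settings and that process-level credit assignment can over-correct agent behavior; and (3) it analyzes training data diversity, off-policy data utilization, and search budget scaling, and distills practical training guidelines.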

Link: https://arxiv.org/abs/2605.27881
Authors: Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li
Affiliations: East China Normal University; Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments: 18 pages, 4 figures, and 15 tables

Abstract:Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training. First, we identify a critical data-coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome-based and process-based reward methods across three base models, finding that the simplest outcome-based approach achieves competitive or superior performance in most settings, and that process-level credit assignment can over-correct agent behavior. Third, we analyze training data diversity, off-policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at this https URL.
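
An outcome-based reward scores only the final answer and ignores the intermediate search steps. A minimal sketch, assuming a normalized exact-match criterion (a common choice in open-domain QA, not confirmed as the paper's exact reward); the function names are illustrative.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(prediction, gold_answers):
    """Return 1.0 if the final answer matches any gold answer after normalization."""
    gold = {normalize(g) for g in gold_answers}
    return 1.0 if normalize(prediction) in gold else 0.0


print(outcome_reward("The Eiffel Tower.", ["Eiffel Tower"]))
print(outcome_reward("Paris", ["London"]))
```

A process-based reward would additionally score individual retrieval or reasoning steps, which is where the credit-assignment questions raised in the abstract arise.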

[NLP-121] Narrative Flattening: How Post-Training Compresses Thematic, Affective, and Stylistic Variation in LLM Fiction

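[Quick Read]: Fiction generated by large language models (LLMs) is fluent but widely seen as flat. This paper asks whether that flatness originates in training and whether it affects different domains of human fiction equally. The key to the approach is a matched story-continuation paradigm across three sources, StoryStar (public platform), TMAS (prompt-guided), and The New Yorker (professional literary), comparing continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text on three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Post-training compresses dynamic variation on all three: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity shrinks. The authors call this narrative flattening. The effect is directionally stable, but its size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps. This suggests that alignment pushes model output toward a uniform continuation regime that is largely insensitive to the narrative texture of the source text.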

Link: https://arxiv.org/abs/2605.27878
Authors: Zehan Li, Yutong Zhu, Siyang Wu, Honglin Bao, James A. Evans
Affiliations: University of Chicago
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models produce fluent fiction, yet their creative output is widely seen as flat. We ask where this quality originates in training and whether it affects different domains of human fiction equally. We construct a matched story-continuation paradigm across StoryStar (public-platform), TMAS (prompt-guided), and The New Yorker (professional literary), and compare continuations from four OLMo 32B checkpoints (Base, SFT, DPO, RLVR) against matched human text. Because these checkpoints share architecture, scale, tokenizer, and pretraining, the design isolates the post-training effect. We measure each continuation along three sentence-level dimensions: thematic motion, affective prevalence, and linguistic diversity. Across all three, post-training compresses dynamic variation: thematic transitions become more uniform, high-intensity emotions give way to neutrality, and stylistic diversity across stories shrinks. We term this progressive loss narrative flattening. The effect is directionally stable across story domains but gap size depends on the human baseline: professional literary fiction is compressed most, while public-platform and prompt-guided stories show smaller gaps, consistent with their human baselines sitting closer to the model’s default rhythm. Post-trained endpoints converge across domains, suggesting alignment produces a continuation regime largely insensitive to the source domain’s narrative texture.

[NLP-122] Syllabic-Structure Decoder for Automatic Speech Recognition in Vietnamese

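[Quick Read]: Conventional automatic speech recognition (ASR) systems predict orthographic units such as characters, subwords, or words, so they do not explicitly capture the phonetic structure of speech and need large vocabularies to maintain coverage. The key to the solution is a Syllabic-Structure Decoder that moves modeling from the orthographic level to the phoneme level and explicitly models the phonological composition of syllables, generating valid syllabic structures from a compact phonemic inventory. This design is closer to how speech is actually realized and substantially reduces vocabulary size. On two benchmarks (LSVSC and UIT-ViMD), the method outperforms strong baselines, including PhoWhisper and Wav2Vec2, despite using a smaller vocabulary and no additional training resources, supporting phoneme-based syllabic modeling for Vietnamese ASR.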

Link: https://arxiv.org/abs/2605.27874
Authors: Nghia Hieu Nguyen, Quan Ngoc Hoang, Long Hoang Huu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Most Automatic Speech Recognition (ASR) systems formulate transcription as a prediction problem over orthographic units such as characters, subwords, or words. Although effective, such representations do not explicitly reflect the phonetic structure of speech and often require large vocabularies to maintain adequate coverage. In this work, motivated by the phonemic features of Vietnamese, we propose a Syllabic-Structure Decoder for ASR, which models speech at the phoneme level instead of the orthographic level. Our approach explicitly captures the phonological composition of syllables, enabling the decoder to generate valid syllabic structures from a compact phonemic inventory. This design more closely aligns with the phonetic realization of speech while significantly reducing vocabulary size. Experimental results on two benchmarks (LSVSC, representing standard speech, and UIT-ViMD, a multi-dialect corpus containing diverse regional pronunciations) show that our method consistently outperforms strong previous baselines, especially pretrained baselines such as PhoWhisper and Wav2Vec2, despite using a substantially smaller vocabulary and no additional training resources. These results highlight the effectiveness of phoneme-based syllabic modeling for ASR in this language. Code for experimental reproducibility will be made publicly available upon acceptance of this paper.

[NLP-123] GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

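[Quick Read]: This paper addresses how to evaluate the pedagogical ability of AI tutors in student-tutor dialogues, beyond checking whether a response is correct. Factual accuracy alone is not enough: an effective tutor must identify mistakes, locate errors, provide guidance, and offer actionable next steps. The key to the solution is GRADE, a systematic study of open-source models on pedagogical ability assessment that covers several strategies (zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations). Its findings: Gemma3-12B performs best in the single-task setting, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction; synthetic augmentation helps models that struggle with the original data, whereas verification adds limited gains at higher cost; LoRA fine-tuning on structured classification objectives can interfere with instruction following under thinking mode, steering generation away from the required evaluation format; and model choice and reasoning mode substantially affect carbon emissions. The study shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions.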

Link: https://arxiv.org/abs/2605.27866
Authors: Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat
Affiliations: Santa Clara University
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 7 figures

Abstract:Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at this https URL.

[NLP-124] MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment

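[Quick Read]: This paper addresses the matching of submissions to reviewers at large academic venues. Existing methods either rely on coarse proxy signals (such as topical relatedness) that conflate general relatedness with true suitability, or require expensive human annotations that are hard to scale for training. The key to the solution is MERIT, a two-stage framework. In the first stage, a 4B-parameter reviewer assessor is trained with reinforcement learning, with rewards from a large language model (LLM) judge guided by paper-specific expertise rubrics; the assessor identifies the expertise dimensions a paper requires, matches them against the reviewer's prior work, and outputs a suitability decision. In the second stage, the assessor's predictions are distilled into an embedding-based retriever for efficient large-scale assignment. Experiments show that the assessor outperforms larger general-purpose LLMs on suitability classification and that the resulting retriever achieves state-of-the-art performance on LR-Bench and the CMU Gold dataset.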

Link: https://arxiv.org/abs/2605.27865
Authors: Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li
Affiliations: East China Normal University
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 8 figures, 12 tables

Abstract:Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer’s prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor’s predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at this https URL.
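
The second stage describes an embedding-based retriever. A minimal sketch of how such a retriever could rank reviewers at assignment time, assuming cosine similarity between a paper embedding and reviewer embeddings; the vectors, reviewer names, and function names are illustrative, and the distillation step that would produce the embeddings is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_reviewers(paper_vec, reviewer_vecs, top_k):
    """Return the top_k reviewer names, most similar first."""
    scored = sorted(
        reviewer_vecs.items(),
        key=lambda item: cosine(paper_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Hypothetical 2-d embeddings, purely for illustration.
paper = [1.0, 0.0]
reviewers = {"alice": [1.0, 0.0], "bob": [0.0, 1.0], "carol": [1.0, 1.0]}
print(rank_reviewers(paper, reviewers, top_k=2))
```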

[NLP-125] DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised Traceable Claim Verification

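[Quick Read]: This paper addresses a long-standing trade-off in claim verification: end-to-end classifiers are accurate but yield no inspectable traces, while decomposition-based methods produce traceable reasoning but lag on benchmark datasets. The key to the solution is DecomposeRL, which frames decomposition as a reinforcement learning (RL) policy trained with GRPO (Group Relative Policy Optimization) and a multi-faceted reward ensemble, so that it stays accurate while producing inspectable traces and can learn in both fully supervised and semi-supervised settings. To reduce the training cost of GRPO, the authors design a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. A 7B policy trained with full supervision on only about 5K curated claims reaches 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks, matching 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled data.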

Link: https://arxiv.org/abs/2605.27858
Authors: Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro
Affiliations: University of Maryland Baltimore County
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at this https URL.
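
The results above are reported as balanced accuracy, which is the mean of per-class recall and so is not inflated by a dominant class. A short sketch of the standard computation; the labels below are invented for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: three SUPPORTED claims, one REFUTED claim.
y_true = ["SUPPORTED", "SUPPORTED", "SUPPORTED", "REFUTED"]
y_pred = ["SUPPORTED", "SUPPORTED", "REFUTED", "REFUTED"]
print(balanced_accuracy(y_true, y_pred))  # recall 2/3 and 1/1, averaged
```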

[NLP-126] FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

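[Quick Read]: Code generation models based on large language models (LLMs) are trained mainly on imperative languages and support functional programming languages (FPLs) such as Haskell, OCaml, and Scala poorly; even frontier models perform substantially worse on them. Per-language fine-tuning fails to capture abstractions shared across languages, while merged multi-language fine-tuning introduces cross-language interference. The key to the solution is FPMoE, a lightweight code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional abstractions such as monadic reasoning and type-directed programming. The design addresses both failure modes at once: dedicated experts eliminate interference between languages, and the shared expert preserves general functional patterns that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models.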

Link: https://arxiv.org/abs/2605.27849
Authors: Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong
Affiliations: GreenNode AI, Singapore; Hanoi University of Science and Technology; Singapore University of Technology and Design
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
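
The layout of routed experts plus a shared expert can be sketched as follows. This toy version assumes hard routing by a language tag and uses scalar affine maps as stand-in experts; the actual router, expert networks, and combination rule in FPMoE are not described in the abstract, so every name here is hypothetical.

```python
class Expert:
    """Toy expert: an element-wise affine map standing in for a feed-forward block."""

    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def __call__(self, x):
        return [self.weight * v + self.bias for v in x]


class SharedPlusRoutedLayer:
    """Adds an always-on shared expert to one language-specific routed expert."""

    def __init__(self, routed_experts, shared_expert):
        self.routed_experts = routed_experts  # maps language tag to Expert
        self.shared_expert = shared_expert

    def forward(self, x, language):
        # Assumption: routing is a hard lookup on the language tag, so only
        # one routed expert runs per input (sparse activation).
        routed_out = self.routed_experts[language](x)
        shared_out = self.shared_expert(x)
        return [r + s for r, s in zip(routed_out, shared_out)]


layer = SharedPlusRoutedLayer(
    routed_experts={
        "haskell": Expert(2.0, 0.0),
        "ocaml": Expert(3.0, 0.0),
        "scala": Expert(4.0, 0.0),
    },
    shared_expert=Expert(1.0, 1.0),
)
print(layer.forward([1.0, 2.0], "ocaml"))
```

Only the selected routed expert and the shared expert run for a given input, which is how a model of this shape can keep its active parameter count well below its total.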

[NLP-127] CAREF: Calibration-Aware Regularization for Explanation Faithfulness Without Rationale Supervision

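[Quick Read]: This paper addresses how to improve both predictive accuracy and explanation faithfulness when fine-tuning large language models (LLMs) for downstream tasks, without relying on human-annotated rationale supervision. The key to the solution is CAREF, a parameter-efficient fine-tuning framework that couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness loss (LSCED). On four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI), the CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA.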

Link: https://arxiv.org/abs/2605.27835
Authors: Naphat Nithisopa, Teerapong Panboonyuen
Affiliations: Chulalongkorn University; MARSAIL (Motor AI Recognition Solution Artificial Intelligence Laboratory)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages

Abstract:We introduce CAREF, a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness via calibration-aware regularization. At its core, CAREF couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness (LSCED), without requiring rationale supervision. Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5, our lightweight CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA. To our knowledge, CAREF is the first method to unify entropy and sparsity regularization in a single training objective for interpretable LLM fine-tuning.

[NLP-128] Playing with Words, Improving with Rewards: Training Language Models for Creative Association

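[Quick Read]: This paper addresses how to train large language models (LLMs) for creativity without relying on subjective judgment. Human evaluation is subjective and limited, which makes it hard to scale. The key to the solution is an objectively verifiable reward: the models are trained on Codenames, a word-association game that exercises both divergent and convergent thinking and yields clear outcomes that can serve as the reward signal, enabling Reinforcement Learning with Verifiable Rewards (RLVR) without human annotation. Experiments show a scale-dependent precision-diversity trade-off: the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, while the smaller models (1.7B and 4B) gain reasoning precision at the cost of creativity. The authors present this as a scalable, verifiable route to training LLMs for creativity.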

Link: https://arxiv.org/abs/2605.27832
Authors: Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas, Claire Stevenson, Roger Beaty, Anna Rumshisky
Affiliations: University of Massachusetts Lowell; Zaqa.ai; Dartmouth College; University of Amsterdam; Pennsylvania State University; Amazon AGI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.
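
A verifiable reward of the kind described can be computed from the game state alone, with no human rating. The sketch below is one plausible scoring rule for a single clue, assumed here for illustration: credit each consecutive correct guess, stop at the first miss, and penalize the assassin word. The paper's actual reward design is not given in the abstract.

```python
def clue_reward(guesses, team_words, assassin):
    """Score one clue from the guesser's ordered guesses (assumed rule).

    Returns -1.0 if the assassin is guessed; otherwise the fraction of
    team words guessed correctly before the first wrong guess.
    """
    correct = 0
    for word in guesses:
        if word == assassin:
            return -1.0
        if word not in team_words:
            break  # the turn ends on the first non-team word
        correct += 1
    return correct / max(len(team_words), 1)


# Hypothetical board state.
team = {"river", "bank", "tree", "leaf"}
print(clue_reward(["river", "bank", "apple"], team, assassin="bomb"))
print(clue_reward(["river", "bomb"], team, assassin="bomb"))
```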

[NLP-129] Revealing Algorithmic Deductive Circuits for Logical Reasoning

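[Quick Read]: This paper asks how large language models (LLMs) come to understand the abstract meaning of individual reasoning steps and of the overall algorithm from only a few demonstrations. The approach has two parts: first, under a symbolic-aided Chain-of-Thought (CoT) prompting framework, it aligns each constituent reasoning step with its corresponding token logits; second, it uses causal mediation analysis to identify the attention heads responsible for specific reasoning patterns. The study finds that roughly 3% of attention heads specialize in retrieving factual and rule-based information for individual sub-reasoning tasks, whereas higher layers predominantly integrate information and support the emergence of global reasoning strategies (such as graph traversal algorithms) that coordinate multiple intermediate steps to solve the overall task.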

Link: https://arxiv.org/abs/2605.27824
Authors: Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue
Affiliations: Japan Advanced Institute of Science and Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.

[NLP-130] TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition

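[Quick Read]: This paper addresses the insensitivity of data-aware post-training quantization (PTQ) to lexical-tail words in automatic speech recognition (ASR). Conventional PTQ minimizes a per-token reconstruction loss on a small calibration corpus, which implicitly weights positions by frequency; rare words such as names, numerals, and domain-specific terms therefore receive little calibration mass, hurting ASR performance on exactly these words. The key to the solution is Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework with two parts: (1) rareBAL, a closed-form per-Linear-layer rule that equalizes the calibration mass of common and tail words; and (2) a metric-consistent residual correction. TARQ needs no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, it improves mean rare-word error rate (rare-WER) without an aggregate-WER regression and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.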

Link: https://arxiv.org/abs/2605.27808
Authors: Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen, Xiao-Wen Chang, Yixuan HE
Affiliations: McGill University; Boson AI; Arizona State University
Categories: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via rareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.
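
The idea of equalizing common/tail mass can be illustrated with a small sketch. This is not the paper's rareBAL rule, whose closed form the abstract does not give; it only shows one way per-position weights could be chosen so that rare and common tokens carry the same total calibration mass. The function name `equalize_mass` and the frequency cut-off `tail_threshold` are hypothetical.

```python
from collections import Counter

def equalize_mass(tokens, tail_threshold=2):
    """Per-position weights giving common and tail positions equal total mass.

    A token counts as "tail" if it occurs fewer than `tail_threshold` times
    (an assumed criterion, chosen only for illustration).
    """
    counts = Counter(tokens)
    is_tail = [counts[t] < tail_threshold for t in tokens]
    n_tail = sum(is_tail)
    n_common = len(tokens) - n_tail
    if n_tail == 0 or n_common == 0:
        # Nothing to balance: fall back to uniform weights.
        return [1.0 / len(tokens)] * len(tokens)
    # Half of the mass goes to each group, split evenly inside the group.
    return [0.5 / n_tail if tail else 0.5 / n_common for tail in is_tail]

tokens = ["the", "the", "the", "call", "call", "Xiao-Wen"]
weights = equalize_mass(tokens)
```

With uniform weighting the single rare name would carry one sixth of the mass; here it carries one half, which is the kind of shift toward the lexical tail the abstract describes.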

[NLP-131] ChildEval: When large language models meet childrens personalities ACL

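[Quick read]: This paper addresses the fact that the effectiveness of large language models (LLMs) in child-centered personalization remains unclear, because no benchmark systematically evaluates child-specific preferences. To fill this gap, the authors introduce ChildEval, a benchmark for evaluating how well LLMs infer and follow children's preferences in long-context conversations. The key to the approach is a dataset of 29K synthesized persona profiles of children aged 3-6, each associated with a preference expressed either explicitly (a single sentence) or implicitly (a 6-10 turn dialogue). The explicit and implicit forms reflect the same underlying preference, so the benchmark captures the dynamics of preference expression rather than changes in the static persona. The authors also design fine-grained, child-centric evaluation protocols to assess open-source LLMs systematically, and their results suggest that fine-tuning on ChildEval can improve child-centered performance.

Link: https://arxiv.org/abs/2605.27805
Authors: Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng
Affiliations: JIUTIAN Research, China Mobile, Beijing, China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages of main text (ACL Findings format), with references and appendix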

Abstract:While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs’ ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona), expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children’s daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at this https URL.

[NLP-132] A Fixed-Budget Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

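[Quick read]: This paper addresses measurement bias in the evaluation of multi-hop retrieval-augmented generation (RAG) systems: comparisons that use a large language model (LLM) as judge can be confounded by retrieval quality, answer length, lexical overlap, or statistical tests that ignore clustered data, which can misrepresent real model performance. The key to the solution is a minimum measurement standard that fixes the candidate pool (top-100), evidence budget, answer cap, generator, and prompt, and that requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and replication with a second judge. The standard is stress-tested with the Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC) on 400 multi-hop questions in computer science/machine learning and materials science. A binomial test makes all four semantic-baseline comparisons look significant, but under cluster-aware inference only one result remains significant after Bonferroni correction. This indicates that clustered benchmarks can overstate progress, and the authors argue that the field should adopt the standard.

Link: https://arxiv.org/abs/2605.27789
Authors: Camilo Chacón Sartori, José H. García
Affiliations: Catalan Institute of Nanoscience and Nanotechnology (ICN2), CSIC and BIST
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: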

Abstract:Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
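
An exact cluster sign-flip check can be sketched in a few lines. The version below is a generic illustration, not the authors' protocol: it assumes each cluster has already been reduced to one mean score difference between the two systems, and it enumerates every sign assignment, which is feasible only for a small number of clusters. The function name and the example differences are hypothetical.

```python
from itertools import product

def cluster_sign_flip_pvalue(cluster_diffs):
    """Exact two-sided sign-flip p-value over per-cluster mean differences."""
    observed = abs(sum(cluster_diffs))
    n_extreme = 0
    n_total = 0
    # Under the null, each cluster's difference is equally likely to be + or -.
    for signs in product((1, -1), repeat=len(cluster_diffs)):
        flipped = abs(sum(s * d for s, d in zip(signs, cluster_diffs)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total

# Five clusters, all favouring system A (illustrative numbers only).
diffs = [0.2, 0.1, 0.3, 0.25, 0.15]
p_value = cluster_sign_flip_pvalue(diffs)
```

Because the unit of resampling is the cluster rather than the individual question, five clusters can never yield a p-value below 2/32, however many questions each contains. That is the sense in which an item-level binomial test can overstate significance on clustered data.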

[NLP-133] Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use

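[Quick read]: This paper addresses language models' lack of self-knowledge about when to rely on their own parametric knowledge and when to call an external tool such as a calculator or retrieval system. Prompt-based methods can tell a model when to invoke tools but do not teach it to recognize the boundary of its own knowledge, and existing reinforcement learning (RL) methods use trajectory-level rewards, so they cannot assign credit to individual tool calls: they can neither identify which call actually helped nor penalize unnecessary ones. The key to the solution is CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model's own rollouts and decomposes each rollout at natural tool-use boundaries (such as code fence delimiters and context block transitions), assigning an independent advantage to each segment from a single binary outcome, without external judges or step-level annotations. This lets the model learn when internal knowledge suffices and when external help is needed. Across five benchmarks, CARL improves exact-match (EM) accuracy over the best RL baseline by 6.7 points at 7B and 9.7 points at 3B; the larger gain for the 3B model suggests that knowing when to ask matters most for models with smaller parametric memory.

Link: https://arxiv.org/abs/2605.27788
Authors: Abhijit Kumar, Zoey Wu, Mohit Suley
Affiliations: Microsoft
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: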

Abstract:Humans know when to reach for help, e.g. 347 × 28 warrants a calculator while 2+2 does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model’s own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model’s domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining ~10 EM points more accurate on them. Gains are largest at small scale: the 3B improvement is 1.4x the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.
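
The decomposition step, splitting a rollout at code fence delimiters, might look roughly like the sketch below. It covers only the segmentation; the critic and the advantage computation are not described in enough detail in the abstract to reproduce. The function name `split_at_fences` and the alternating reasoning/tool labelling are assumptions for illustration.

```python
FENCE = "`" * 3  # code fence delimiter, built this way to keep the example readable

def split_at_fences(rollout):
    """Split a rollout into alternating reasoning and tool-call segments.

    Text between a pair of fences is treated as a tool call; everything else
    is treated as free-form reasoning (an assumed convention).
    """
    segments = []
    for index, chunk in enumerate(rollout.split(FENCE)):
        if chunk.strip():
            kind = "tool" if index % 2 == 1 else "reasoning"
            segments.append((kind, chunk.strip()))
    return segments

rollout = (
    "I need 347 * 28.\n"
    f"{FENCE}python\nprint(347 * 28)\n{FENCE}\n"
    "The answer is 9716."
)
segments = split_at_fences(rollout)
```

Each resulting segment is a unit that could receive its own credit, so an unnecessary tool call can be penalized without penalizing the correct reasoning around it.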

[NLP-134] Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

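[Quick read]: This paper asks whether a language model's chain-of-thought (CoT) reasoning faithfully reports how the model decides between following a document and trusting its own knowledge when the document contradicts what it learned in training. The key to the approach is an evaluation framework called introspective faithfulness, tested across 200 questions, 8 models, and 4 prompt conditions to measure how CoT content relates to the final decision. CoT reasoning proves highly stable across opposite decisions (flip pairs retain 96% of same-answer similarity) and consists mostly of a decision-invariant display of knowledge (about 96%) with only a faint but genuine confidence signal. GPT-4o is the only model with statistically reliable reasoning-decision coupling, whereas Claude Sonnet 4.6 shows the widest confidence range but a near-zero pooled correlation, because the confidence-decision relationship reverses between conditions. The paper concludes that monitoring should read confidence scores rather than the argument itself.

Link: https://arxiv.org/abs/2605.27773
Authors: Pruthvinath Jeripity Venkata
Affiliations: Independent Researcher
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 8 tables, 3 appendices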

Abstract:When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model’s chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.

[NLP-135] UniMaia: Steering Chess Policies with Language for Human-like Play

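[Quick read]: This paper addresses how to make a domain-specific policy network (here, an Lc0-based chess policy network) semantically controllable without large-scale multimodal training, keeping both flexibility and domain knowledge. The key to the solution is the UniMaia framework: (1) a parameter-efficient text encoder and a ControlNet-style conditioning mechanism modulate a frozen chess policy network according to a prompt; and (2) a variant, UniMaia-Aux, adds auxiliary temporal conditioning and behavioral prediction objectives to improve behavioral modeling further. The method preserves the pretrained policy representations while enabling semantic control over gameplay such as opening selection and player strength, and it achieves state-of-the-art or competitive results on several benchmarks.

Link: https://arxiv.org/abs/2605.27767
Authors: Sherman Siu (1), Lesley Istead (1, 2) ((1) University of Waterloo, (2) Carleton University)
Affiliations: University of Waterloo; Carleton University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: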

Abstract:Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

[NLP-136] Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

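[Quick read]: This paper addresses the tendency of vision-language models (VLMs) used for optical character recognition (OCR) to produce plausible but visually unsupported text on low-resource historical documents such as Ancient Greek editions, which suggests reliance on language priors rather than on the image. The key to the approach is to combine controlled image perturbations with token-level grounding measures based on the difference between conditional and image-free decoding distributions, quantifying how much the model depends on visual evidence during generation. Under character-level perturbations VLMs diverge sharply from the perturbed ground truth, but reliance on the prior is model-specific: an OCR-specialist model produces fluent lexical errors with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to restore grounding reliably, and post-OCR language-model correction helps only by repairing text after generation. The results show that fluent output is not necessarily visually grounded and motivate interpretability-driven evaluation beyond aggregate accuracy.

Link: https://arxiv.org/abs/2605.27750
Authors: Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
Comments: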

Abstract:Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
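
A token-level grounding measure of the kind described, comparing the decoding distribution with and without the image, could be sketched as a KL divergence. This is an assumed instantiation: the abstract does not say which divergence the authors use, and the distributions below are made-up numbers over a three-token vocabulary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Next-token distributions at one decoding step (illustrative values).
with_image = [0.90, 0.05, 0.05]     # model conditioned on the page image
without_image = [0.30, 0.40, 0.30]  # same prefix, image removed

grounded_score = kl_divergence(with_image, without_image)

# A token the model would have produced anyway: the image changes nothing.
prior_driven_score = kl_divergence(without_image, without_image)
```

A score near zero marks a token driven by the language prior, while a large score marks a token whose prediction depends on the visual input, whether or not that token is correct.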

[NLP-137] Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization

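[Quick read]: This paper addresses a weakness of standard reinforcement learning post-training algorithms such as GRPO when applied to audio and omni-modal large language models: they apply a uniform policy gradient to all tokens and ignore how unequally tokens depend on the non-text modality (such as audio). During long chain-of-thought generation this leads to modality collapse, in which the model gradually abandons the original audio signal in favor of compressed textual priors and produces confident but ungrounded hallucinations. The key to the solution is Modality-Aware Policy Optimization (MAPO), a dual-branch reinforcement learning framework. First, it builds a modality relevance mask from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy, concentrating the policy gradient on modality-critical tokens. Second, an auxiliary attention loss branch applies a targeted, temporally scaled penalty to the model's internal attention distributions so that cross-modal grounding persists deep into the reasoning trace. The method relies only on native statistical signals rather than domain-specific inductive biases; on complex audio reasoning benchmarks it substantially improves long-horizon reasoning fidelity and multimodal instruction following, setting new state-of-the-art results among open-weight models on several benchmarks.

Link: https://arxiv.org/abs/2605.27741
Authors: Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo
Affiliations: Johns Hopkins University; Tencent Hunyuan
Categories: Computation and Language (cs.CL)
Comments: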

Abstract:Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model’s internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
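
One reading of the modality relevance mask is sketched below: a token is treated as modality-critical when removing the audio makes the model's prediction much more uncertain. The entropy-gap criterion and the `threshold` value are assumptions for illustration; the abstract does not define how the cross-modal differential entropy is computed or thresholded.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def modality_mask(policy_dists, ablated_dists, threshold=0.3):
    """Flag tokens whose entropy rises by more than `threshold` without audio."""
    return [
        entropy(ablated) - entropy(policy) > threshold
        for policy, ablated in zip(policy_dists, ablated_dists)
    ]

# Two token positions, three-token vocabulary (illustrative values).
policy = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]    # multimodal policy
ablated = [[0.34, 0.33, 0.33], [0.4, 0.3, 0.3]]   # audio-ablated reference
mask = modality_mask(policy, ablated)
```

Under this sketch the policy gradient would be concentrated on the first token, whose prediction depends on the audio, and not on the second, which the text alone already determines.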

[NLP-138] UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

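[Quick read]: This paper addresses the compute and memory bottleneck in long-context inference for large language models (LLMs), caused by the self-attention key-value (KV) cache growing linearly with context length. Top-k sparse attention mitigates this by loading only part of the KV cache, but estimating cache importance accurately and cheaply, for both training-free use and sparsity-aware training, remains difficult, especially with consistent performance across modalities. The proposed UNIQUE framework is a universal top-k sparse attention scheme that works at the granularity of KV pages and scores each page with a simple yet accurate measure: the mean of the page's keys serves as a representative vector and their standard deviation as an offset term. To narrow the train-inference gap, the authors add a soft-mask sparsity-aware training scheme that uses the per-query top-k score boundary as a threshold with a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance (on LongBench Pro and long-form speech recognition, for example) while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.

Link: https://arxiv.org/abs/2605.27740
Authors: Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li
Affiliations: Microsoft, USA
Categories: Computation and Language (cs.CL)
Comments: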

Abstract:Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately and cheaply estimating cache importance, for both training-free use and sparsity-aware training, remains challenging. This paper proposes UNIQUE, a universal top-k sparse attention framework that addresses both requirements and stays consistently effective across LLM modalities. UNIQUE operates at the granularity of KV pages and estimates per-page importance with a simple yet accurate score combining the mean of the page’s keys as a representative vector with their standard deviation as an offset term. To further close the train-inference gap, this paper introduces a soft-mask sparsity-aware training scheme that uses the top-k score boundary as a per-query threshold and a sigmoid soft mask around it, requiring neither auxiliary losses nor architectural changes. Experiments on text and speech LLMs show that UNIQUE preserves task performance on long-context benchmarks such as LongBench Pro and on long-form speech recognition, while delivering up to 11.4x attention-kernel speedup over FlashInfer dense attention and at least 5.3x end-to-end decoding speedup over a vLLM-based dense model.
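
The page score and the soft mask can be sketched as follows. The abstract says only that the score combines the mean of a page's keys with their standard deviation as an offset; the specific combination used here (dot product with the mean plus a standard deviation term weighted by the absolute query) is an assumption, as are the function names and the toy data.

```python
import math
from statistics import fmean, pstdev

def page_score(query, page_keys):
    """q . mean(keys) plus a std-weighted offset; the exact combination is assumed."""
    columns = list(zip(*page_keys))  # one tuple of values per key dimension
    mean = [fmean(col) for col in columns]
    std = [pstdev(col) for col in columns]
    dot = sum(q * m for q, m in zip(query, mean))
    offset = sum(abs(q) * s for q, s in zip(query, std))
    return dot + offset

def soft_mask(scores, k, temperature=1.0):
    """Sigmoid mask centred on the k-th largest score (the top-k boundary)."""
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]

query = [1.0, -2.0]
pages = [
    [[0.9, -1.1], [1.1, -0.9]],  # keys aligned with the query
    [[0.0, 0.0], [0.0, 0.0]],    # uninformative keys
    [[-1.0, 1.0], [-1.0, 1.0]],  # keys opposed to the query
]
scores = [page_score(query, page) for page in pages]
mask = soft_mask(scores, k=1)
```

At inference only the top-k pages would be loaded; during training the sigmoid replaces that hard cut with a differentiable weight that equals 0.5 exactly at the boundary.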

[NLP-139] UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

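[Quick read]: This paper addresses how an agent can better understand a user's mental state (beliefs and intentions) in order to provide more adaptive assistance. Most current methods handle Theory-of-Mind (ToM) tasks with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state, even though that state is what drives the user's behavior. The key to the solution is UserHarness, a framework that reframes ToM reasoning as explicit user-mind reconstruction: it decomposes the user's mental state (what the user observes, believes, and intends), its relation to the external environment, and the actions that follow. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, a relative improvement of more than 15% over existing inference methods and about 20% over the strongest prompt-only harness, which suggests that robust user understanding requires reasoning from the roots of the user's mind.

Link: https://arxiv.org/abs/2605.27721
Authors: Cheng Qian, Jiayu Liu, Heng Ji
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 Pages, 4 Figures, 2 Tables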

Abstract:Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user’s perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user’s mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user’s mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user’s mind, positioning user harnessing as a promising foundation for more adaptive future assistants.

[NLP-140] Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs

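[Quick read]: This paper addresses the gap between the strong mathematical reasoning of large reasoning models (LRMs) in English and their much weaker performance in low- and medium-resource languages. The gap does not stem only from failing to understand non-English problem statements; the language also affects the execution of reasoning itself. The key to the approach is DATG (Directed Acyclic Trace Graph), a framework that maps reasoning traces to language-independent mathematical anchors and dependencies, making it possible to measure whether reasoning in a given language covers the required mathematical nodes, respects dependency edges, and avoids harmful mathematical actions. Based on this diagnosis, the authors design two test-time controls, Loop-Retry and Formula-Retry, that target the failure modes DATG exposes. Experiments on the Qwen3 series across 12 languages show that both consistently improve reasoning performance in low-resource languages.

Link: https://arxiv.org/abs/2605.27715
Authors: Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl, Michael A. Hedderich, Hinrich Schütze, Yihong Liu
Affiliations: LMU Munich MCML
Categories: Computation and Language (cs.CL)
Comments: preprint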

Abstract:Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model’s reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
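
Anchor coverage and dependency fidelity can be illustrated on a toy reference DAG. The definitions below are assumptions made for this sketch (coverage as the fraction of reference nodes mentioned, fidelity as the fraction of reference edges whose endpoints appear in the right order); the paper's actual metrics may differ, and the anchor names are invented.

```python
def anchor_coverage(trace, nodes):
    """Fraction of reference anchors that appear anywhere in the trace."""
    return len(set(trace) & nodes) / len(nodes)

def dependency_fidelity(trace, edges):
    """Fraction of reference edges (u, v) where u first appears before v."""
    first_position = {}
    for index, anchor in enumerate(trace):
        first_position.setdefault(anchor, index)
    respected = sum(
        1
        for u, v in edges
        if u in first_position and v in first_position
        and first_position[u] < first_position[v]
    )
    return respected / len(edges)

# Reference DAG for a toy geometry problem (hypothetical anchors).
nodes = {"radius", "area", "circumference", "ratio"}
edges = {
    ("radius", "area"),
    ("radius", "circumference"),
    ("area", "ratio"),
    ("circumference", "ratio"),
}

# Anchors extracted from a target-language trace that skips one step.
trace = ["radius", "area", "ratio"]
coverage = anchor_coverage(trace, nodes)
fidelity = dependency_fidelity(trace, edges)
```

Because the anchors are language-independent, the same reference DAG can score traces written in any of the evaluated languages.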

[NLP-141] ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation

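[Quick read]: This paper addresses two problems in evaluating the mathematical reasoning of large language models (LLMs): existing benchmarks are mostly static and reused, so models may score well through memorization rather than genuine reasoning; and manually constructing high-quality math problems with reliable answers is costly. The key to the solution is ReverseMath, a scalable problem-generation method based on answer inversion. It masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer, which reverses the input-output relation and makes the answer known by construction. Paired original/reversed problems can reveal memorization-like behavior (for example, failing on the reversed problem or incorrectly outputting the original answer), and the generated problems also serve as data augmentation for reinforcement learning, improving performance on multiple mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2605.27709
Authors: Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich
Affiliations: Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning; Saarland University
Categories: Computation and Language (cs.CL)
Comments: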

Abstract:Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
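
Answer inversion is easiest to see on a one-step word problem. The sketch below uses fixed string templates purely for illustration; how ReverseMath actually rewrites problems is not specified in the abstract, and both function names are hypothetical.

```python
def make_problem(boxes, per_box):
    """Original problem: the total is the unknown."""
    question = (
        f"Tom has {boxes} boxes with {per_box} apples in each. "
        "How many apples does he have?"
    )
    return question, boxes * per_box

def reverse_problem(boxes, per_box):
    """Reversed problem: mask `per_box` and give the original answer as a condition."""
    _, total = make_problem(boxes, per_box)
    question = (
        f"Tom has {boxes} boxes with the same number of apples in each, "
        f"{total} apples in total. How many apples are in each box?"
    )
    return question, per_box  # the masked value is the new answer by construction

original_question, original_answer = make_problem(3, 12)
reversed_question, reversed_answer = reverse_problem(3, 12)
```

A model that outputs 36 for the reversed question is repeating the original answer instead of solving the new problem, which is the memorization-like behavior the paired evaluation is designed to expose.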

[NLP-142] TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

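[Quick read]: This paper addresses the fact that, when large language model (LLM) agents interact with tools and environments over multiple turns, safety risks often arise at intermediate steps and are hard to catch with after-the-fact auditing. Reactive auditing diagnoses only at the final outcome and so often misses the moment a risk unfolds. The key to the solution is TRACES, a representation-based proactive auditor that uses the hidden representations of an observer LLM to learn prefix-level trajectory risk states: it induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. TRACES is trained with weak trajectory-level supervision, avoiding costly and ambiguous step-level risk annotation, yet still produces dense prefix-level risk estimates. Across multiple agent safety benchmarks it improves both full-trajectory safety prediction and proactive risk discrimination, and analyses suggest that these risk states can help train safer agents.

Link: https://arxiv.org/abs/2605.27690
Authors: Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang
Affiliations: Brown University; The University of Texas at Austin; Rutgers University; Texas A&M University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: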

Abstract:LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.

[NLP-143] Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

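[Quick read]: This paper addresses the limited use that large language model (LLM) forecasting makes of uncertainty information, in particular how to exploit the rich signal in aggregated human forecasts (both the crowd probability estimate and the degree of agreement among forecasters) to improve calibration and accuracy. The key to the solution is the Beta-Bernoulli Calibrator (BBC), which uses joint supervision from binary outcomes and human forecasts to convert a point-estimate forecast from any model into a distribution over event likelihood (a Beta distribution), with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. BBC generally yields better calibrated and more accurate forecasts than traditional post-hoc calibration methods and models fine-tuned for forecasting, while remaining lightweight and generalizing well, and the epistemic uncertainty it captures predicts forecasting error more reliably than verbalized confidence.

Link: https://arxiv.org/abs/2605.27668
Authors: Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren
Affiliations: Agentic Learning AI Lab, New York University; The University of Chicago; Chronologies AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: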

Abstract:Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
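
The two quantities BBC reads off the Beta distribution follow from standard formulas: the mean is α/(α+β) and the variance is αβ/((α+β)²(α+β+1)). The sketch below only evaluates those formulas; how BBC predicts α and β from a model's forecast is not described in the abstract, and the example parameters are made up.

```python
def beta_summary(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total                                    # calibrated point forecast
    variance = alpha * beta / (total ** 2 * (total + 1))    # epistemic uncertainty
    return mean, variance

# Same point forecast (0.7), different concentration.
diffuse_mean, diffuse_var = beta_summary(7.0, 3.0)
sharp_mean, sharp_var = beta_summary(70.0, 30.0)
```

Both distributions give the same calibrated forecast, but the more concentrated one has roughly a ninth of the variance. This is how a Beta output can separate "70% and confident" from "70% and unsure", which a single point estimate cannot do.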

[NLP-144] Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

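[Quick read]: This paper addresses the mishandling of gender cues by generative translation systems in cross-cultural settings: when the source language (English) explicitly encodes gender, the translation into the target language (such as Hindi) should keep that gender information recoverable unless the source itself is ambiguous. The key to the solution is two mechanism-aware inference-time interventions. The Source-Aware Reranker (SAR) prefers candidates that avoid gender-neutralizing syntax, and the Phenomenon-Aware Reranker (PAR) preserves gender through targeted lexical marking even when ergative syntax, which obscures gender, remains. PAR substantially raises target-subset accuracy (from 11.07% to 54.47% for GPT-4o-mini, for example) at the cost of lower fluency, exposing the tradeoffs among fidelity, fluency, and stylistic naturalness in culturally situated generation.

Link: https://arxiv.org/abs/2605.27654
Authors: Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 2 figures, 9 tables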

Abstract:Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

[NLP-145] Disentangling Language Roles in Multilingual LLM Task Execution

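[Quick Read]: This paper addresses the lack of systematic control when evaluating the task-execution ability of multilingual LLMs whose instruction language, source content language, and response language do not coincide. Existing benchmarks rarely isolate these three language roles in a fully crossed design, which leaves the mechanisms of degradation unclear. The key to the solution is MTM-Bench, a controlled benchmark for multilingual task execution in which each instance is defined by a triplet $(L_\text{instr}, L_\text{content}, L_\text{resp})$. Across English, Spanish, and Chinese it enumerates all 27 triplets and contains 2,430 instances per model, covering semantic reversal, final-state extraction, and language purity with update realization. Using decomposed metrics (semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success), with scoring validated by a targeted human audit, the study finds that degradation is organized by the role a language occupies in the task structure rather than by the mere number of mismatches. The response-language role is the dominant factor, and a single response-slot mismatch accounts for most of the degradation. Mismatch count is not a monotonic predictor of difficulty, model-level ordering varies across systems, and semantic correctness alone does not capture reliable multilingual task execution.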

Link: https://arxiv.org/abs/2605.27649
Authors: Qishi Zhan, Minxuan Hu, Seoyeon Jang, Lei Zhao, Ziheng Chen, Man Liang, Xinyue Xiang, Jiaxin Liu, Guansu Wang, Liang He
Affiliations: Marquette; Cornell; UC San Diego; UPenn; UT Austin; Maryland; Michigan; UIUC; Melbourne; Stanford
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Multilingual LLMs are increasingly used when instruction, source content, and required response languages do not coincide. Existing benchmarks have expanded multilingual instruction-following evaluation, but they rarely isolate these three roles within a fully crossed design. We introduce MTM-Bench, a controlled benchmark for language-conditioned task execution in which each instance is defined by a triplet $(L_\text{instr}, L_\text{content}, L_\text{resp})$. Across English, Spanish, and Chinese, MTM-Bench enumerates all 27 triplets and contains 2,430 instances per model across semantic reversal, final-state extraction, and language purity with update realization. We evaluate 20 frontier and open-weight LLMs using decomposed metrics for semantic correctness, target-language adherence, constraint satisfaction, contamination ratio, and joint success, with scoring validated by a targeted human audit. The fully crossed design reveals that degradation is organized by the role a language occupies in the task structure, not merely by mismatch count. The response-language role is the dominant axis of variation, and a single response-slot mismatch accounts for most degradation. The response-only and full-mismatch comparison suggests that mismatch count is not a monotonic predictor of difficulty, with model-level ordering varying across systems. Task families fail through distinct channels, showing that semantic correctness alone does not capture reliable multilingual task execution.
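
The fully crossed design described in the abstract can be illustrated with a minimal sketch that enumerates the 27 (instruction, content, response) triplets over three languages. The language codes, the function names, and the pattern labels ("response_only", "full_mismatch", and so on) are assumptions made for illustration only; they are not taken from MTM-Bench itself.

```python
from collections import Counter
from itertools import product

# Assumed language codes for English, Spanish, and Chinese.
LANGUAGES = ("en", "es", "zh")


def enumerate_triplets(languages=LANGUAGES):
    """Return every (instruction, content, response) language triplet."""
    return list(product(languages, repeat=3))


def mismatch_pattern(triplet):
    """Label a triplet with an illustrative (hypothetical) pattern name."""
    instr, content, resp = triplet
    if instr == content == resp:
        return "all_match"
    if instr == content:
        return "response_only"  # only the response slot differs
    if len({instr, content, resp}) == 3:
        return "full_mismatch"  # all three roles use different languages
    return "other"


triplets = enumerate_triplets()
pattern_counts = Counter(mismatch_pattern(t) for t in triplets)
print(len(triplets), dict(pattern_counts))
```

Grouping triplets by which role is mismatched, rather than by how many roles are mismatched, mirrors the abstract's point that the role a language occupies matters more than the mismatch count.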

[NLP-146] Learning to Translate from Soft to Hard LLM Prompts

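[Quick Read]: This paper addresses the lack of interpretability of soft prompts used to adapt large language models (LLMs) to specific tasks. The key to the solution is to train a dedicated model that translates soft prompts into natural language. Quantitative and qualitative comparisons on multiple Datasets of Datasets (DoDs) show that the translator produces fluent, accurate verbalizations that outperform existing training-free methods such as InSPEcT. The study also points to a downstream application: soft prompts optimized on small open-source models can be translated into portable text prompts that, when deployed on larger closed-API models, exceed the performance of the original soft prompt and, in some cases, even few-shot learning.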

Link: https://arxiv.org/abs/2605.27642
Authors: Pitipat Kongsomjit, Suryansh Goyal, Jacob Whitehill
Affiliations: Worcester Polytechnic Institute
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 11 tables, 4 figures

Abstract:Soft prompt tuning is a parameter-efficient method for adapting LLMs to specific tasks, but suffers from a lack of interpretability. Building on recent work on interpreting soft prompts (Ramati et al., 2024), we explore how training a dedicated soft-prompt-to-natural-language translation model can yield higher translation quality. In particular, in both quantitative and qualitative comparisons on multiple Datasets of Datasets (DoDs), we demonstrate that our translator produces fluent, accurate verbalizations that outperform existing training-free methods like InSPEcT. In addition to advancing interpretability, our work suggests a promising downstream application: soft prompts optimized on small, open-source models can be translated into portable text prompts that, when deployed on larger closed-API models, exceed the performance of the original soft prompt and, in some cases, even few-shot learning.

[NLP-147] Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering SEMEVAL2026

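[Quick Read]: This paper addresses the weak performance of Large Language Models (LLMs) on culturally grounded knowledge in low-resource languages, specifically in multilingual, cross-cultural multiple-choice question answering. The key to the solution is a region-aware hybrid retrieval approach that combines BM25 lexical matching with dense semantic similarity and applies regional weighting heuristics to improve the relevance of retrieved documents. The retrieved documents are used to build a structured prompt for the quantized Qwen3-14B model, which selects answers deterministically from logits. Experiments show improved cross-lingual stability over pure parametric inference for culturally grounded question answering, but notable performance gaps remain between languages with more and less training data, indicating that retrieval augmentation does not fully overcome training data imbalance.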

Link: https://arxiv.org/abs/2605.27636
Authors: Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara
Affiliations: University of Tabriz, Tabriz, Iran
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 3 figures, accepted to the Everyday Knowledge Across Diverse Languages and Cultures shared task at SemEval2026

Abstract:Although Large Language Models (LLMs) demonstrate excellent capabilities and performance on general reasoning tasks in the public domain, they may face challenges with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, family, etc. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show improvements to cross-lingual stability with the hybrid retrieval approach over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that retrieval augmentation does not entirely overcome the limitations imposed by training data imbalance.
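
The fusion step of a region-aware hybrid retriever can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes BM25 and dense-similarity scores have already been computed per candidate, and the min-max normalization, the `alpha` mixing weight, and the additive `region_boost` are hypothetical choices that the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    bm25: float   # precomputed lexical (BM25) score
    dense: float  # precomputed dense similarity, e.g. cosine
    region: str   # region tag attached to the document


def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(candidates, query_region, alpha=0.5, region_boost=0.2):
    """Rank doc ids by a weighted lexical/dense score plus a region bonus."""
    if not candidates:
        return []
    lexical = min_max([c.bm25 for c in candidates])
    semantic = min_max([c.dense for c in candidates])
    scored = []
    for cand, lex, sem in zip(candidates, lexical, semantic):
        score = alpha * lex + (1.0 - alpha) * sem
        if cand.region == query_region:
            score += region_boost  # assumed heuristic: favor same-region docs
        scored.append((score, cand.doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]
```

Normalizing both score lists before mixing keeps the unbounded BM25 scale from dominating the bounded similarity scores; the top-ranked documents would then feed the structured prompt.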

[NLP-148] Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning

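[Quick Read]: This paper addresses error accumulation in Small Language Models (SLMs) on complex multi-step reasoning tasks, which is caused by their frequent hallucinations. Existing methods typically think first and then retrieve iteratively to reduce hallucination, but the authors argue that this strategy is not always necessary: SLMs are often accurately confident in their initial answer, and hallucinations can themselves help the model home in on the correct answer. The paper therefore proposes a cognitively inspired "answer first, reason later" framework: the model first answers the question quickly in a zero-shot manner (System-I), then uses that initial hypothesis to retrieve evidence from a knowledge source and reasons more deeply over it (System-II). By combining the two styles of thinking, the method outperforms traditional think-first approaches on several multi-step question-answering benchmarks.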

Link: https://arxiv.org/abs/2605.27596
Authors: Saptarshi Sengupta, Suhang Wang
Affiliations: The Pennsylvania State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first-reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.
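
The answer-first, reason-later control flow can be sketched as a three-step pipeline. This is only an illustration of the ordering described in the abstract: the function names, the query construction, and the toy stand-ins for the model and the knowledge source are all hypothetical, and a real system would call an SLM and a retriever instead.

```python
from typing import Callable, List


def answer_first_reason_later(
    question: str,
    quick_answer: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    deliberate: Callable[[str, str, List[str]], str],
) -> str:
    """Run System-I, retrieve with its hypothesis, then run System-II."""
    hypothesis = quick_answer(question)              # System-I: fast zero-shot guess
    evidence = retrieve(f"{question} {hypothesis}")  # the guess steers retrieval
    return deliberate(question, hypothesis, evidence)  # System-II: reason over evidence


# Toy stand-ins (assumptions for demonstration only).
TOY_CORPUS = ["Lyon is a city in France.", "The capital of France is Paris."]


def toy_quick_answer(question: str) -> str:
    return "Lyon"  # a deliberately wrong first guess


def toy_retrieve(query: str) -> List[str]:
    words = set(query.lower().replace("?", "").split())
    return [doc for doc in TOY_CORPUS
            if words & set(doc.lower().rstrip(".").split())]


def toy_deliberate(question: str, hypothesis: str, evidence: List[str]) -> str:
    for passage in evidence:
        if "capital" in passage:
            return passage.rstrip(".").split()[-1]
    return hypothesis  # fall back to the initial answer if nothing better is found


final = answer_first_reason_later(
    "What is the capital of France?", toy_quick_answer, toy_retrieve, toy_deliberate
)
print(final)
```

In the toy run the first guess is wrong, yet it still pulls in passages about the right country, which is the intuition behind treating a hallucinated hypothesis as a useful retrieval cue.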

[NLP-149] Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

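[Quick Read]: This paper addresses the poor scalability of traditional analytics systems in real-time streaming environments, where users must define queries by hand: as data evolves continuously and the space of potential insights grows very large, manually enumerating every analysis path is no longer feasible. The key to the solution is a multi-agent architecture that automates analysis through a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate the generated artifacts, and produce visualizations and deployable applications. The architecture uses Apache Kafka for event-driven coordination, Apache Flink for stream processing, and Large Language Models (LLMs) to implement specialized agents. Its core contribution is a contract-driven design based on typed intermediate artifacts, which provides modularity, observability, lineage, and safer execution of dynamically generated analytics.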

Link: https://arxiv.org/abs/2605.27571
Authors: Gaetano Rossiello, Dharmashankar Subramanian
Affiliations: IBM
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26 2026, San Jose, CA, USA

Abstract:Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

[NLP-150] Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

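[Quick Read]: This paper asks why Large Language Models (LLMs) cannot reliably perform causal discovery, in particular why they fail to distinguish between causal graphs that generate similar observational data. The key to the solution is Agentic Causal Bayesian Optimization (A-CBO), in which a frozen language model serves as an interventional oracle that answers targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the kernel obstruction affecting supervised fine-tuning, direct preference optimization, and in-context learning applies, A-CBO provably converges while the underlying model remains unchanged, and it significantly outperforms fine-tuning and preference optimization on the Extended Corr2Cause benchmark.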

Link: https://arxiv.org/abs/2605.27567
Authors: Amartya Roy, Sonali Parbhoo
Affiliations: IIT Delhi; Robert Bosch GmbH; Imperial College London
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures

Abstract:Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model’s internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

[NLP-151] The Future of Facts: Tracing the Factual Generation-Verification Gap WWW

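[Quick Read]: This paper studies the generation-verification gap (GV-gap) of language models on factual knowledge, that is, the observation that models verify existing answers more reliably than they generate new ones. The key to the approach is a systematic analysis of how the GV-gap evolves across three training phases (acquisition, continual learning, and updating), with experiments on four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state in which both old and new answers are verified as correct. These findings offer insight into how language models acquire and maintain factual accuracy.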

Link: https://arxiv.org/abs/2605.27564
Authors: Tim R. Davidson, Anja Surina, Caglar Gulcehre
Affiliations: EPFL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code for this project is available at this https URL, blog post at this https URL

Abstract:Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a “multi-verse” state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

[NLP-152] PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

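[Quick Read]: This paper addresses jailbreak attacks on multimodal generative AI systems, which exist but remain underexplored, with a focus on safety vulnerabilities in image generation. Safeguards for text generation are comparatively mature, whereas unsafe image generation can cause more severe harm and current defenses are still immature. The proposed framework, PAST2HARM, builds on two ideas: first, temporal deepening, which incrementally strengthens historical anchoring and archival cues to erode refusal boundaries across models with varying alignment strength; second, iterative escalation after initial compliance, which probes the upper bound of harmful generation and quantifies attack severity with a language model acting as a judge. In a black-box setting, the method reaches attack success rates of 83%, 67%, and 100% on three multimodal models, and adversarial prompts transfer across models. The results expose fundamental brittleness in current multimodal safeguards and highlight the need for stronger multimodal safety training.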

Link: https://arxiv.org/abs/2605.27545
Authors: Snehasis Mukhopadhyay
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state of the art multimodal text to image models. Building on prior findings that past tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models Gemini Nano Banana Pro, GPT Image 2, and SD XL achieving attack success rates of 83 percent, 67 percent, and 100 percent in a black box, gradient free setting. Adversarial prompts also transfer across models, with cross model success rates above 50 percent. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training. 

[NLP-153] Agentic Separation Logic Specification Synthesis

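[Quick Read]: This paper addresses the automatic synthesis of formal specifications for large C++ codebases. Existing LLM-based approaches fail to meet three requirements at once: scaling to large repositories, expressiveness (for example, support for dynamic memory and heap-allocated data structures), and systematic validation that rules out incorrect candidates. The key to the solution is Spec-Agent, an agentic system that uses static analysis and runtime heap tracing to select an appropriate level of specification language (from propositional logic up to first-order separation logic), generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidate specifications via counterexample-guided feedback. On open-source C++ codebases comprising millions of lines of code, Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, and outperforms Claude Code Opus 4.6 at 10x lower token cost.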

Link: https://arxiv.org/abs/2605.27531
Authors: Tarun Suresh, David Korczynski, Julien Vanegue
Affiliations: Bloomberg
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 9 pages, 3 appendices

Abstract:Specification synthesis, the task of automatically inferring formal specifications from program implementations and natural language, is important for refactoring, transpilation, optimization, and verification, yet remains an open challenge for large C++ repositories. Existing LLM-based approaches fail to simultaneously scale to such repositories, produce specifications expressive enough to capture systems-code features such as dynamic memory and heap-allocated data structures, and systematically validate those specifications to rule out incorrect candidates. We present Spec-Agent, an agentic system for synthesizing expressive, well-validated specifications across large C++ codebases. Spec-Agent targets a ladder of specification languages: propositional logic, first-order logic, propositional separation logic, and first-order separation logic. For each function, Spec-Agent uses static analysis and runtime heap tracing to select the appropriate target specification language, generalizes existing functional tests into fuzz harnesses, and iteratively refines LLM-generated candidates via counterexample-guided feedback. We evaluate Spec-Agent on open source C++ codebases comprising millions of lines of code. Spec-Agent synthesizes valid specifications for 85% of target functions, with no false positives observed under fuzzing and expert validation, outperforming Claude Code Opus 4.6 at 10x lower token cost.

[NLP-154] Debate Helps Weak Judges Reward Stronger Models

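[Quick Read]: This paper asks why debate, as a scalable oversight protocol, gives mixed results on programmatically verifiable code and logic tasks: clear gains in some settings and null effects in others, especially when no information is hidden from the judge. The key is to identify the conditions under which debate works: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. When both conditions hold (three of five model pairings), debate significantly improves judge performance over a consultancy baseline; when they do not (two pairings), debate has no effect and judge verification rates drop sharply. Ablating rebuttal rounds produces no measurable change in judge performance, which indicates that a single independent critique recovers most of debate's benefit. The paper therefore proposes a cheaper, training-free scalable oversight primitive (answer, critique, judge) and a pre-deployment audit that checks whether the critic beats the judge and whether the judge will verify the critic's claims.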

Link: https://arxiv.org/abs/2605.27483
Authors: Ethan Elasky, Frank Nakasako, Naman Goyal
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic’s classification ability must exceed the judge’s, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate’s gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic’s binary-classification ability and the judge’s are within noise of each other, and the critic’s disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate’s benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

[NLP-155] Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

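[Quick Read]: This paper addresses the interpretability of Transformer models with heterogenous attention structures. Such structures process information from different sources (for example, multimodal inputs), which enables more complex functions and cross-modal fusion but also raises new interpretation challenges. The key to the solution is an interpretation method designed specifically for heterogenous attention structures, together with an experimental analysis paradigm built on it, which supports both semantic and logical interpretation of the operating mechanisms of representative models for purposes of model understanding, research, and policy compliance.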

Link: https://arxiv.org/abs/2605.27458
Authors: Yongjin Cui, Xiaohui Fan, Huajun Chen
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

[NLP-156] REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

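[Quick Read]: This paper addresses the lack of transparency and trustworthiness of automated grading systems for open-ended responses. Neural and large language model (LLM) based graders perform well but are black boxes that educators find hard to verify and trust, while standard concept bottleneck models (CBMs) fall short for open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect reliability issues in human concept annotations. The key to the solution is REC-CBM, a rubric-aware error-correction concept bottleneck model with three components: (1) a rubric-aware concept encoder that learns concept-specific representations over responses; (2) an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions; and (3) a latent concept error-correction module that denoises concept predictions while preserving interpretability. Experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning, that each component contributes, and that the approach is applicable in realistic educational settings, giving educators automated grading they can inspect, intervene in, and trust.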

Link: https://arxiv.org/abs/2605.27402
Authors: Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu
Affiliations: Arizona State University; National Yang Ming Chiao Tung University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than both state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

[NLP-157] StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation ACL2026

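[Quick Read]: This paper addresses three shortcomings of Large Language Models (LLMs) in generating motivational interviewing (MI) dialogue: a lack of situational grounding, a lack of dynamic strategy control, and evaluation that is not aligned with clinical standards. The key to the solution is StoryMI, a multi-LLM agent framework that expands questionnaire-based client profiles into situational stories to give the dialogue narrative context, and coordinates three agents: a therapist agent, a client agent, and an interaction agent that dynamically selects MI codes to steer strategy across a multi-turn conversation. The paper also proposes a two-level evaluation protocol that combines lexical metrics and MI-specific measures of macro-level counseling strategies with LLM-as-judge and human expert assessments. Experiments show that situational grounding and macro-level strategy control can improve MI adherence and clinical plausibility, which supports the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation.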

Link: https://arxiv.org/abs/2605.27393
Authors: Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei
Affiliations: Vrije Universiteit Amsterdam; Bol.com; NII, Tokyo Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL2026

Abstract:Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by the interaction agent, while an interaction agent dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

[NLP-158] EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

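[Quick Read]: This paper addresses a bottleneck in speculative decoding for Large Language Model (LLM) inference: as vocabulary sizes grow, the output projection layer becomes the bottleneck, and existing static pruning methods suffer steep drops in acceptance rate in specialized domains or topic-switching scenarios because they cannot capture dynamic distribution shifts. The key to the solution is the EvoSpec framework, which has two core components: (1) a context-aware mechanism that retrieves critical long-tail tokens in real time via efficient semantic and statistical indexing, so that the draft vocabulary evolves dynamically; and (2) a lightweight online alignment strategy that uses curriculum learning to continually narrow the distributional gap between the draft model and the target model. Evaluations in specialized domains (coding, law, and medicine) show that EvoSpec overcomes the limitations of static baselines: on EAGLE-3 it achieves a 1.13x speedup over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.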

Link: https://arxiv.org/abs/2605.27390
Authors: Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.

[NLP-159] Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

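[Quick Read]: This paper addresses the limited ability of large language models (LLMs) to reproduce the "thick descriptions" of human communities: existing evaluations often reduce social identity to static labels and overlook how real groups respond to social shifts. The key to the solution is CARE (Community-Aware Reaction Evaluation), a reaction-centered evaluation framework that measures simulation fidelity by comparing LLM-generated discourse with the authentic, event-contingent responses of distinct communities to real-world news. CARE characterizes a fine-grained spectrum of illocutionary tones and the attitudes behind them, validated through human-AI collaboration. The diagnosis shows that steering LLMs with explicit community prompts does not inherently improve simulation fidelity, and that frontier models exhibit divergent behavioral signatures, which suggests that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.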

Link: https://arxiv.org/abs/2605.27388
Authors: Nuan Wen, Xuezhe Ma
Affiliations: Information Sciences Institute; University of Southern California
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: Preprint

Abstract:Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the “thick descriptions” (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest–validated through human-AI collaboration–our diagnosis reveals a persistent “realism gap”: steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

[NLP-160] From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons ACL2026

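[Quick read]: This paper addresses the structural mismatch between diffusion models and pre-trained autoregressive (AR) models: diffusion models rely on bidirectional attention whereas AR models are causal, so mature AR priors cannot be reused directly and researchers are forced into expensive pre-training from scratch. The key to the solution is the FLUID framework, which rests on two ideas. First, Strictly Causal Alignment lets the diffusion process be initialized seamlessly from an AR backbone (for example, a GPT-style checkpoint), avoiding repeated pre-training. Second, Elastic Horizons dynamically adjusts the denoising stride according to local information density rather than a fixed schedule, improving efficiency and performance. Experiments show that FLUID reaches state-of-the-art performance while substantially reducing training cost.

Link: https://arxiv.org/abs/2605.27387
Authors: Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang
Affiliations: Wuhan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2026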

Abstract:Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at this https URL.
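
One way to picture the entropy-driven stride selection described above is the minimal sketch below. It is not FLUID's implementation: the function names `normalized_entropy` and `elastic_stride`, the `min_stride` and `max_stride` bounds, and the linear mapping from entropy to stride are all assumptions made for illustration. The only idea taken from the abstract is that regions of low information density (low entropy) might be denoised in larger strides than high-entropy regions.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one distribution, scaled to the range [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(1.0, max(0.0, h / math.log(len(probs))))


def elastic_stride(local_probs, min_stride=1, max_stride=8):
    """Pick a denoising stride from the mean entropy of nearby positions.

    Hypothetical rule (an assumption, not taken from the paper):
    confident, low-entropy context -> long stride;
    uncertain, high-entropy context -> short stride.
    """
    mean_h = sum(normalized_entropy(p) for p in local_probs) / len(local_probs)
    span = max_stride - min_stride
    return max_stride - round(mean_h * span)


# Toy per-position predictive distributions over a 4-token vocabulary.
confident = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
print(elastic_stride(confident), elastic_stride(uncertain))
```

In this toy setting the one-hot predictions select the longest stride and the uniform predictions select the shortest one; a real system would likely use a learned or tuned mapping instead of the linear rule assumed here.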

[NLP-161] Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

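[Quick read]: This paper targets the performance bottleneck that scarce transcribed speech imposes on Spoken Language Models (SLMs) for speech synthesis in low-resource languages. The prevailing strategy compensates for missing real data with synthetic data, but the authors find that this introduces a fundamental trade-off, the Stability-Expressivity Gap: synthetic data improves phonetic accuracy yet progressively suppresses prosodic variability, eventually collapsing expressivity (termed Synthetic Erosion). The key to the solution is a pair of self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by separating prosody from timbre, and Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering when authentic reference samples are extremely limited. The approach outperforms commercial systems including ElevenLabs and Gemini Pro and enables the first zero-shot voice cloning for Lao.

Link: https://arxiv.org/abs/2605.27383
Authors: Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An, Ya Li, Xiaoyu Shen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: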

Abstract:Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.

[NLP-162] BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

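[Quick read]: This paper addresses two problems in cross-lingual biomedical entity linking (BEL): annotated data is scarce, especially for low-resource languages, and existing methods depend on SapBERT retrievers trained mostly on English aliases, which generalize poorly and offer limited context-aware disambiguation. The key to the solution is BioELX, a two-stage framework that needs no task-specific annotated data. Stage 1 enriches SapBERT training with multilingual aliases derived from Wikidata to improve cross-lingual candidate recall. Stage 2 uses a pre-trained large language model (LLM) as a ranker that jointly considers the mention context and the candidate entities for context-aware disambiguation. Experiments show that BioELX sets a new state of the art on multiple benchmarks, with especially large gains on low-resource languages: Recall@1 improves by 21.6 on Turkish, 22.1 on Korean, and 30.8 on Thai.

Link: https://arxiv.org/abs/2605.27380
Authors: Yi Wang, Corina Dima, Liangyu Zhong, Steffen Staab
Affiliations: University of Stuttgart, Germany; Technical University of Berlin, Germany
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 figures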

Abstract:Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supporting clinical and biomedical NLP applications. However, expert-annotated training data for BEL are costly, especially for low-resource languages. Moreover, many cross-lingual BEL systems rely on SapBERT-based retrievers trained on predominantly English aliases in the KB, leading to poor generalization to unseen non-English mentions and limited context-aware disambiguation. We propose BioELX, a two-stage cross-lingual BEL framework that requires no task-specific annotated training corpora. In Stage 1, we enrich SapBERT training with Wikidata-derived multilingual aliases and use the resulting retriever to improve cross-lingual candidate retrieval. In Stage 2, we perform context-aware disambiguation with a pre-trained LLM ranker that jointly considers the mention context and candidate, eliminating the need for supervised training. Experiments on five benchmarks (XL-BEL, EMEA, Patent, WikiMed-DE, and MedMentions) show that BioELX achieves new state-of-the-art performance. It improves average Recall@1 on XL-BEL by +19.2, with especially large gains for low-resource languages, e.g., +21.6 on Turkish, +22.1 on Korean, +30.8 on Thai, and delivers consistent improvements on EMEA (+6.2), Patent (+5.4), and WikiMed-DE (+12.8). Code and resources will be released upon publication.

[NLP-163] Soro: A Lightweight Foundation Model and Chatbot for Tajik

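[Quick read]: This paper asks how to provide an efficient, accurate conversational large language model (LLM) for Tajik under real-world constraints on compute and connectivity, so that it can be deployed locally in key areas such as education. The key steps are: first, Tajik-only continual pretraining of open-weight Gemma 3 checkpoints on a curated 1.9-billion-token corpus covering filtered web text, PDF documents, and curriculum-aligned educational materials; second, supervised instruction tuning on 40K Tajik teacher-style examples; third, a suite of open-sourced Tajik benchmarks covering general knowledge, linguistic competence, and school and university entrance-exam domains for rigorous evaluation; and finally, FP8 and INT4 quantization to reduce memory use for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

Link: https://arxiv.org/abs/2605.27379
Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shaymardonov, Shuhratjon Khalitbekov, Bonu Boboeva
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: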

Abstract:We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

[NLP-164] Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

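[Quick read]: This paper addresses two limitations of style control in existing prompt-based text-to-speech (TTS) models: they lack fine-grained style interpolation across utterances, and they cannot produce time-varying style transitions within a single utterance. The solution consists of two techniques. For inter-utterance control, direction vectors are computed between contrastive style prompts in the embedding space and interpolated, giving smooth style changes. For intra-utterance control, the authors observe a strong attention bias toward early tokens in autoregressive TTS decoders and introduce KV-cache swapping and sliding-window attention masking to reduce the dominance of the initial audio over later generation, enabling continuous style change within an utterance. Experiments on gender conversion, pitch adjustment, and speaking-rate change show strong results while maintaining high speaker similarity and perceptual smoothness.

Link: https://arxiv.org/abs/2605.27376
Authors: Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim
Affiliations: Sungkyunkwan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: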

Abstract:While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
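
The inter-utterance interpolation idea (a direction vector between two contrastive style prompts, followed by simple interpolation) can be sketched as follows. This is an illustrative toy, not the authors' code: the two-dimensional vectors stand in for real prompt embeddings, and the names `style_direction`, `interpolate_style`, and the strength parameter `alpha` are hypothetical.

```python
def style_direction(positive_emb, negative_emb):
    """Direction in embedding space from one contrastive prompt to the other."""
    return [p - n for p, n in zip(positive_emb, negative_emb)]


def interpolate_style(base_emb, direction, alpha):
    """Move the base embedding along the style direction by a factor alpha."""
    return [b + alpha * d for b, d in zip(base_emb, direction)]


# Toy embeddings for two contrastive prompts (assumed values), e.g. a
# "speaks slowly" prompt versus a "speaks quickly" prompt.
slow = [0.0, 1.0]
fast = [2.0, 3.0]

direction = style_direction(fast, slow)
halfway = interpolate_style(slow, direction, 0.5)
print(halfway)
```

With `alpha = 0` the base prompt embedding is unchanged and with `alpha = 1` it reaches the contrastive prompt; intermediate values give the continuous control the abstract describes. How the interpolated embedding is fed back into a particular TTS model is model-specific and is not shown here.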

[NLP-165] LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks

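[Quick read]: This paper addresses in-context reward hacking (ICRH), where large language models (LLMs) acting as autonomous agents over-optimize proxy objectives during continuous interaction with the environment and produce harmful side effects. Existing defenses do not handle this risk well because it stems from the model's own over-optimization rather than from adversarial inputs. The key to the solution is LLM-based Constraint Optimization (LCO), a framework that requires no model fine-tuning and has two modules: a self-thought module that guides the model to deliberate on and integrate potential safety constraints before execution, and an evolutionary sampling module that uses LLM-driven crossover and mutation to keep the model's actions within a safe solution space while maintaining task performance. Experiments show that LCO substantially alleviates ICRH in both output-refinement and policy-refinement scenarios: on the tweet engagement optimization task it reduces the Toxicity Growth Rate (TGR) of GPT-4 by 39%, and on the policy optimization benchmark it reduces the ICRH occurrence rate by 15.23%, without sacrificing task performance.

Link: https://arxiv.org/abs/2605.27375
Authors: Jiayong Wan, Jiawei Chen, Zhaoxia Yin, Liu Shuyuan, Hang Su
Affiliations: East China Normal University; Beijing Zhongguancun Academy; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: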

Abstract:Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model’s own over-optimization. To mitigate this issue, we propose LLM-based Constraint Optimization (LCO), a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: a self-thought module, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and an evolutionary sampling module, which employs LLM-based crossover and mutation to constrain the model’s actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.

[NLP-166] ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment EMNLP2025

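[Quick read]: This paper addresses personalized cover image generation: producing high-quality cover images on digital platforms that match both the content semantics and user preferences, in order to improve user engagement and recommendation quality. The key to the solution is the ICG framework, with three main ideas: 1) a multimodal large language model (MLLM) extracts semantic features from item titles and reference images via meta tokens and refines them with user embeddings for personalization; 2) a multi-reward learning strategy combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior, enabling optimization without labeled supervision; 3) a plug-and-play adapter bridges the MLLM and the diffusion model (DM) for end-to-end training, replacing pipelines that rely on handcrafted prompts and disjointed modules. The approach improves image quality, semantic fidelity, and personalization, and strengthens user appeal and recommendation accuracy in downstream tasks.

Link: https://arxiv.org/abs/2605.27374
Authors: Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai, Zhaocheng Du, Jiacheng Sun, Zhou Zhao, Zhenhua Dong
Affiliations: Huazhong University of Science and Technology; Huawei Noah’s Ark Lab; Hong Kong Polytechnic University; Zhejiang University
Categories: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12268-12278, EMNLP 2025. Official version: this https URL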

Abstract:Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.

[NLP-167] Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

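[Quick read]: This paper asks how autonomous intelligent systems can identify and quantify the human values expressed in text, so that decisions can account for ethical and moral considerations rather than utility maximisation alone. The key to the solution is a modular architecture based on large language models (LLMs) that separates the conceptualisation of human values from their detection. It has three coordinated modules: 1) generating structured value specifications from the foundational texts of any theoretical framework; 2) labelling texts using these specifications; 3) assigning graded support or resistance to each value based on rhetorical and semantic evidence. This design avoids the dependence on a specific value theory and the complex prompt engineering of earlier approaches, yields a scalable and reproducible process adaptable to various value theories, and is evaluated on the ValueEval dataset.

Link: https://arxiv.org/abs/2605.27373
Authors: Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski
Affiliations: Universidad Politécnica de Madrid; CETINIA, Universidad Rey Juan Carlos
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5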

Abstract:As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces a LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the tasks of conceptualising from detecting human values, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

Information Retrieval

[IR-0] Affective Music Recommendation: A Rollout-Based World Model for Offline Preference Optimization

Link: https://arxiv.org/abs/2605.28810
Authors: Audrey Chan, Aaron Labbé, Jacob Lavoie, Jordan Bannister, Arsène Fansi Tchango, Guillaume Lajoie, Laurent Charlin
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR); Sound (cs.SD)
Comments:

Abstract:Functional music applications, from consumer focus and sleep aids to clinical interventions, share a distinctive recommendation problem: success is defined by the listener’s affective state, but online experimentation on emotion is ethically constrained, particularly for clinical populations who cannot reliably skip a song or report distress. We describe AMRS, the Affective Music Recommendation System deployed on LUCID’s health-and-wellness platforms, which serve clinical users (primarily older adults with neurocognitive conditions) and consumer-wellness users across energize, focus, calm, and sleep modes. AMRS is built around a rollout-based world model: a causal transformer trained on logged listening data to jointly predict engagement, binary rating, and self-reported valence and arousal. The world model serves both as an in-silico simulator for offline policy training and as a stress-testing tool before deployment. A recommender policy initialized by behaviour cloning is fine-tuned offline with Direct Preference Optimization (DPO) against a configurable multi-objective utility function. Under a strict cold-start protocol, the world model predicts both behavioural and affective signals with usable fidelity; DPO improves predicted valence and arousal over the cloned baseline while maintaining a similar diversity profile and avoiding the distributional collapse produced by greedy optimization. We position the work as an early deployed validation of a methodology for affective recommendation when online experimentation is ethically untenable.

[IR-1] Personal Visual Memory from Explicit and Implicit Evidence

Link: https://arxiv.org/abs/2605.28806
Authors: Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Project Page: this https URL

Abstract:Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states – both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual–text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.

[IR-2] Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Link: https://arxiv.org/abs/2605.28787
Authors: Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like this http URL has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using this http URL. We deploy an “LLM-as-a-judge” evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers “Last-Mile Utility” failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.

[IR-3] Subtraction Gets You More: Gap-Aware Retrieval for Multimodal Multi-Hop QA

Link: https://arxiv.org/abs/2605.28641
Authors: Sunah O, Jay-Yoon Lee
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:In multimodal multi-hop question answering, we focus on the initial retrieval stage via two distinct tasks: (1) evidence set completion, retrieving missing evidence given context, and (2) sequential pool construction, iteratively building the top-K pool from scratch. Under these settings, we point out that conventional iterative retrieval frameworks often suffer from Semantic Anchoring, where previously fetched evidence traps the retriever and yields entity-centric redundancy. To break this trap, we propose GRAIL (Gap-aware Retrieval via Adaptive Implicit Localization), a paradigm that performs implicit query rewriting directly at the embedding level. By context-subtractive query steering, GRAIL excels at compositional cross-modal reasoning, while additive embedding updates show strength on localized information aggregation. By dynamically routing queries based on task type, our Hybrid Framework achieves a 40.3% macro-averaged performance gain on MultimodalQA. Extensive evaluations demonstrate that sequential GRAIL retrieves in a superior, noise-resilient manner, significantly expanding the search horizon through iterative gap-aware optimization.
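
The context-subtractive query steering mentioned above can be illustrated with a small embedding-level sketch. This is a guess at the general shape of the idea, not GRAIL itself: the subtraction rule `query - beta * mean(retrieved)`, the weight `beta`, and all function names are assumptions, and the two-dimensional vectors are toy stand-ins for real multimodal embeddings.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0.0 else [x / norm for x in vector]


def subtractive_query(query, retrieved, beta=1.0):
    """Steer the query away from evidence that has already been retrieved."""
    context = mean_vector(retrieved)
    return l2_normalize([q - beta * c for q, c in zip(query, context)])


def rank(query, candidates):
    """Return candidate names sorted by dot-product similarity to the query."""
    def score(name):
        return sum(q * c for q, c in zip(query, candidates[name]))
    return sorted(candidates, key=score, reverse=True)


query = [1.0, 1.0]              # needs both the known entity and a missing hop
already_retrieved = [[1.0, 0.0]]  # evidence about the known entity
candidates = {"same_entity": [1.0, 0.0], "missing_hop": [0.0, 1.0]}

steered = subtractive_query(query, already_retrieved)
print(rank(query, candidates), rank(steered, candidates))
```

In the toy example the unsteered query scores both candidates equally, whereas the steered query prefers the candidate that covers the part of the query not yet explained by retrieved evidence, which is the redundancy problem the abstract calls Semantic Anchoring.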

[IR-4] Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Link: https://arxiv.org/abs/2605.28565
Authors: Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Working Progress

Abstract:Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.

[IR-5] Search for Coverage: Learning Coverage-Aware Retrieval with Augmented Sub-Question Answerability

Link: https://arxiv.org/abs/2605.28522
Authors: Jia-Huei Ju, Eugene Yang, Trevor Adriaanse, Suzan Verberne, Andrew Yates
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Long-form Retrieval-Augmented Generation (RAG) brings the challenge of coverage-based ranking, because ranking methods must ensure the inclusion of comprehensive relevant nuggets (i.e., facts), which can thereby be synthesized into a comprehensive output. In this work, we propose CoveR, a dense retrieval method optimized for coverage-aware retrieval scenarios (our code is available at this https URL). CoveR is a bi-encoder trained with coverage-based contrastive and distillation objectives, which enables CoveR to capture diverse aspects of information needs. To train CoveR, we create the SCOPE dataset (our training data is available at this https URL), which comprises 90K training pairs from Researchy Questions with synthetic coverage signals augmented from sub-question answerability judgments generated by LLMs. Our empirical experiments show that CoveR enhances nugget coverage by 10% over strong dense retrieval baselines without sacrificing its relevance-based retrieval capability. Further ablation studies validate the importance of our proposed learning method, showing that CoveR achieves a superior trade-off between relevance- and coverage-based ranking, which is essential for long-form RAG.

[IR-6] Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Link: https://arxiv.org/abs/2605.28510
Authors: Andrea Gurioli, Davide D’Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprinting-based plagiarism detectors, such as Winnowing, remain highly effective, yet the inspection requires comparing fragments of code to the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline HYBRIDSOURCETRACKER (HST). HST first narrows down a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. From 60-token windows upward, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
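
The second stage of such a pipeline (re-ranking a short candidate list with Winnowing fingerprints) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the HST implementation: the candidate list is given by hand in place of a vector search, `zlib.crc32` is an arbitrary choice of hash, the k-gram size `k` and window size `w` are toy values, fingerprint positions are dropped, and Jaccard overlap is an assumed similarity measure.

```python
import zlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    return [
        zlib.crc32(" ".join(tokens[i:i + k]).encode("utf-8"))
        for i in range(len(tokens) - k + 1)
    ]


def winnow(tokens, k=3, w=4):
    """Winnowing: keep the minimum hash from every window of w k-gram hashes."""
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def jaccard(a, b):
    """Overlap between two fingerprint sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def rerank(query_tokens, candidates):
    """Order candidate snippets (name -> tokens) by fingerprint overlap."""
    query_fp = winnow(query_tokens)
    scored = [(name, jaccard(query_fp, winnow(toks))) for name, toks in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


query = "def add ( a , b ) : return a + b".split()
# Stand-in for the output of a first-stage vector search (hypothetical data).
candidates = {
    "same": "def add ( a , b ) : return a + b".split(),
    "other": "for i in range ( 10 ) : print ( i )".split(),
}
ranked = rerank(query, candidates)
print(ranked[0])
```

Because fingerprinting is applied only to the few candidates returned by the first stage, the expensive exact comparison no longer has to scan the whole corpus, which is the motivation the abstract gives for the hybrid design.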

[IR-7] Looking Farther with Confidence: Uncertainty-Guided Future Learning for Sequential Recommendation

Link: https://arxiv.org/abs/2605.28493
Authors: Ziqiang Cui, Xing Tang, Peiyang Liu, Xiaokun Zhang, Shiwei Li, Xiuqiang He, Chen Ma
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Sequential recommendation effectively models dynamic user interests but continues to face challenges related to data sparsity. While self-supervised learning has alleviated this issue to some extent, most existing methods focus exclusively on immediate next-item prediction during training, thereby neglecting the rich information embedded in longer-term future interactions. Although a few studies have explored the utilization of future data, existing attempts typically apply future supervision signals with uniform intensity across all samples, which may lead to suboptimal solutions. In this paper, we propose an adaptive future learning framework, UFRec, which encourages the model to look further ahead when it is confident in the current state, while focusing on the immediate task when it is uncertain. Specifically, UFRec incorporates an Uncertainty-Guided Future Supervision module that dynamically modulates the weight of multi-step future supervision based on the model’s confidence in the primary next-item prediction task. Furthermore, we complement step-wise future supervision with a Future-Aware Contrastive Learning module that treats the future trajectory as a holistic entity. Notably, both auxiliary modules are utilized exclusively during training and incur no inference overhead. Extensive experiments on four benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches by effectively leveraging future data.
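
A rough sketch of how confidence in the next-item prediction could modulate the weight of multi-step future supervision is given below. The abstract does not specify the weighting function, so everything here is an assumption: confidence is taken to be one minus the normalized entropy of the next-item distribution, and `lam` is a hypothetical scaling constant.

```python
import math


def confidence(next_item_probs):
    """Assumed confidence measure: 1 minus normalized entropy, in [0, 1]."""
    n = len(next_item_probs)
    if n < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in next_item_probs if p > 0.0)
    return min(1.0, max(0.0, 1.0 - h / math.log(n)))


def total_loss(next_item_loss, future_losses, next_item_probs, lam=0.5):
    """Next-item loss plus future-step losses weighted by model confidence."""
    if not future_losses:
        return next_item_loss
    weight = confidence(next_item_probs)
    mean_future = sum(future_losses) / len(future_losses)
    return next_item_loss + lam * weight * mean_future


# A confident prediction lets the future-step losses contribute ...
print(total_loss(1.0, [2.0, 4.0], [1.0, 0.0, 0.0, 0.0]))
# ... while a maximally uncertain one leaves essentially only the next-item loss.
print(total_loss(1.0, [2.0, 4.0], [0.25, 0.25, 0.25, 0.25]))
```

The weighting is needed only while computing the training loss, which is consistent with the abstract's remark that the auxiliary modules add no inference overhead.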

[IR-8] From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

Link: https://arxiv.org/abs/2605.28483
Authors: Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Learning Management Systems (LMS). However, manual tagging is labor-intensive, and fully automatic methods often lack transparency. In this paper, we present an end-to-end alignment pipeline that uses a large language model (LLM) as a constrained, evidence-producing tagger. LMS resources (both instructional content and assessments) are first segmented into meaningful pedagogical fragments. For each fragment, a small set of candidate competencies is retrieved from structured competency profiles enriched with graph-based context. The LLM then selects the most relevant competencies from this set and provides supporting evidence spans from the fragment text. These predictions are refined using the structure of the competency graph and aggregated at the resource level. We evaluate our approach on a dataset built from the Computer Science department’s competency referential at the Université de Technologie de Compiègne (UTC), covering 22 competencies across multiple course materials. Our LLM+BM25+Graph (LBG) pipeline achieves strong results, with a micro-F1 of 0.57 and macro-F1 of 0.50 at the fragment level, 0.51 macro-F1 at the resource level, and an MRR of 0.82, outperforming zero-shot and few-shot LLM variants, retrieval/similarity baselines, and supervised classifiers, while also producing more mechanically traceable evidence spans to support human auditing and educational analysis.

[IR-9] Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation

Link: https://arxiv.org/abs/2605.28222
Authors: Evgenii Palnikov,Elizaveta Gavrilova
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 13-page main body plus extended appendix; 6 figures; benchmark, LoRA adapters, and code at this https URL

Abstract:We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at this https URL.
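
The hybrid-retrieval pipeline above merges dense and sparse result lists with Reciprocal Rank Fusion. As a minimal sketch of that fusion step (not the authors' code): each document receives the sum of 1/(k + rank) over the lists it appears in. The constant `k=60`, the function name, and the document ids are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    k=60 is a commonly used smoothing constant; it is an assumption here,
    not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical dense and sparse result lists for one query.
dense = ["d1", "d2", "d3"]
sparse = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents ranked well by both retrievers rise to the top, which is why the fused list can then be handed to a reranker with a small candidate budget.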

[IR-10] Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

Link: https://arxiv.org/abs/2605.28187
Authors: Annabella Sánchez-Guzmán,Lukas Eberhard,Denis Helic,Lisette Espín-Noboa
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: 25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under-review)

Abstract:Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

[IR-11] Mixture-of-Experts Knowledge Graph Retrieval-Augmented Generation for Multi-Agent LLM-based Recommendation KDD2026

Link: https://arxiv.org/abs/2605.28175
Authors: Shijie Wang,Chengyi Liu,Yujuan Ding,Shanru Lin,See-Kiong Ng,Xu Xin,Wenqi Fan
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by KDD 2026 Research Track

Abstract:Large language models (LLMs) have recently been adopted for recommendations due to their ability to understand user intent and item semantics. However, LLM-based recommender systems often rely on parametric knowledge and suffer from outdated knowledge, motivating knowledge graph retrieval-augmented generation (KG-RAG) to ground recommendations on structured, up-to-date KGs. Despite this promise, effective KG-RAG in recommendations faces great challenges. First, users’ queries vary in complexity and require KG knowledge at different granularities, whereas existing methods adopt a one-size-fits-all retrieval strategy, leading to over-retrieval for simple queries and under-retrieval for complex ones. In addition, augmenting LLMs with KG knowledge requires translating graph-structured data into linear text, which may introduce noise and cause structural information loss. Moreover, the selection of retrieval granularity lacks direct supervision and must be inferred from the final recommendation after alignment and downstream utilization, making query-aware retrieval hard to learn end-to-end. To address these issues, we propose MixRAGRec, a cooperative multi-agent framework for KG-RAG recommendations. MixRAGRec integrates a Mixture-of-Experts Retrieval Agent that routes each query to a KG retrieval expert with different granularities, a Knowledge Preference Alignment Agent that converts structured knowledge into LLM-friendly natural language, and a Contrastive Learning-reinforced Recommendation Agent trained with contrastive preference feedback. Notably, we introduce Mixture-of-Experts Multi-Agent Policy Optimization (MMAPO) to train three agents under a unified objective. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework.

[IR-12] A Wolf in Sheep's Clothing: Targeted Routing Hijacking in Federated RAG

Link: https://arxiv.org/abs/2605.28112
Authors: Junjie Mu,Qiongxiu Li
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under review. Code available at this https URL

Abstract:Federated Retrieval-Augmented Generation (FedRAG) is attractive for privacy-sensitive applications because raw data remain local. As a result, routing must rely on client-provided semantic profiles, creating a new opportunity for manipulation. We introduce Routing Hijacking, a routing-stage attack in which a malicious client forges its profile to attract target queries despite having irrelevant underlying data. We show that this vulnerability is severe. Across three representative FedRAG routing architectures, Routing Hijacking consistently misroutes target queries and leads to downstream disruptions and failures, including missing evidence, poisoning, incorrect answers, and hallucinations. In a high-stakes MedQA-USMLE case study, we further show that poisoned retrieved evidence can mislead models across scales, leading to incorrect answers, hallucinations, and sycophantic failures. Existing defenses do not close this gap: encrypted routing preserves the exploited ranking, and Byzantine-robust Federated Learning (FL) rules transfer poorly to heterogeneous routing profiles. To address this gap, we propose a trust-aware post-routing framework that reweights clients using returned-evidence feedback, including retrieval relevance, profile consistency, and cross-client agreement; online experiments show that it suppresses persistent hijacking over recurring queries and transfers to a learned neural router. Our findings establish routing integrity as a new security challenge in FedRAG and highlight the need for stronger defenses for secure federated retrieval.

[IR-13] SilentRetrieval: Hijacking Retrieval-Augmented Generation via Semantically-Preserving Adversarial Data Poisoning KDD’26

Link: https://arxiv.org/abs/2605.28074
Authors: Jiachen Qian
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 12 pages, 4 figures, KDD '26 camera-ready version

Abstract:Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations but introduces a critical vulnerability: corpus integrity. We present SilentRetrieval, a two-stage data poisoning attack that hijacks RAG systems through adversarially crafted yet fluent documents. Stage 1 uses Coordinated Beam Search, a multi-token joint optimization method with a fluency-similarity objective, to keep a poisoned host document retrievable while constraining perplexity. Stage 2 uses Context-Adaptive Trigger Generation, a lightweight trigger-fusion step driven by a frozen LLM, to integrate manipulation triggers into document content. Under a one-poisoned-document-per-query evaluation with synthetic target answers, SilentRetrieval achieves 84.6%/81.3% HR@10 and 57.5%/54.8% ASR-LLM on Natural Questions and MS MARCO, while maintaining near-benign perplexity. Cross-model evaluation across four target LLMs shows nontrivial effectiveness under a fixed trigger generator, and transfer tests against unseen retrievers, including ColBERT and commercial embedding models, yield 64.7% average HR@10 under the same injected-corpus protocol. In a sampled Wikipedia-scale evaluation, SilentRetrieval retains 74.2% HR@10 at a 0.016% poisoning ratio. Combined retrieval-side and generation-side defenses reduce attack success substantially but incur a latency trade-off. Human evaluation shows substantially lower flag rates than disfluent baselines, while remaining numerically more suspicious than benign content at the current sample size.

[IR-14] ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor

Link: https://arxiv.org/abs/2605.28062
Authors: Taiheng Pan
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 15 pages. Technical report

Abstract:We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory’s learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.

[IR-15] Can It Reach the Generator? Investigating the Survival of Prompt-Injection Attacks in Realistic RAG Settings

Link: https://arxiv.org/abs/2605.28017
Authors: Yu Yin,Shuai Wang,Bevan Koopman,Guido Zuccon
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments: 18 pages, 6 figures

Abstract:Recent generative engine optimisation (GEO) research has shown that prompt-injection attacks can push a target product to the top of an LLM’s recommendation list, with the strongest attacks reporting around 80% success and raising serious security concerns about RAG-based recommendation. However, these results assume the attacked document is always fed directly to the generator, bypassing the retriever and reranker. This is unrealistic: in deployed RAG systems, the attack modifies the document content, which can in turn change whether the document is retrieved and reranked highly enough to reach the generator at all. In this paper, we re-evaluate seven GEO attacks under a realistic three-stage pipeline (retriever → LLM reranker → LLM generator). We find that prior protocols substantially overstate attack effectiveness: gradient-based and instruction override attacks largely collapse before reaching the generator, and only LLM-driven prompt injections remain effective end-to-end. Our analysis further reveals that current GEO attacks are easily detectable: a lightweight prompt-injection guard finetuned on a small attack dataset already detects every attack. Our code and data are available at this https URL.

[IR-16] Beyond Similarity: Task-Aligned Retrieval for Language Models

Link: https://arxiv.org/abs/2605.27951
Authors: Zhixing Sun,Shenghe Xu,Tao Li
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation (RAG) ranks passages by semantic similarity to the input, implicitly assuming that semantic similarity is a reliable indication of applicability in downstream tasks. This assumption breaks down when task success depends not on topical relevance but on applying the correct rules, constraints, or procedural guidance. In such settings, the most useful context may be the rule triggered by the input rather than the most semantically similar passage. We propose Task-Aligned Retrieval (TAG), a retrieval framework that replaces similarity-based retrieval with applicability-based rule selection. TAG transforms source documents into traceable condition-action rules, identifies which rules apply to a given input through pairwise LLM judgments, and generates the output conditioned only on the selected actions. We empirically observe that across Wikipedia NPOV rewriting, HumanEval with PEP 8 compliance, and NBA transaction reasoning on RuleArena, TAG consistently outperforms standard RAG, with the largest gains in high-mismatch settings (up to 12.2%) while reducing retrieved context by up to 93%. These results suggest that, in rule- and instruction-governed tasks, retrieval should optimize for applicability rather than for semantic similarity alone.

[IR-17] Fine-Tuned LLM as a Complementary Predictor Improving Ads System

Link: https://arxiv.org/abs/2605.27856
Authors: Hui Yang,Daiwei He,Kevin Jiang,Taejin Park,Kungang Li,Jiajun Luo,Yuying Chen,Xinyi Zhang,Sihan Wang,Haoyu He,Yu Liu,Lakshmi Manoharan,David Xue,Shubham Barhate,Runze Su,Duna Zhan,Ling Leng,Siping Ji,Jinfeng Zhuang,Alice Wu,Leo Lu,Han Sun,Zhifang Liu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.

[IR-18] LRanker: LLM Ranker for Massive Candidates

Link: https://arxiv.org/abs/2605.27810
Authors: Tao Feng,Zijie Lei,Zhigang Hua,Yan Xie,Shuang Yang,Ge Liu,Jiaxuan You
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) have recently shown strong potential for ranking by capturing semantic relevance and adapting across diverse domains, yet existing methods remain constrained by limited context length and high computational costs, restricting their applicability to real-world scenarios where candidate pools often scale to millions. To address this challenge, we propose LRanker, a framework tailored for large-candidate ranking. LRanker incorporates a candidate aggregation encoder that leverages K-means clustering to explicitly model global candidate information, and a graph-based test-time scaling mechanism that partitions candidates into subsets, generates multiple query embeddings, and integrates them through an ensemble procedure. By aggregating diverse embeddings instead of relying on a single representation, this mechanism enhances robustness and expressiveness, leading to more accurate ranking over massive candidate pools. We evaluate LRanker on seven tasks across three scenarios in RBench with different candidate scales. Experimental results show that LRanker achieves over 30% gains in the RBench-Small scenario, improves by 3-9% in MRR in the RBench-Large scenario, and sustains scalability with 20-30% improvements in the RBench-Ultra scenario with more than 6.8M candidates. Ablation studies further verify the effectiveness of its key components. Together, these findings demonstrate the robustness, scalability, and effectiveness of LRanker for massive-candidate ranking.

[IR-19] Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

Link: https://arxiv.org/abs/2605.27706
Authors: Joan Vendrell Gallart,Solmaz Kia,Russell Bent,Michael Grosskopf
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

[IR-20] Joint Optimization of Relevance and Engagement in Multi-Task Ranking for E-Commerce with Efficient LLM Supervision

Link: https://arxiv.org/abs/2605.27704
Authors: Luming Chen,Jiaqi Xi,Raghav Saboo,Kenny Chi,Martin Wang,Sudeep Das,Danny Nightingale,Aditya Dodda,Elyse Winer,Akshad Viswanathan
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Optimizing industrial search ranking models solely for user engagement signals often introduces systematic biases, prioritizing popular or price-anchored items that may not satisfy semantic intent. We present a production-scale multi-task ranking system that integrates semantic relevance as a primary optimization objective, enabling explicit and controllable relevance-engagement trade-offs. Our architecture employs an ordinal relevance head that predicts cumulative probabilities over relevance thresholds, preserving the inherent ordering of labels. These outputs are integrated with engagement heads through a unified value model scoring function, enabling systematic balancing of semantic quality and short-term behavioral signals. To provide high-quality supervision for this multi-task framework, we utilize fine-tuned lightweight Large Language Models (LLMs) to generate three-level ordinal relevance labels: irrelevant, moderately relevant, and highly relevant. We address challenges regarding label distribution sensitivity and ensure high alignment with human annotations to enable efficient labeling for over 100 million query-item pairs. Evaluation across offline metrics, including NDCG@10, and online A/B experiments demonstrates that our approach significantly improves semantic alignment while preserving core engagement objectives.
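
An ordinal head that predicts cumulative probabilities over relevance thresholds can be read as follows. This is a rough sketch, assuming the head outputs P(label > k) for each threshold k over the three levels (irrelevant, moderately relevant, highly relevant). The linear `value_score` and its weights are hypothetical stand-ins for the paper's value model, whose exact form the abstract does not give.

```python
def ordinal_to_class_probs(cum_probs):
    """Convert cumulative outputs into per-class probabilities.

    cum_probs[k] = P(label > k), assumed non-increasing in k.
    """
    bounds = [1.0] + list(cum_probs) + [0.0]
    return [bounds[i] - bounds[i + 1] for i in range(len(bounds) - 1)]


def expected_relevance(cum_probs):
    """E[label] equals the sum of P(label > k) over all thresholds."""
    return sum(cum_probs)


def value_score(cum_probs, p_engage, w_rel=0.5, w_eng=0.5):
    """Hypothetical linear value model mixing relevance and engagement."""
    rel = expected_relevance(cum_probs) / len(cum_probs)  # scaled to [0, 1]
    return w_rel * rel + w_eng * p_engage


# Hypothetical head output: P(label > 0) = 0.9, P(label > 1) = 0.4.
probs = ordinal_to_class_probs([0.9, 0.4])
print(probs, expected_relevance([0.9, 0.4]))
```

Raising `w_rel` relative to `w_eng` is one way to make the relevance-engagement trade-off explicit and controllable in this kind of scoring function.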

[IR-21] Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques

Link: https://arxiv.org/abs/2605.27656
Authors: Hussein Al Awad,Khaled Fathi Omar
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 11 pages, 5 figures, IEEE-style paper on semantic retrieval and explainable AI for intelligent job recommendation

Abstract:Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous collections of job postings. Keyword-based search is efficient and interpretable, but it may fail to retrieve relevant postings when equivalent roles are expressed using different terminology. This study presents a metadata-driven job recommendation system that combines TF-IDF lexical matching, Sentence-BERT semantic retrieval, query-aware filtering, optional Cross-Encoder re-ranking, and explanation generation. The proposed system utilizes structured metadata fields including job title, company name, location, seniority level, job function, employment type, and industry without relying on full job descriptions or user interaction histories. Experiments conducted on a cleaned LinkedIn job posting dataset containing 31262 records demonstrate that the best hybrid configuration achieved a Precision at 10 score of 0.8032 and an nDCG at 10 score of 0.9496. Under the internal evaluation protocol, Cross-Encoder re-ranking improved Precision at 10 from 0.7896 to 0.7948 and nDCG at 10 from 0.9666 to 0.9739. These findings indicate that lexical and semantic retrieval techniques can be effectively combined to provide explainable job recommendations when only structured metadata is available.

[IR-22] Eliot: Interactively Exploring Fast-Changing Scientific Literature Trends with Online Data and Learning CIKM

Link: https://arxiv.org/abs/2605.27610
Authors: Bernardo A. Denkvitts,Nitin Gupta,Biplav Srivastava
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under-review at CIKM Applied Research 2026

Abstract:The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-based assistants retrieve or summarize papers, but often hide how the corpus was selected, organized, or connected to temporal patterns. We present Eliot, a publicly deployed interactive system for traceable exploration of evolving scientific literature. Motivated by two studies on Large Language Models (LLMs) and Automated Planning and Scheduling (APS), Eliot generalizes literature-evolution analysis beyond hand-built taxonomies and domain-specific scripts. Given explicit query terms and filters, it retrieves arXiv papers at query time, represents each paper by title and abstract, clusters the corpus into themes, assigns representative keywords, and visualizes each cluster’s publication-year distribution. We evaluate Eliot as both an applied system and an interactive research aid. An offline configuration study across eight arXiv domains compares document representations, dimensionality reduction methods, and clustering algorithms using intrinsic clustering and topic-coherence metrics; the results support MiniLM embeddings with 10-dimensional UMAP and Agglomerative Clustering as a practical default. A scenario-based survey and expert focus group assess interpretability and use contexts: participants rated cluster labels as meaningful in 85% of scenario responses, and feedback indicated that Eliot is most valuable for auditable overviews of rapidly changing technical areas. These results suggest that query-time clustering and temporal inspection can complement search and generation tools by helping researchers inspect and refine the evidence behind literature trends.

[IR-23] On the Origin of Synthetic Information by Means of Steganographic Inheritance

Link: https://arxiv.org/abs/2605.27551
Authors: Ching-Chun Chang,Isao Echizen
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Abstract:The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring’s life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.

[IR-24] Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Link: https://arxiv.org/abs/2605.27494
Authors: Syed Huma Shah (Duke University)
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures, 10 tables. Code: this https URL

Abstract:Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when four cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), defined as the fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations (Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime (vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift (vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.
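
To make the four-gate admission rule and the USR metric concrete, here is a minimal sketch. It is not the released implementation: the `CacheEntry` layout, the Jaccard overlap, the token-containment support score, and every threshold are assumptions picked for illustration, and query similarity is taken as a precomputed number.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    answer: str
    evidence_ids: frozenset  # ids of evidence retrieved when the answer was cached
    source_versions: dict    # doc id -> version seen at caching time


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_support(answer, evidence_text):
    """Share of answer tokens that also occur in the fresh evidence."""
    tokens = answer.lower().split()
    if not tokens:
        return 0.0
    evidence_tokens = set(evidence_text.lower().split())
    return sum(t in evidence_tokens for t in tokens) / len(tokens)


def admit(entry, query_sim, fresh_ids, fresh_versions, fresh_text,
          sim_min=0.9, overlap_min=0.5, support_min=0.8):
    """Serve the cached answer only if all four gates hold (thresholds hypothetical)."""
    gates = (
        query_sim >= sim_min,                                    # query similarity
        jaccard(entry.evidence_ids, fresh_ids) >= overlap_min,   # evidence overlap
        all(fresh_versions.get(d) == v
            for d, v in entry.source_versions.items()),          # version validity
        lexical_support(entry.answer, fresh_text) >= support_min,  # lexical support
    )
    return all(gates)


def unsafe_served_rate(served_from_cache, was_wrong):
    """USR: fraction of all queries that received a wrong cached answer."""
    n = len(served_from_cache)
    if n == 0:
        return 0.0
    return sum(s and w for s, w in zip(served_from_cache, was_wrong)) / n
```

Note that the USR denominator counts all queries, not only cache hits, so a router that rarely serves from cache cannot look safe merely by having few hits to get wrong.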

[IR-25] Context Features Are Cheap: Rank-Aware Decomposition for Efficient Feature Interaction in Recommender Systems

Link: https://arxiv.org/abs/2605.27450
Authors: Yevgeny Tkach
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Modern industrial recommender systems use a deep ranking model to score N candidates against the same user and context features. Standard implementations broadcast context features early in the forward pass, redundantly computing context-only operations N times per request. We present a rank-aware decomposition applicable to the dominant interaction mechanisms in modern recommender architectures, namely Factorization Machine (FM) pairwise products, Deep Cross Network (DCNv2) cross layers, self-attention, and fully connected (FC) projection layers. It is built on a single algebraic principle: any linear or bilinear operation over a rank-partitioned input admits an exact block decomposition that moves context-only computation from once-per-candidate to once-per-request, identity-equivalent to the original model. Closed-form analysis and controlled ablation verify that savings scale quadratically with the number of context features. Applied to a production DLRM-style ranker without any architectural change, the decomposition increases per-pod throughput by 87.5% (a 47% reduction in peak pod count) at identical model predictions. The identity-equivalent decomposition applies only at the first layer of cross networks and self-attention, since each layer mixes ranks in its output. To extend savings across depth, we further introduce rDCN, an architectural variant of DCNv2 that maintains rank discipline across depth and matches DCNv2 accuracy within training noise at 67% fewer total FLOPs, and sketch an analogous architectural variant for self-attention.
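
The algebraic principle can be illustrated on the simplest case, a fully connected layer. The sketch below assumes the input is the concatenation of context and candidate features, so the weight matrix splits column-wise and W [c; x_i] = W_ctx c + W_cand x_i. Function names and the toy numbers are hypothetical; the FM, DCNv2, and attention cases in the paper are not shown.

```python
def matvec(W, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def naive_scores(W, context, candidates):
    # Broadcast approach: the context columns are multiplied once per candidate.
    return [matvec(W, context + cand) for cand in candidates]


def decomposed_scores(W, context, candidates):
    # Split W column-wise into context and candidate blocks.
    n_ctx = len(context)
    W_ctx = [row[:n_ctx] for row in W]
    W_cand = [row[n_ctx:] for row in W]
    ctx_part = matvec(W_ctx, context)  # computed once per request
    return [
        [c + x for c, x in zip(ctx_part, matvec(W_cand, cand))]
        for cand in candidates
    ]


# Toy layer: 2 outputs, 2 context features, 1 candidate feature.
W = [[1, 2, 3], [4, 5, 6]]
context = [1, 2]
candidates = [[3], [4]]
print(naive_scores(W, context, candidates))
print(decomposed_scores(W, context, candidates))
```

Both functions return the same outputs; the decomposed version simply performs the context block of the product once instead of once per candidate, which is where the per-request savings come from.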

[IR-26] Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

Link: https://arxiv.org/abs/2605.27449
Authors: Zhongtian Hua,Yi Luo,Meijia Yu,Yingjie Han
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the field of multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the downstream claim verification process. Existing general multimodal retrieval methods are often constructed based on semantics, resulting in retrieved evidence that is similar but not relevant to the claim. This paper proposes a Dynamic Adaptive Contrastive Learning method for evidence Retrieval, called DACLR, to address these issues. DACLR first uses a Multimodal Large Language Model (MLLM) to uniformly convert multimodal evidence and claims into the text modality, and extracts event-level features from this information. Then, it conducts evidence retrieval through a two-stage recall-rerank method. DACLR enhances the model’s event perception ability in the retrieval stage by optimizing the contrastive loss and mining hard negative samples. Specifically, DACLR designs three loss functions at two levels (semantic and event) based on the InfoNCE loss. According to these, three sets of hard negative sample candidates are set up. The model dynamically adjusts the ratio based on the accuracy supervision signal of intra-batch samples, allowing the model to learn the correlation between claims and positive samples at the event level without forgetting the semantic retrieval ability. Extensive comparison and ablation experiments demonstrate the effectiveness of DACLR and its internal optimization methods. Further research also proves the advantages of DACLR in the field of multimodal evidence retrieval.
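
For readers unfamiliar with the InfoNCE objective that DACLR builds on, the sketch below computes the standard single-positive form: the negative log of the softmax probability assigned to the positive pair among the positive and the negatives. It is the generic loss only, not DACLR's three loss variants or its dynamic ratio adjustment. The temperature of 0.07 and the similarity values are assumptions.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.07):
    """Standard InfoNCE for one claim: -log softmax of the positive similarity.

    sim_pos: similarity between the claim and its positive evidence.
    sim_negs: similarities between the claim and negative evidence.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]


# A hard negative (similarity close to the positive) yields a larger loss
# than an easy one, which is why hard-negative mining sharpens training.
easy = info_nce(0.9, [0.1])
hard = info_nce(0.9, [0.8])
print(easy, hard)
```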

[IR-27] RAGe: A Retrieval-Augmented Generation Evaluation Framework

Link: https://arxiv.org/abs/2605.27445
Authors: Larissa Guder,João Pedro de Moura,Arthur Accorsi,Gustavo Losch do Amaral,Maurício Cecílio Magnaguagno,Felipe Meneguzzi,Marcio Sorraglia Pinho,Dalvan Griebler
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract: Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective domain-specific RAG setups for their operational needs, facilitating rapid prototyping even on consumer-grade hardware.

[IR-28] A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

Link: https://arxiv.org/abs/2605.27444
Authors: Ruben Belo,Marta Guimarães,Cláudia Soares
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and scientific literature, creating challenges for timely decision-making in space operations. Effective management in space operations requires tools capable of efficiently processing vast and heterogeneous information sources. This paper systematically evaluates the performance of Retrieval Augmented Generation (RAG) pipelines, combining Large Language Models (LLMs) with information retrieval techniques for extracting and synthesizing actionable knowledge from domain-specific documents. We compare various retrieval strategies, embedding models, and LLM answers to assess their impact on information accuracy, relevance, and reliability. Our results demonstrate that RAG pipelines can significantly enhance knowledge access, reduce uncertainty, and support decision-making in complex space operations.

[IR-29] A Unified Structured Query Understanding Framework for Industrial Semantic Search KDD

Link: https://arxiv.org/abs/2605.27441
Authors: Ping Liu,Qianqi Shen,Jianqiang Shen,Chunnan Yao,Kevin Kao,Rajat Arora,Dan Xu,Baofen Zheng,Yunxiang Ren,Benjamin Le,Ali Hooshmand,Igor Lapchuk,Juan Bottaro,Raghavan Muthuregunathan,Caleb Johnson,Liangjie Hong,Jingwei Wu,Wenjing Zhang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted by KDD-ADS 2026

Abstract:Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn’s Job Search system. Furthermore, we demonstrate the framework’s horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.

[IR-30] Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

Link: https://arxiv.org/abs/2605.27440
Authors: Will Jack,Noah Lehman,Keller Maloney,Sarah Xu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Small changes to how a buyer phrases a question – “best CRM” vs “top CRM” vs “best CRM for a SaaS startup” – produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) – both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand’s “AI visibility” by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model’s behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
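The similarity measure used throughout this abstract is the Jaccard index of two recommendation sets. A minimal sketch follows; the brand names are placeholders rather than data from the paper, and `mean_pairwise_jaccard` is a hypothetical helper for averaging over several paraphrases of one buying intent.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(recommendation_sets):
    """Average Jaccard over all unordered pairs of recommendation sets."""
    pairs = list(combinations(recommendation_sets, 2))
    if not pairs:
        raise ValueError("need at least two recommendation sets")
    return sum(jaccard(x, y) for x, y in pairs) / len(pairs)


# Placeholder outputs for three paraphrases of the same intent.
runs = [
    {"BrandA", "BrandB", "BrandC"},
    {"BrandB", "BrandC", "BrandD"},
    {"BrandA", "BrandE", "BrandF"},
]
print(mean_pairwise_jaccard(runs))
```

Read this way, a value of 0.288 means that, on average, fewer than a third of the brands named across two paraphrases appear in both answers.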

[IR-31] Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37000-Run Audit

Link: https://arxiv.org/abs/2605.27439
Authors: Will Jack,Noah Lehman,Keller Maloney,Sarah Xu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating brands rather than returning a list of links. Marketing to AI is therefore a broader problem than “show up in search” – positioning, content, and product fit matter as much as discoverability. We audit ~37,000 production runs across four model configurations and 215 commercially-framed prompts spanning 19 sectors, evaluated against a 533-brand reference catalog stratified into five prominence tiers (L1 category leaders to L5 regional players) sourced from external authority lists. The ladder proxies a brand’s awareness footprint within its sector, not revenue or market share. The failure mode differs sharply by tier. L1 brands appear in nearly every relevant retrieval but win only 25-41% of the recommendation slots they reach – the leverage is differentiation, not visibility. L2 challengers carry the highest conversion rates of any tier (37-52%) but lose to persona-mediated substitution on the Anthropic models. L3 mid-market brands are the inflection level: aggregate coverage drops to 88%, conversion to 34-40%, and persona effects peak. L4 specialists and L5 regional players face catastrophic invisibility – 48-52% never surface in any of the 37,000 runs. No uniform optimization recipe wins; the right marketing investment depends on where the brand sits on the prominence ladder.

[IR-32] MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

Link: https://arxiv.org/abs/2605.27437
Authors: Tan Wang,Yunwei Dong
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract: Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness in long-term dialogue agents. External memory systems have been proposed to improve memory maintenance. However, these systems mainly rely on one-shot retrieval, which limits their ability to retrieve sufficient and relevant evidence. Although recent methods introduce reflection into retrieval, their retrieval paths are generated by the LLM from limited evidence, leading to unstable retrieval and additional latency overhead. To address these limitations, we propose MGRetrieval, a retrieval strategy that grounds reflective retrieval in the semantic structure of historical memories. Specifically, MGRetrieval consists of two steps: (1) It references the structure of historical memories to construct a more precise retrieval path. (2) The LLM retains critical memories and determines whether accumulated memories are sufficient to stop further iterative retrieval. This allows the retrieval process to follow semantically meaningful paths. Through memory-guided retrieval and critical memory propagation, MGRetrieval gradually constructs concise and sufficient memory contexts. Extensive experiments on LoCoMo show that MGRetrieval outperforms the strongest baseline by 8.91% in F1 and 11.11% in BLEU-1 on average across Qwen2.5-14B and Qwen3-14B, while maintaining practical token and latency costs. The code can be found in this https URL.

[IR-33] RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?

Link: https://arxiv.org/abs/2605.27436
Authors: Arijit Ghosh,Aritra Bandyopadhyay,Chiranjeev Bindra,Jingfen Qiao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at this https URL.
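The abstract describes the objective as minimizing the area of modality triplets on a hypersphere but gives no formula. The sketch below shows one plausible way to compute such an area, assuming the triangle's vertices are the L2-normalized text, video, and audio embeddings; the exact formulation in TRIANGLE may differ, and `triangle_area` is a hypothetical name.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def triangle_area(text, video, audio):
    """Area of the triangle whose vertices are three unit-normalized embeddings."""
    a, b, c = normalize(text), normalize(video), normalize(audio)
    u = [y - x for x, y in zip(a, b)]  # edge from text to video
    v = [y - x for x, y in zip(a, c)]  # edge from text to audio
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Gram-determinant form of the area; clamp tiny negatives from rounding.
    return 0.5 * math.sqrt(max(0.0, uu * vv - uv * uv))
```

The area is zero only when all three embeddings coincide or are collinear as points, so driving it down constrains the video-audio pair as well, which a purely text-anchored pairwise loss does not do.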

[IR-34] FD-RAG: Federated Dual-System Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2605.27432
Authors: Tianhao Gao,Kai Yang,Yiyang Li
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract: Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing RAG systems assume centralized knowledge access and ample computation. These assumptions break down in edge environments, where knowledge is fragmented across devices, raw data cannot be shared, and repeated LLM calls are prohibitively expensive. We propose FD-RAG, a federated dual-system RAG framework that decouples lightweight memory access from on-demand LLM reasoning for decentralized deployment. Specifically, FD-RAG learns semantic-aware adaptive hypergraphs over local corpora and distills them into compact QA memories. At inference time, it answers well-covered queries via direct memory matching and invokes LLM-based reasoning only when necessary, while tracing retrieved memories to hypergraph-grounded evidence. To mitigate cross-device knowledge fragmentation, FD-RAG aggregates anonymized memories across devices without exposing raw documents. Experiments on QA benchmarks show that FD-RAG improves accuracy by up to 7.8% while reducing latency by $8.4\times$ compared with strong local and federated baselines. We also provide theoretical analysis establishing an $\mathcal{O}(1/\epsilon^2)$ convergence rate for the proposed hypergraph learning, supporting its tractable deployment in edge settings.

[IR-35] Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Link: https://arxiv.org/abs/2605.27429
Authors: Wonkyun Kim,Sehyun Bae,Kwanki Ahn,Mungyu Bae,Saeun Choi,Soyeon You,Chandra Prabhakar,Sehyun Kim
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, token generation, model invocation, output parsing, and fallback handling for each request. In high-volume latency-sensitive services, these request-time operations complicate throughput planning, tail-latency control, capacity isolation, and predictable operation. This paper presents Ocean4Rec, a reranking layer that uses an LLM only offline to materialize item OCEAN profiles from content metadata. Items are mapped into Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, while user profiles are built by time-decayed aggregation of recently clicked and deep-linked items in the same five-dimensional space. At request time, Ocean4Rec joins precomputed item profiles, user profiles, base recommender scores, and catalog recency, then performs numeric reranking without an LLM call. On anonymized Samsung Smart TV VOD logs, same-candidate Top1000 temporal-holdout offline evaluations show that Ocean4Rec improves NDCG@20 over a stronger non-OCEAN Base+Recency ordering by 7.6% for an NCF generator and 61.5% for a LightGCN generator. HR@20 is inconclusive for NCF and improves by 67.3% for LightGCN, reflecting sparse exact-item replay labels and the strength of recency as an industrial baseline. The result should be read as offline replay evidence for a bounded auxiliary content-taste feature that preserves the deployability advantage of a request-time-LLM-free serving path.
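The request-time path described above (precomputed five-dimensional item profiles, a time-decayed user profile, and a purely numeric rerank) can be sketched as follows. The exponential half-life decay, the cosine taste score, and the linear weights are all assumptions made for illustration; the abstract does not state the actual decay function or scoring formula.

```python
import math


def user_profile(events, half_life_days=7.0):
    """Time-decayed average of item OCEAN profiles.

    events: non-empty list of (age_in_days, five_dimensional_item_profile).
    """
    weights = [0.5 ** (age / half_life_days) for age, _ in events]
    total = sum(weights)
    return [
        sum(w * profile[k] for w, (_, profile) in zip(weights, events)) / total
        for k in range(5)
    ]


def cosine(a, b):
    """Cosine similarity (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(candidates, profile, w_base=1.0, w_taste=0.5, w_recency=0.2):
    """Order candidates by a weighted sum of base score, taste match, and recency.

    candidates: dicts with keys "id", "base_score", "ocean", "recency".
    No LLM is called here; every input is precomputed.
    """
    def score(item):
        return (w_base * item["base_score"]
                + w_taste * cosine(profile, item["ocean"])
                + w_recency * item["recency"])
    return sorted(candidates, key=score, reverse=True)
```

The design point the abstract stresses is visible in the signature: `rerank` only joins and combines numbers, so its latency is predictable in a way that a per-request LLM call is not.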

[IR-36] Will AI be overconfident about academic research findings when reliant on abstracts? (v1)

Link: https://arxiv.org/abs/2605.27392
Authors: Mike Thelwall
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)

Abstract: Large Language Models (LLMs) like ChatGPT, DeepSeek and Gemini seem to be increasingly used for knowledge discovery, information retrieval, and knowledge summaries, including for academic topics. This can result in users being misled, for example by hallucinations. These problems may be exacerbated for academic knowledge if LLMs base their answers on journal article abstracts when they lack full text access. To test whether the information content of abstracts can be misleading, full text articles were submitted to GPT-OSS 120B, an LLM from OpenAI, asking it to assess separately the strength of the claims for the main result in the abstract, discussion, and conclusion. Outside the social sciences and humanities, claims tended to be stronger in the abstract and conclusions than the discussion, suggesting that relying on the strength of claims in abstracts would be misleading. Thus, if LLMs ingest abstracts but not full texts, there is a risk that they will be overconfident about the findings and pass this overconfidence on to users in response to relevant prompts. This is another reason to be cautious about using LLMs for academic-related knowledge discovery and summaries.

[IR-37] Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Link: https://arxiv.org/abs/2605.27389
Authors: Junsoo Park,Youssef Medhat,Htet Phyo Wai,Ploy Thajchayapong,Ashok K. Goel
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to ITS 2026

Abstract:We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual conditioning based on the current student question with memory-based conditioning using persistent learner information. Using deviation correlation and paired statistical tests, we find that contextual recommendations exhibit stronger question-level responsiveness, while memory-based recommendations exhibit history-dependent behaviors, including learner-specific differentiation under identical input. Teacher-facing evaluation signals suggest these recommendations are interpretable and actionable. These results indicate that embedding-based similarity metrics capture responsiveness to the current question but do not characterize personalization grounded in learner history, motivating behavior-level diagnostics for studying conditioning effects.

[IR-38] RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

Link: https://arxiv.org/abs/2605.27377
Authors: Yidong Gan,David D. Nguyen,Yang Lin,Peter Zhong,Thanh Vu,Long Duong,Yuan-Fang Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Additional experiments and analyses are in progress

Abstract:We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and grounds their coding decisions in external knowledge sources (e.g. the official coding tabular list and guidelines). By retrieving and cross-referencing relevant knowledge in these sources, the agents enhance coding accuracy and ensure clinical compliance. On the MDACE dataset, RAG-Coding outperforms the best LLM-based baseline by 8-13% in micro-F1 and 2-8% in macro-F1 across multiple LLM backbones. Compared to the state-of-the-art pretrained language model method, PLM-ICD, RAG-Coding exhibits higher micro recall (+11%), while PLM-ICD exhibits higher micro precision (+6%), yielding comparable micro- and macro-F1. Ablations show stepwise gains, highlighting the importance of incorporating external knowledge. We also release MDACE-2025, updating the original dataset with expert re-annotations with the latest 2025 ICD-10-CM guidelines. This update features more fine-grained code labels and enables evaluation against current clinical standards.

Human-Computer Interaction

[HC-0] CaMBRAIN: Real-time Continuous EEG Inference with Causal State Space Models

Link: https://arxiv.org/abs/2605.28792
Authors: Abhilash Durgam,Nyle Siddiqui,Jeffrey A. Chan-Santiago,Qiushi Fu,Elakkat D. Gireesh,Mubarak Shah
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 22 pages, 3 figures, 8 tables

Abstract: Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple of seconds to multiple hours, posing a major hurdle for existing deep learning methods due to two major factors: (1) existing EEG models are predominantly built upon the attention mechanism, incurring quadratic scaling as the sequence length increases, and (2) raw EEG signals must be processed in a sliding-window fashion due to fixed-length input requirements, preventing global understanding of the entire signal. To this end, we propose CaMBRAIN, the first Causal, Mamba-based state space model (SSM) capable of real-time inference of EEG signals, arguing that bidirectional approaches are needlessly expensive given the causal, unidirectional nature of EEG. However, training such a model is non-trivial, as crucial EEG events can be extremely brief (within fractions of a second) yet separated by long intervals spanning minutes. Current EEG methods use self-supervised objectives that optimize for signal reconstruction, but these are not well suited for streaming SSMs; they fail to explicitly train the hidden state to retain the salient long-range context needed for streaming inference. We therefore introduce a multi-stage self-supervised training pipeline specifically tailored to encourage long-range memory retention and strong performance on EEG signals, while preserving the linear-time complexity of state space models. CaMBRAIN achieves state-of-the-art (SOTA) results across 3 different EEG datasets with 10x higher throughput than existing models, enabling the first model capable of long-range, continuous inference of variable-length EEG signals.

[HC-1] AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

Link: https://arxiv.org/abs/2605.28680
Authors: Kuntal Ghosh,Marc Hassenzahl,Shadan Sadeghian
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted to CSCW 2026 / Proceedings of the ACM on Human-Computer Interaction (PACMHCI)

Abstract: The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about its experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI’s impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare workers anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.

[HC-2] Not All Uncertainty Is Equal: How Uncertainty Granularity Shapes Human Verification in LLM-Assisted Decision Making

Link: https://arxiv.org/abs/2605.28571
Authors: Mauricio Villavicencio,Sitong Pan,Qianwen Wang
Categories: Human-Computer Interaction (cs.HC)
Comments: 54 pages, 36 figures, accepted by ACM FAccT 2026

Abstract:Despite warnings that LLMs can make mistakes, users often develop inappropriate trust and accept incorrect answers without critical evaluation. Uncertainty quantification (UQ), displaying LLMs’ confidence, has emerged as a promising approach to calibrate user trust. However, prior empirical studies on uncertainty communication have treated uncertainty as a single numerical score or simple natural language expression. This simplification fails to capture a key property of LLM outputs: a single response often comprises multiple claims and reasoning steps, each with distinct levels of uncertainty. To address this gap, this study investigates uncertainty granularity (i.e., the extent to which uncertainty is expressed at different levels within an LLM response) and examines its impact on LLM-assisted decision-making. We conducted a large-scale, between-subjects study (N=192) in which participants answered medical questions using LLMs that displayed uncertainty at three different granularities: output-level (entire response), relation-level (individual reasoning steps), and token-level (specific words). Our findings reveal distinct behavioral effects as a function of uncertainty granularity. Token-level uncertainty increased users’ agreement with the AI, whereas output- and relation-level uncertainty did not increase agreement but instead reduced users’ confidence in their own answers. Notably, relation-level uncertainty also reduced external verification (i.e., internet searches, checking provided URLs), steering users away from independent fact-checking and toward reliance on the LLM and its accompanying uncertainty cues. Our findings demonstrate that uncertainty granularity significantly shapes how users interact with and verify LLM outputs, providing concrete design guidance for building responsible LLM applications that encourage appropriate skepticism and verification behaviors.

[HC-3] The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

Link: https://arxiv.org/abs/2605.28498
Authors: Mert Yazan,Frederik Bungaran Ishak Situmeang,Suzan Verberne
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Under review for Computers in Human Behavior

Abstract: Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth’s indirect effect on overreliance with implications for designing trustworthy conversational search systems.

[HC-4] GUI Agents for Continual Game Generation

Link: https://arxiv.org/abs/2605.28258
Authors: Yixu Huang,Bo Li,Na Li,Zhe Wang,Kaijie Chen,Haonan Ge,Qingyi Si,Yuanzhe Shen,Ruihan Yang,Guangjing Wang,Hongcheng Guo
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Abstract:Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass-rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at this https URL.

[HC-5] AI Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?

Link: https://arxiv.org/abs/2605.28255
Authors: Maharshi Gor,Yoo Yeon Sung,Yu Hou,Eve Fleisig,Irene Ying,Tianyi Zhou,Jordan Boyd-Graber
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Findings of the Association for Computational Linguistics, 2026

Abstract:AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice – deciding when to let AI act autonomously without knowing its output, and the adoption choice – evaluating AI suggestions and deciding how to use them. Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human–AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human–AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans’ initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

[HC-6] Building Community-Centred NLP Resources for Puno Quechua ACL2026

Link: https://arxiv.org/abs/2605.28253
Authors: Elwin Huaman,Adrian Gamarra Lafuente,Johanna Cordova,Anna Korhonen
Categories: Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC)
Comments: Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP 2026), co-located with ACL 2026

Abstract: The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings of scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.

[HC-7] Why Meditation Wearables Fail: Reward Misspecification in Closed-Loop EEG and Biofeedback Systems

Link: https://arxiv.org/abs/2605.28223
Authors: Joy Bose
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments: 16 pages, 1 figure

Abstract:Consumer EEG headbands, HRV biofeedback devices, and closed-loop neurostimulation systems share a fundamental design flaw: they reward measurable proxy signals rather than the outcomes they claim to produce. When a user optimises for calm EEG, HRV coherence, or breathing resonance, their brain learns to produce those signals through whatever strategy is most efficient, including strategies unrelated to the intended benefit. We formalise this as reward misspecification: the policy maximising proxy reward R_proxy is not the policy maximising true intended outcome V_target. This produces three failure modes: proxy mismatch, strategy shortcutting, and transfer failure. We review how existing devices including Muse, HeartMath, Unyte IOM2, and clinical neurofeedback systems instantiate these failures. We introduce a four-tier measurability taxonomy distinguishing reliably measurable wearable targets (Tier 1) from targets that are currently or possibly structurally unmeasurable (Tiers 3 and 4), and show that most devices make implicit Tier 3 and 4 claims. We propose a design framework that avoids all three failure modes: single Tier-1 target (mind-wandering onset via EEG), negative-only cueing, temporal separation of fast EEG and slow somatic feature streams, and transfer to unassisted practice as the only success criterion. No current product meets all four criteria. The framework has direct implications for the design, evaluation, and regulation of cognitive and contemplative wearables.
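The reward-misspecification claim in the abstract above can be made concrete with a toy example. The sketch below is illustrative only: the strategy names and every score are invented, and `r_proxy` and `v_target` merely stand in for the abstract's R_proxy and V_target. It shows the defining condition, namely that the strategy maximising the proxy is not the strategy maximising the intended outcome.

```python
# Hypothetical strategies a user could converge on during biofeedback.
# All scores are invented for illustration; none come from the paper.
strategies = {
    "intended_practice": {"r_proxy": 0.70, "v_target": 0.90},
    "shortcut_a": {"r_proxy": 0.95, "v_target": 0.10},
    "shortcut_b": {"r_proxy": 0.85, "v_target": 0.05},
}


def best_by(key):
    """Return the name of the strategy with the highest score for `key`."""
    return max(strategies, key=lambda name: strategies[name][key])


proxy_optimal = best_by("r_proxy")    # what the device's reward signal favours
target_optimal = best_by("v_target")  # what the user actually wanted

# The reward is misspecified when the two optima differ.
misspecified = proxy_optimal != target_optimal
print(proxy_optimal, target_optimal, misspecified)
```

In this toy setting a device that rewards only `r_proxy` would steer the user toward a shortcut, which corresponds to what the abstract calls strategy shortcutting.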

[HC-8] SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

Link: https://arxiv.org/abs/2605.28219
Authors: Gennady Andrienko, Natalia Andrienko
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Unsupervised learning methods – topic modeling, partition-based and density-based clustering – produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results – from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision – building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across ~1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context – yielding knowledge about the data that no single “best” result can provide.

[HC-9] The Illusion of Opting in AI-Mediated Consequential Decisions

Link: https://arxiv.org/abs/2605.28210
Authors: Eugene Yu Ji
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Neurons and Cognition (q-bio.NC)
Comments: 11 pages, 1 figure, 2 tables

Abstract:Drawing on Ullmann-Margalit’s concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

[HC-10] Robo-Blocks: Generative Scaffolding in End-User Design and Programming of Social Robots

Link: https://arxiv.org/abs/2605.28154
Authors: Arissa J. Sato, Callie Y. Kim, Nathan Thomas White, Abhinav Maneesh, Yuqing Wang, Hui-Ru Ho, Bilge Mutlu
Categories: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments:

Abstract:Programming social robots is challenging for novice robot programmers due to required expertise in planning, interaction design, and programming. While large language models (LLMs) hold significant promise through code generation from natural-language descriptions, they can obscure critical elements of programming and supplant designer intent, eventually resulting in over-reliance instead of developing programming skills. In this paper, we explore how LLM-based social-robot-programming tools can support novice robot programmers through a Research through Design (RtD) process. We designed and prototyped Robo-Blocks, a block-based programming environment that leverages LLMs to offer novice robot programmers generative scaffolding through structured narratives that connect high-level ideas to executable robot behaviors. Through deployment with novices, we discovered emerging user personas and usage patterns for generative scaffolding and showed how this scaffolding shapes end-user design and programming strategies. We present design insights for the effective use of generative scaffolding and its integration into the practice of social-robot programming.

[HC-11] Learning to Assign Prediction Tasks to Agents with Capacity Constraints

Link: https://arxiv.org/abs/2605.27999
Authors: Shang Wu, Saatvik Kher, Padhraic Smyth
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we focus on the sequential learning of agent expertise and assignment policies where each agent is constrained to handle a fraction of tasks. We provide a general theoretical characterization of this problem in terms of agent capacities, differences in agent expertise, and task context. We then develop a framework of sequential explore-exploit policy-learning algorithms that seek to maximize overall performance. Experimental results over a variety of tabular, image, and text prediction tasks demonstrate systematic gains from our policy-learning algorithms relative to non-contextual baselines across different types of agents, including LLMs and humans.
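To give a feel for the problem setting in the abstract above, here is a minimal sketch of sequential explore-exploit assignment under capacity constraints. It is not the authors' algorithm: it is a non-contextual epsilon-greedy toy, closer in spirit to the baselines the abstract compares against, and the function name, agent names, accuracies, and capacities are all assumptions made for illustration.

```python
import random


def assign_tasks(n_tasks, true_acc, capacity, epsilon=0.1, seed=0):
    """Epsilon-greedy task assignment where each agent may handle at most
    a fixed fraction of the tasks (non-contextual toy, hypothetical setup)."""
    rng = random.Random(seed)
    agents = list(true_acc)
    # Capacity fraction -> maximum number of tasks per agent.
    budget = {a: int(round(capacity[a] * n_tasks)) for a in agents}
    used = {a: 0 for a in agents}   # tasks assigned so far
    wins = {a: 0 for a in agents}   # correct predictions so far
    correct = 0
    for _ in range(n_tasks):
        available = [a for a in agents if used[a] < budget[a]]
        if not available:
            break  # every agent has exhausted its capacity
        if rng.random() < epsilon:
            choice = rng.choice(available)  # explore
        else:
            # Exploit the best running accuracy; untried agents are
            # treated optimistically as having accuracy 1.0.
            choice = max(
                available,
                key=lambda a: wins[a] / used[a] if used[a] else 1.0,
            )
        outcome = rng.random() < true_acc[choice]  # simulated correctness
        used[choice] += 1
        wins[choice] += int(outcome)
        correct += int(outcome)
    return used, correct


# Hypothetical agents: an accurate but scarce human and an unconstrained LLM.
used, correct = assign_tasks(
    1000,
    true_acc={"human": 0.9, "llm": 0.75},
    capacity={"human": 0.3, "llm": 1.0},
)
print(used, correct)
```

The capacity cap is what makes the problem more than a standard bandit: once the stronger agent's budget is spent, the remaining tasks must go elsewhere. Deciding which tasks deserve that scarce budget is where the task context studied in the paper would come in.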

[HC-12] EyeSpy: Inferring Eye Gaze via Side-Channel Attacks Against Foveated Rendering

Link: https://arxiv.org/abs/2605.27939
Authors: Paul Maynard, Harris Amjad, Camila Molinares, Bo Ji, Brendan David-John
Categories: Human-Computer Interaction (cs.HC)
Comments: 20 pages, 12 figures. Accepted to the 47th IEEE Symposium on Security and Privacy (IEEE S&P 2026). Artifacts: this https URL

Abstract:While eye tracking provides valuable capabilities for virtual reality, such as gaze interaction and dynamic foveated rendering (DFR), eye-tracking data can inadvertently reveal sensitive user information if not properly protected. Current protections, such as adding permission prompts or gatekeeping gaze data, are insufficient on DFR-enabled systems because gaze data is used internally to drive DFR. When DFR is implemented, objects in the fovea (i.e., immediate gaze area) incur a higher GPU workload than those in the periphery. This gaze-contingent workload creates a novel side channel, which can be leveraged to reconstruct gaze positions. Specifically, we design a novel attack that sweeps imperceptible high-cost objects (HCOs) across the user’s field of view and logs rendering performance metrics (e.g., frame rate or frame time) commonly exposed through standard game engines. Then, we correlate variation in these metrics (caused by HCO-foveal overlap) with the known HCOs’ positions to infer gaze coordinates directly without using eye-tracking APIs. Our experimental results show that mean gaze prediction errors (1.1-4.4 degrees) across the Meta Quest Pro, Varjo XR-4, and desktop platforms are comparable to typical eye-tracker accuracy. We demonstrate that the attack generalizes across various hardware platforms, standard game engines, and foveated rendering pipelines. Finally, we design defense mechanisms based on supervised and unsupervised detectors that can flag the attack reliably (F1 of 0.99) over short time windows.
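The correlation step described in the abstract above can be illustrated with a small noise-free simulation. This is a sketch under strong assumptions, not the paper's attack: the one-dimensional sweep, the foveal radius, the frame-time values, and both function names are invented for illustration. It shows only why a gaze-dependent rendering cost leaks the gaze position.

```python
def simulate_frame_time(hco_deg, gaze_deg, fovea_radius_deg=5.0,
                        base_ms=11.0, extra_ms=3.0):
    """Toy renderer: the high-cost object (HCO) adds frame time only
    when it overlaps the foveal region around the gaze point."""
    overlaps = abs(hco_deg - gaze_deg) <= fovea_radius_deg
    return base_ms + (extra_ms if overlaps else 0.0)


def infer_gaze(positions_deg, frame_times_ms):
    """Estimate gaze as the mean HCO position among samples whose frame
    time lies above the midpoint between the fastest and slowest frame."""
    threshold = (min(frame_times_ms) + max(frame_times_ms)) / 2
    hot = [p for p, t in zip(positions_deg, frame_times_ms) if t > threshold]
    return sum(hot) / len(hot)


true_gaze = 12.0                                  # hidden from the "attacker"
sweep = [float(d) for d in range(-40, 41, 2)]     # HCO positions in degrees
times = [simulate_frame_time(p, true_gaze) for p in sweep]
estimate = infer_gaze(sweep, times)
print(estimate)
```

A real pipeline would have to cope with noisy frame times, two-dimensional sweeps, and a moving gaze, none of which this toy models.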

[HC-13] Show Don't TELL: Explainable AI-Generated Text Detection

Link: https://arxiv.org/abs/2605.27921
Authors: Aldan Creo, Suraj Ranganath
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that builds in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the “tells” by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector’s decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

[HC-14] Local Privacy Laws in a Globalized World

Link: https://arxiv.org/abs/2605.27801
Authors: Shantanu Sharma, Ethan Myers, Lorenzo De Carli, Ritwik Banerjee, Indrakshi Ray
Categories: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: Accepted in ACM Conference on Data and Application Security and Privacy (CODASPY) 2026

Abstract:Personal data has emerged as a highly valuable yet sensitive asset that drives business decisions, enables targeted advertising, and generates substantial revenue for companies, while simultaneously facilitating invasive monitoring of users. In recent years, research on digital privacy violations, including undue access, collection, and sharing of user data, has grown significantly. Much of this research adopts the European General Data Protection Regulation (GDPR) as the primary reference framework. This is reasonable, as GDPR was a pioneering legislation, and many of its stipulations are clear and unambiguous. However, we argue that focusing solely on GDPR (and a small set of other Western regulatory frameworks) ignores privacy-related concerns, attitudes, and problems faced by users from other locales, creating a significant research blind spot. This work systematically normalizes the heterogeneous legal requirements of multiple data protection laws into a unified abstraction aligned with the data lifecycle, which forms the foundation for the implementation of such regulations. We further investigate the implications of these laws on different stakeholders, including users, organizations, and governments. Overall, this work aims to broaden the digital privacy research community’s perspective and to serve as a set of guiding principles for developing technological privacy solutions spanning multiple countries.

[HC-15] Chameleon Clippers: A Tool for Developing Fine Motor Skills in Remote Education Settings

Link: https://arxiv.org/abs/2605.27749
Authors: Gennie Mansi, Ashley Boone, Sue Reon Kim, Jessica Roberts
Categories: Human-Computer Interaction (cs.HC)
Comments: 4 pages, 1 figure, this https URL

Abstract:Art education plays a significant role in K-2 learners’ physical and cognitive development. However, teachers struggle to translate in-person activities to remote settings and to give necessary feedback to help learners develop fine motor skills. Previous research shows the benefits of tangible technology and real-time system feedback for supporting teachers and students in digital environments, but little research explores their affordances for remote art education. We developed Chameleon Clippers: interactive scissors that give real-time feedback to learners as they cut along a line. In preliminary tests, learners felt engaged and responded to feedback, enjoying their experience. Our low-cost design augments existing classroom artifacts and practices, supporting classroom integration. Testing also revealed directions for future study, including the frequency of feedback and assimilation into a broader, art education platform. Through our study, we demonstrate the potential for tangible technology to create more interactive, engaging, and supportive remote K-2 learning experiences.

[HC-16] Explanations as Dialogues: Toward Human-Centered Conversational Explainable AI

Link: https://arxiv.org/abs/2605.27666
Authors: Niharika Mathur, Smit Desai
Categories: Human-Computer Interaction (cs.HC)
Comments: To be published in the ACM Conversational User Interfaces (CUI) '26 Conference as a Provocation

Abstract:As AI systems become increasingly conversational, a gap emerges wherein explanations are studied as static artifacts, yet in practice, are experienced as dialogue. In this provocation, we argue that the conversational layer around an explanation is not incidental to its effectiveness, but a critical constituent. Drawing on three illustrative scenarios, we invite the CUI community to study explanations as interactive, conversational exchanges shaped by timing, tone, persona and conversational history, and introduce our vision for Human-Centered Conversational XAI (HC2XAI).

[HC-17] Structuring Human-AI Productive Interdependence by Strategic Level of Automation Selection for Qualitative Inquiry

Link: https://arxiv.org/abs/2605.27634
Authors: Feng Zhou, Jacqueline Meijer-Irons, Ambar Murillo
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:While Large Language Models (LLMs) offer a solution to the scale-versus-depth dilemma in qualitative analysis, the paradigm of maximizing automation is fundamentally at odds with the interpretive nature of qualitative inquiry. We argue that effective Human-AI collaboration is not an automation problem, but an interdependence problem. This paper reframes the design of “co-data” systems through the lens of Interdependence Theory, proposing a formal framework to structure human-AI productive interdependence. The framework guides the selection of an appropriate Level of Automation (LoA) for different stages of the qualitative analysis process by assessing task risk and the cost of validation. We present a case study where this framework led to a deliberately interdependent workflow, fostering the calibrated trust necessary for rigorous analysis. We conclude by presenting three design principles that instantiate this framework, demonstrating how to leverage AI as a powerful partner while preserving the human researcher’s irreplaceable role in the transformation process of meaning-making.

[HC-18] What Catches the Eye? A Conjoint Study of Infographic Design Preferences

Link: https://arxiv.org/abs/2605.27554
Authors: Amit Kumar Das, Karanbir Pelia, Manav Nitesh Ukani, Klaus Mueller
Categories: Human-Computer Interaction (cs.HC)
Comments:

Abstract:Infographic designers balance many choices at once: chart type, color, and whether to add a benchmark or a scale. Past work studies these factors one at a time, so we know little about how readers weigh them against each other. We address this gap with a choice-based conjoint study (N = 65) in which participants viewed pairs of infographics on a mock newspaper page about unemployment. Each infographic varied across three attributes: comparison type (none, US average, percentage scale), color (red, blue), and graphic type (single icon, icon series, bar chart). Comparison type drove most of the preference variation (58.5%), followed by graphic type (29.2%) and color (12.3%). Readers favored percentage scale markers and benchmark comparisons; color had no practical effect. The percentage scale level adds axis information rather than a benchmark, so the comparison type result mixes two distinct ideas. A single topic and a narrow palette also limit external validity. We argue that conjoint analysis is a practical and underused tool for studying visualization preferences across many design dimensions.
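The attribute-importance percentages in the abstract above are the kind of figure conjoint analysis usually derives from part-worth utilities: each attribute's importance is the range of its level utilities divided by the sum of ranges over all attributes. The sketch below applies that standard calculation to invented part-worths. The numbers are assumptions chosen for round results, not the study's estimates, and the abstract does not say that the authors used exactly this formula.

```python
# Hypothetical part-worth utilities per attribute level (invented values).
part_worths = {
    "comparison": {"none": -0.5, "us_average": 0.3, "percentage_scale": 0.7},
    "graphic": {"single_icon": -0.3, "icon_series": 0.0, "bar_chart": 0.3},
    "color": {"red": -0.1, "blue": 0.1},
}


def relative_importance(worths):
    """Importance of an attribute = its utility range / sum of all ranges,
    expressed as a percentage."""
    ranges = {
        attr: max(levels.values()) - min(levels.values())
        for attr, levels in worths.items()
    }
    total = sum(ranges.values())
    return {attr: 100.0 * r / total for attr, r in ranges.items()}


importance = relative_importance(part_worths)
for attr, pct in importance.items():
    print(f"{attr}: {pct:.1f}%")
```

Because importance depends only on the spread between an attribute's best and worst level, an attribute with a small range, such as color here, contributes little even when one level is consistently preferred.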

[HC-19] Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies

Link: https://arxiv.org/abs/2605.27546
Authors: Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP’s 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

[HC-20] Designing Augmented Reality for Preschoolers on the Move

Link: https://arxiv.org/abs/2605.27386
Authors: Supriya Khadka, Sanchari Das
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at CHI 2026 Workshop on Next Steps for Augmented Reality On-the-Move: Challenges & Opportunities

Abstract:Advancements in augmented reality (AR) technologies offer immense potential for mobile experiences. However, most commercial and educational AR systems assume a baseline of predictable user behavior and stationary interaction. Preschoolers and children in early childhood education, specifically ages 3 to 8, are naturally erratic, physically dynamic, and prone to rapid locomotion, making them the ultimate stress test for mobile spatial computing. Through a focused analysis of recent literature on physical activity and spatial learning in AR for preschoolers, this paper identifies points of friction in current mobile deployments. We highlight recurring failures in camera tracking during dynamic movement, physical safety hazards caused by screen-induced distraction, spatial crowding around physical markers, and the privacy risks of continuous environmental surveillance. To address these challenges, we propose AnchorPlay AR, a conceptual prototype for a privacy-preserving, audio-first spatial application. By explicitly separating locomotion from visual tracking, AnchorPlay AR uses audio cues to safely guide movement and reserves visual augmentation for stationary moments, offering a safer framework for preschoolers in constant motion.

[HC-21] From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

Link: https://arxiv.org/abs/2605.27384
Authors: Danai Korre
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages, 5 figures, ACM CHI 2026 workshop paper

Abstract:This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data was analyzed using paired t-tests, repeated measures ANOVA and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for Human-Agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

[HC-22] The Alignment Floor: When Persona Customization Is Safe

Link: https://arxiv.org/abs/2605.27382
Authors: Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:A key promise of pluralistic AI is behavioral adaptation: persona prompts like “be creative” or “be thorough” let systems respect diverse user values and communication styles. But how much customization can a model absorb before its alignment breaks? We present the first controlled study of the alignment-customization tradeoff, testing seven persona conditions across five tasks on two models with different alignment strengths (1,800 runs). We discover the alignment floor: on a strongly-aligned model (Claude Sonnet), persona prompts have zero effect on sycophancy – all conditions produce ~15%, a stable platform on which rich personalization is safe. On a weakly-aligned model (Nova Lite), the same personas shift sycophancy from 5% to 50% – the floor is absent and customization becomes a safety liability. Surprisingly, Agreeableness is not the worst offender; Extraversion (+20pp) and Openness (+15pp) cause greater degradation. The constructive finding is the Skeptic defense: a critical-thinking persona reduces sycophancy to 5% even on the weak model – the single largest effect in the study. Cross-model transfer of persona effects is near-zero (ρ = 0.006), meaning alignment testing must be per-model. We propose the alignment floor as a design principle: measure it before deploying persona customization, and layer safety-oriented personas underneath user-facing ones to enable personalization without compromising alignment.

[HC-23] Surprising Performances of Students with Autism in Classroom with NAO Robot

Link: https://arxiv.org/abs/2407.12014
Authors: Qin Yang, Huan Lu, Dandan Liang, Shengrong Gong, Huanghao Feng
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Robotics (cs.RO)
Comments:

Abstract:Autism is a developmental disorder that manifests in early childhood and persists throughout life, profoundly affecting social behavior and hindering the acquisition of learning and social skills in those diagnosed. As technological advancements progress, an increasing array of technologies is being utilized to support the education of students with Autism Spectrum Disorder (ASD), aiming to improve their educational outcomes and social capabilities. Numerous studies on autism intervention have highlighted the effectiveness of social robots in behavioral treatments. However, research on the integration of social robots into classroom settings for children with autism remains sparse. This paper describes the design and implementation of a group experiment in a collective classroom setting mediated by the NAO robot. The experiment involved special education teachers and the NAO robot collaboratively conducting classroom activities, aiming to foster a dynamic learning environment through interactions among teachers, the robot, and students. Conducted in a special education school, this experiment served as a foundational study in anticipation of extended robot-assisted classroom sessions. Data from the experiment suggest that ASD students in classrooms equipped with the NAO robot exhibited notably better performance compared to those in regular classrooms. The humanoid features and body language of the NAO robot captivated the students’ attention, particularly during talent shows and command tasks, where students demonstrated heightened engagement and a decrease in stereotypical repetitive behaviors and irrelevant minor movements commonly observed in regular settings. Our preliminary findings indicate that the NAO robot significantly enhances focus and classroom engagement among students with ASD, potentially improving educational performance and fostering better social behaviors.

[HC-24] I Hear Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

Link: https://arxiv.org/abs/2605.28064
Authors: Lelia Erscoi (1), Tomi Kinnunen (1) ((1) Computational Speech Group, University of Eastern Finland)
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: To be included in Odyssey 2026: The Speaker and Language Recognition Workshop, Session 4.2, 23-26 June, Lisbon, Portugal

Abstract:Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

Computer Vision

[CV-0] From Pixels to Words – Towards Native One-Vision Models at Scale

Link: https://arxiv.org/abs/2605.28820
Authors: Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu, Yue Zhu, Zhongang Cai, Weichen Fan, Linjun Dai, Silei Wu, Xuanyu Zheng, Mingxuan Li, Yuanhan Zhang, Bo Li, Hanming Deng, Huchuan Lu, Quan Wang, Lei Yang, Lewei Lu, Dahua Lin, Ziwei Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native “one-vision” architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: this https URL.

[CV-1] Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Link: https://arxiv.org/abs/2605.28816
Authors: Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski, Zian Wang, Xuanchi Ren
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
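The geometric idea behind representing agents as vertices of a regular simplex can be checked in a few lines. The sketch below builds a regular simplex with a textbook construction (standard basis vectors minus their centroid) and confirms that every pair of vertices is equally far apart, which is the property that makes all agents interchangeable. How the paper maps such vertices into rotary angle space is not described in the abstract, so this is not the authors' encoding, and the function names are assumptions.

```python
import math
from itertools import combinations


def simplex_vertices(n_agents):
    """Vertices of a regular simplex: the n standard basis vectors of R^n,
    each shifted by the common centroid so the set is centred at the origin."""
    centroid = 1.0 / n_agents
    return [
        [(1.0 if i == j else 0.0) - centroid for j in range(n_agents)]
        for i in range(n_agents)
    ]


def pairwise_distances(vertices):
    """Euclidean distance between every unordered pair of vertices."""
    return [math.dist(u, v) for u, v in combinations(vertices, 2)]


# Four agents: six unordered pairs, all at the same distance.
dists = pairwise_distances(simplex_vertices(4))
print(dists)
```

Because no vertex is geometrically privileged, relabelling the agents permutes the vertices without changing any pairwise relationship, which matches the permutation symmetry the abstract asks for.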

[CV-2] HarmoVid: Relightful Video Portrait Harmonization CVPR2026

Link: https://arxiv.org/abs/2605.28811
Authors: Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.

[CV-3] AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning ICML2026

Link: https://arxiv.org/abs/2605.28809
Authors: Zhen-Hao Xie, Yu-Cheng Shi, Da-Wei Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICML 2026. Code is available at this https URL

Abstract:Class-Incremental Learning (CIL) is important in building real-world learning systems. In CLIP-based CIL, the model performs classification by comparing similarity between visual and textual embeddings obtained from template prompts, e.g., "a photo of a [CLASS]". This seemingly monolithic matching process can be decomposed into two conceptually distinct stages: attribute extraction and attribute aggregation. For example, a model may recognize cat using attributes such as fur texture and whiskers. When learning a new class like car, the model must extract additional attributes like wheels and adjust how they are aggregated in the shared representation space. However, since only data from the current task is available, incremental updates can bias both attribute extraction and aggregation toward new classes, leading to catastrophic forgetting. Therefore, we propose AREA for attribute extraction and aggregation in CLIP-based CIL. To stabilize extraction, we anchor class-level visual and textual attributes on the hyperspherical embedding space via principal geodesic analysis. To stabilize aggregation, we learn lightweight task-specific experts with scoring and residual refinement, regularized by a variational information bottleneck objective. During inference, we perform routing over task attribute manifolds via optimal transport for more concise prediction. Experiments show that AREA consistently outperforms SOTA methods. Code is available at this https URL.

[CV-4] Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling

Link: https://arxiv.org/abs/2605.28803
Authors: Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang, Ziyu Zhao, Yufei Cui, Xiao-Wen Chang, Peng Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at this https URL.
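
To make the rotation-before-quantization idea in the abstract above concrete, here is a toy sketch that uses a plain orthonormal Hadamard rotation and symmetric per-tensor 4-bit rounding. It is not the paper's composite SVD-Hadamard rotation or its per-step scaling; the input vector and the helper names (`hadamard`, `quantize_int4`, `l2_error`) are illustrative assumptions.

```python
import math


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def quantize_int4(x):
    """Symmetric per-tensor 4-bit quantize-dequantize (integer range [-8, 7])."""
    scale = max(abs(v) for v in x) / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in x]


def l2_error(a, b):
    """Euclidean norm of the difference between two vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


x = [8.0, 0.1, -0.2, 0.3, -0.1, 0.2, 0.1, -0.3]  # toy vector with one outlier channel
h = hadamard(8)

plain_error = l2_error(x, quantize_int4(x))

rotated = matvec(h, x)  # the outlier's energy is spread across all channels
# A Sylvester Hadamard matrix is symmetric and orthogonal, so h is its own inverse.
restored = matvec(h, quantize_int4(rotated))
rotated_error = l2_error(x, restored)
```

For this particular vector the rotation lowers the largest magnitude, so the 4-bit grid is finer and the round-trip error is smaller than quantizing `x` directly. Other inputs may behave differently.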

[CV-5] Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions

Link: https://arxiv.org/abs/2605.28780
Authors: Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to the 49th German Conference on Artificial Intelligence (KI2026)

Abstract:Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at this https URL.

[CV-6] Self-Prophetic Decoding to Unlock Visual Search in LVLMs ICML2026

Link: https://arxiv.org/abs/2605.28741
Authors: Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026

Abstract:Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

[CV-7] SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Link: https://arxiv.org/abs/2605.28735
Authors: Hongyu Wen, Jia Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transparent objects are common in daily life, and it is important to understand their multilayer depth, including the transparent surface and the objects behind it. Existing methods for multilayer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art of multi-layer depth estimation, improving quadruplet relative depth accuracy on LayeredDepth benchmark from 61.34% to 70.09%. Code is available at this https URL.
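
The permutation-invariance property mentioned in the abstract above can be illustrated with a much simpler stand-in than the paper's point-process likelihood: a brute-force matching loss that takes the smallest L1 error over all orderings of the predicted layers. The function name and the depth values are hypothetical, and the sketch assumes both lists have the same length.

```python
import itertools


def permutation_invariant_l1(predicted, observed):
    """Smallest total absolute error over all orderings of the predicted layers."""
    return min(
        sum(abs(p - o) for p, o in zip(ordering, observed))
        for ordering in itertools.permutations(predicted)
    )


# Two depth layers along one camera ray, predicted in either order.
front_to_back = permutation_invariant_l1([1.2, 3.4], [1.0, 3.5])
swapped = permutation_invariant_l1([3.4, 1.2], [1.0, 3.5])
```

Both calls return the same value, so the loss does not penalize the model for which output slot it assigns to which surface.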

[CV-8] OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

Link: https://arxiv.org/abs/2605.28691
Authors: Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

[CV-9] EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection

Link: https://arxiv.org/abs/2605.28630
Authors: Xinyu Zhao, Qingyun Sun, Jiayi Luo, Jianxin Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Zero-Shot Anomaly Detection (ZSAD) aims to detect anomalies in unseen domains without target-domain adaptation. Recent CLIP-based methods have shown promising performance by leveraging prompt learning and visual-text alignment. However, most existing approaches rely on a single adaptation pathway, which may be insufficient for heterogeneous anomaly patterns across domains. In practice, anomalies exhibit vastly different characteristics, ranging from salient, localized structural disruptions to subtle, diffuse, and irregular variations. To address this challenge, we propose EntroAD, a structural entropy-guided zero-shot anomaly detection framework. Unlike previous methods, EntroAD introduces a dynamic routing mechanism to process different types of anomalies with specialized adaptation strategies. Specifically, we estimate patch-level structural entropy from self-attention-induced patch relations and use it as a proxy for relational uncertainty to guide anomaly-aware token routing. Based on this routing signal, we construct anomaly-aware routed tokens to better capture anomaly cues with different structural characteristics. We further introduce a confidence-aware dual-branch prompt adaptation module to stabilize visual-text alignment while preserving CLIP’s transferable prior. Extensive experiments on 10 industrial and medical benchmarks show that EntroAD achieves state-of-the-art performance in challenging cross-dataset ZSAD settings.

[CV-10] A Multiscale Kinetic Framework for Image Segmentation: From Particle Systems to Continuum Models

Link: https://arxiv.org/abs/2605.28619
Authors: Horacio Tettamanti, Giulia Guicciardi, Mattia Zanella
Categories: Computer Vision and Pattern Recognition (cs.CV); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: 26 pages, 34 figures

Abstract:In this work, we present a multiscale kinetic framework for consensus-based image segmentation. By interpreting an image as a system of interacting particles, each pixel is characterised by its spatial position and an internal feature encoding color information. We introduce a coupled interaction scheme governing the evolution of particles in both position and feature spaces, from which we derive a kinetic formulation for the particle density in the space-feature domain combining transport, aggregation, and diffusion effects. Furthermore, through a suitable scaling, we obtain a first-order macroscopic model describing the evolution of the fraction of pixels carrying information on the fraction of pixels having a certain feature. Based on this reduced-complexity model, we present a data-oriented approach where we make use of particle-based optimisation techniques for the accurate segmentation of images. Numerical tests show the effectiveness of the proposed framework and its robustness under different noise conditions.

[CV-11] Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization

Link: https://arxiv.org/abs/2605.28615
Authors: Zhuohan Liu, Wujian Peng, Yitong Chen, Zuxuan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) remains challenging. To address this, we propose BiDPO, a framework to enhance the compositional text-to-image generation capability of T2I models. We begin by introducing a carefully designed pipeline to construct a large-scale preference dataset, BiComp, with strict quality control. Then, we extend Diffusion DPO to jointly optimize image and text preferences, which proves highly effective at improving the models' ability to follow complex text prompts during generation. To further enhance the models for fine-grained alignment, we employ a region-level guidance method to focus on regions relevant to compositional concepts. Experimental results demonstrate that our BiDPO substantially improves compositional fidelity, consistently outperforming prior methods across multiple benchmarks. Our approach highlights the potential of preference-based fine-tuning for complex text-to-image tasks, offering a flexible and scalable alternative to existing techniques.

[CV-12] JECA2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models

Link: https://arxiv.org/abs/2605.28609
Authors: Jiachen Qian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 37 pages, 6 figures. Includes supplementary material

Abstract:Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model’s binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.

[CV-13] Internally Referenced Low-Light Enhancement

Link: https://arxiv.org/abs/2605.28605
Authors: Peiyuan He, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised low-light image enhancement (LLIE) is highly appealing as it eliminates the reliance on external paired data. However, the lack of external references causes networks to struggle with decoupling entangled illumination, delicate textures, and amplified noise. To resolve this challenge, we propose an Internally Referenced LLIE framework that extracts reliable physical and structural references from the degraded input image itself. First, we introduce a local exposure-simulated scheme to extract a low-frequency pseudo ground-truth. This serves as an internal physical reference to guide global illumination estimation and correct color casts. Second, we propose a dual-domain preservation strategy with spatial and spectral constraints to construct internal structural references. Specifically, an Illumination-Aligned Perceptual loss preserves global structures under illumination shifts, while a Shift-Invariant Spectral Correlation loss captures fine-grained local structures and suppresses high-frequency noise. Finally, we propose a Gain-Adaptive Feature Modulation (GAFM) mechanism to address highly spatially-variant residual noise. By transforming the self-estimated illumination map into an internal spatial gain prior, GAFM dynamically guides a blind-spot network for spatially-aware denoising. Extensive experiments demonstrate that our method achieves state-of-the-art performance, delivering superior noise suppression and textural fidelity. Code will be publicly released at this https URL.

[CV-14] Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

Link: https://arxiv.org/abs/2605.28604
Authors: Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang, Xin Xu, Mang Ye
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at this https URL.

[CV-15] Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation CVPR2026

Link: https://arxiv.org/abs/2605.28587
Authors: Yang Gao, Wuyang Li, Po-Chien Luan, Alexandre Alahi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding. The code is publicly available: this https URL

[CV-16] Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

Link: https://arxiv.org/abs/2605.28551
Authors: Yanwen Huang, Lok Ming Lui, Gary P. T. Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.

[CV-17] GEM: Generative Supervision Helps Embodied Intelligence

Link: https://arxiv.org/abs/2605.28548
Authors: Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at this https URL

[CV-18] DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Link: https://arxiv.org/abs/2605.28544
Authors: Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

[CV-19] Janus-LoRA: A Balanced Low-Rank Adaptation for Continual Learning

Link: https://arxiv.org/abs/2605.28495
Authors: Cheng Chen, Pengpeng Zeng, Yuyu Guo, Lianli Gao, Hengtao Shen, Jingkuan Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, International Conference on Machine Learning

Abstract:Low-Rank Adaptation (LoRA) has emerged as a promising paradigm for Continual Learning. It independently updates its low-rank factors (A and B), creating a composite update to the full weight matrix through their interaction. To prevent catastrophic forgetting, this update should remain orthogonal to the task-specific subspace that contains previously learned knowledge. However, we identify that this composite update systematically violates this orthogonality, reintroducing interference and undermining stability. Furthermore, naively enforcing this orthogonality compromises plasticity, disrupting the delicate stability-plasticity trade-off. To resolve these issues, we propose Janus-LoRA, a framework that restores this balance through two novel components. Specifically, we first introduce Gradient Rectification, a closed-form solution that mathematically decouples LoRA’s factor updates, enforcing orthogonality against the historical knowledge subspace identified by an efficient Online Estimation. Next, to enhance plasticity, we introduce a Decoupled Margin Loss that promotes feature-level separation by pushing new feature representations away from old ones, thus creating distinct, low-interference regions for new learning. Comprehensive experiments on challenging benchmarks demonstrate that by harmonizing parameter-level orthogonality with feature-level separation, Janus-LoRA achieves a superior balance and establishes new state-of-the-art performance.
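
The "composite update" of the two LoRA factors can be written out in one line. The expansion below assumes the common LoRA convention $W = W_0 + BA$ with factor steps $\Delta A$ and $\Delta B$; this notation is an assumption, since the abstract only names the factors A and B and does not give the authors' derivation.

```latex
\begin{aligned}
W        &= W_0 + BA, \\
\Delta W &= (B + \Delta B)(A + \Delta A) - BA \\
         &= \Delta B\,A + B\,\Delta A + \Delta B\,\Delta A .
\end{aligned}
```

Each term couples one factor's step with the other factor (or with the other factor's step), so a constraint placed on $\Delta A$ or $\Delta B$ separately does not automatically carry over to $\Delta W$.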

[CV-20] DiscoForcing: A Unified Framework for Real-Time Audio-Driven Character Control with Diffusion Forcing ICML2026

Link: https://arxiv.org/abs/2605.28491
Authors: Kaiyang Ji, Bingsheng Qian, Binghuan Wu, Kangyi Chen, Ye Shi, Jingya Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026

Abstract:We study real-time audio-responsive character control as a deployment-faithful problem: strictly causal, bounded-latency streaming that must generate coherent full-body motion at interactive frame rates while the audio condition can change abruptly, including tempo shifts, drops, or user edits. Prior music-to-motion systems are largely optimized for offline generation with global context, and degrade in streaming rollouts where conditioning history becomes stale or unreliable. We introduce DiscoForcing, a streaming audio-driven diffusion framework that combines a causal music encoder that captures rhythmic structure and phase dynamics with a diffusion-forcing sequence model trained under heterogeneous noise levels across the temporal horizon. Building on this, we design a hybrid temporal schedule and a history-guided streaming sampler to explicitly trade off responsiveness against long-horizon consistency under non-stationary audio. Implemented in an end-to-end real-time interactive system with online avatar playback and humanoid deployment workflows, DiscoForcing delivers more stable long-horizon rollouts and sharper audio-motion alignment than prior baselines under matched causality and latency constraints while maintaining real-time throughput.
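
A minimal sketch of what "heterogeneous noise levels across the temporal horizon" can look like in training data: each frame of a sequence receives its own independently drawn noise level. The linear blend between the clean value and Gaussian noise, the function name, and the one-dimensional "motion feature" are all illustrative assumptions, not details from the paper.

```python
import random


def add_per_frame_noise(frames, rng):
    """Corrupt each frame value with its own independently drawn noise level."""
    levels = [rng.random() for _ in frames]  # one level in [0, 1) per frame
    noisy = [
        (1.0 - level) * frame + level * rng.gauss(0.0, 1.0)
        for frame, level in zip(frames, levels)
    ]
    return levels, noisy


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean_motion = [0.0, 0.5, 1.0, 0.5, 0.0]  # hypothetical 1-D motion feature per frame
levels, noisy_motion = add_per_frame_noise(clean_motion, rng)
```

A model trained on such sequences sees lightly and heavily corrupted frames side by side, which is what allows recent frames to be kept nearly clean while future frames are still noisy during streaming rollout.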

[CV-21] SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Link: https://arxiv.org/abs/2605.28490
Authors: Jiawei Li,Ziyi Liu,Weijie Shi,Long Chen,Jiajie Xu,Xiaofang Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

[CV-22] SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

Link: https://arxiv.org/abs/2605.28477
Authors: Changxuan Li,Nadine Berner,Nassir Navab,Federico Tombari,Stefano Gasperini
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE RA-L 2026

Abstract:Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth and a pose network. Despite abundant research on improving the depth network, efforts on the pose network remain limited. In this context, even when depth is estimated up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth nets. Then, we introduce SA4Depth, an approach to improve this alignment and boost the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refine the pose estimates by reducing feature alignment residuals. With our method, the scene scales estimated by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at this https URL.

[CV-23] REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection

Link: https://arxiv.org/abs/2605.28459
Authors: Jun Zhou,Bingwen Hu,Yaxiong Wang,Zhedong Zheng,Yongzhen Wang,Yuchen Zhang,Ping Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures

Abstract:Multimodal manipulation detection aims to simultaneously identify forged image–text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image–text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at this https URL.

[CV-24] Diffusion Large Language Models for Visual Speech Recognition

Link: https://arxiv.org/abs/2605.28456
Authors: Jeong Hun Yeo,Chae Won Kim,Hyeongseop Rha,Yong Man Ro
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: Code: this https URL

Abstract:Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5% on LRS3 using only its labeled training data.
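
The confidence-based unmasking described above can be sketched as a simple loop: start fully masked, commit the most confident position, then re-predict with the committed tokens as context. This is a toy illustration only; `toy_predict` is a hypothetical stand-in for a real denoiser, and its target sentence and confidence values are made up for the example.

```python
MASK = None

def toy_predict(tokens):
    """Hypothetical denoiser: returns a (token, confidence) pair per position."""
    target = ["the", "cat", "sat", "down"]
    base_conf = [0.9, 0.4, 0.7, 0.5]
    committed = sum(t is not MASK for t in tokens)
    # Toy behaviour: confidence rises as more context has been committed.
    return [(target[i], min(1.0, base_conf[i] + 0.1 * committed))
            for i in range(len(tokens))]

def decode(length, per_step=1):
    """Iteratively unmask the `per_step` most confident masked positions."""
    tokens = [MASK] * length
    order = []
    while any(t is MASK for t in tokens):
        preds = toy_predict(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
            order.append(i)
    return tokens, order

tokens, order = decode(4)
print(tokens, order)
```

The commit order here is driven by confidence rather than position, which is the contrast with left-to-right autoregressive decoding that the abstract draws.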

[CV-25] BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers WWW

Link: https://arxiv.org/abs/2605.28450
Authors: Jungwook Seo,Yoonsik Park,Changmin Lee,Sungyong Baik
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to The Web Conference 2026 (formerly WWW) as an Oral presentation

Abstract:Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, the raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and the web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset demands handling an imbalanced problem between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.

[CV-26] Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments

Link: https://arxiv.org/abs/2605.28442
Authors: Julia Hindel,Simon Bultmann,Houman Masnavi,Daniele Cattaneo,Abhinav Valada
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 16 Figures

Abstract:Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain this http URL mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of approximately 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.

[CV-27] Bayesian Gated Non-Negative Contrastive Learning ICML2026

Link: https://arxiv.org/abs/2605.28441
Authors: Peng Cui,Jiahao Zhang,Lijie Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as “blue sky”, are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Experimental results on ImageNet-100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at this https URL.

[CV-28] Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization CVPR2026

Link: https://arxiv.org/abs/2605.28428
Authors: Jungwook Seo,Minjeong Kim,Younkwan Lee,Seungho Shin,Sungyong Baik
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026

Abstract:Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity, i.e., the cost for a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves; the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.
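
To make the "score by update magnitude" idea concrete, here is a minimal sketch under an assumed energy, not the paper's exact formulation. For a query feature x and anchored normal features n_j with non-negative cosine affinities w_j, assume E(z) = ||z - x||^2 + lam * sum_j w_j ||z - n_j||^2. Setting the gradient to zero gives z* = (x + lam * sum_j w_j n_j) / (1 + lam * sum_j w_j), and the score is ||z* - x||. The function name, the fidelity term, and `lam` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def nonconformity_score(query, normals, lam=1.0):
    """Magnitude of the closed-form update that pulls `query` toward anchored normals."""
    weights = [max(cosine(query, n), 0.0) for n in normals]  # bipartite edges only
    total = sum(weights)
    dim = len(query)
    pulled = [sum(w * n[d] for w, n in zip(weights, normals)) for d in range(dim)]
    z = [(query[d] + lam * pulled[d]) / (1.0 + lam * total) for d in range(dim)]
    return math.sqrt(sum((z[d] - query[d]) ** 2 for d in range(dim)))

normals = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
typical = nonconformity_score([0.95, 0.05], normals)
odd = nonconformity_score([0.2, 1.0], normals)
print(typical, odd)
```

A query lying among the normal features barely moves, while an off-manifold query needs a large update. Note that in this toy form a query with zero affinity to every normal node would score 0, so the sketch shows the mechanism only and is not a usable detector by itself.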

[CV-29] VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Link: https://arxiv.org/abs/2605.28422
Authors: Qiaoru Li,Shaotian Liang,Jintao Chen,Haoran Sun,Yuxiang Cai,Jianwei Yin,Yankai Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.

[CV-30] EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering

Link: https://arxiv.org/abs/2605.28401
Authors: Jianchun Chen,Yinda Zhang,Rohit Pandey,Thabo Beeler,Marc Habermann,Christian Theobalt
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user’s motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar’s appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.

[CV-31] Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer's Prediction

Link: https://arxiv.org/abs/2605.28397
Authors: Alireza Moayedikia,Sara Fin,Alicia Troncoso Lora,Uffe Kock Wiil
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer’s Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer’s Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.
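
The gating idea, one weight per representation and a weighted sum, can be sketched as a softmax over gate logits. In the paper the gate is learned per patient; here the logits are simply passed in, and `gated_fusion`, the branch vectors, and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gated_fusion(branches, gate_logits):
    """Convex combination of same-sized branch features, weighted by the gate."""
    weights = softmax(gate_logits)
    dim = len(branches[0])
    fused = [sum(w * b[d] for w, b in zip(weights, branches)) for d in range(dim)]
    return fused, weights

# Three assumed branch outputs: structural change, cross-attention, concatenation.
branches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, weights = gated_fusion(branches, gate_logits=[2.0, 0.0, 0.0])
print(fused, weights)
```

A large logit on the first branch makes the fused feature lean toward the structural-change representation, mirroring the abstract's observation that the gate prioritizes explicit volumetric change.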

[CV-32] Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

Link: https://arxiv.org/abs/2605.28394
Authors: Gaurav Rai,Ojaswa Sharma
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
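
Linear blend skinning, which the abstract uses to propagate skeletal transformations to the mesh, places each vertex at the weighted sum of its positions under every bone's transform. A minimal 2D sketch follows; the two-bone setup, pivot, and weights are made-up values for illustration.

```python
import math

def rotate_about(point, pivot, angle):
    """Rotate a 2D point around a pivot by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)

def blend_skin(vertex, bone_transforms, weights):
    """Linear blend skinning: weighted sum of the vertex under each bone transform."""
    moved = [t(vertex) for t in bone_transforms]
    x = sum(w * p[0] for w, p in zip(weights, moved))
    y = sum(w * p[1] for w, p in zip(weights, moved))
    return (x, y)

def upper(p):
    return p  # parent bone stays still

def lower(p):
    return rotate_about(p, (1.0, 0.0), math.pi / 2)  # child bone bends 90 degrees

skinned = blend_skin((2.0, 0.0), [upper, lower], [0.25, 0.75])
print(skinned)
```

Because the result is linear in the bone transforms, gradients can flow from the deformed mesh back to the skeleton parameters, which is what makes this kind of pipeline suitable for optimization.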

[CV-33] Bound-Constrained Sparse Representation for Electrical Impedance Tomography

Link: https://arxiv.org/abs/2605.28392
Authors: Chun Zhang,Dong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.

[CV-34] Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models ICIP2026

Link: https://arxiv.org/abs/2605.28348
Authors: Corentin Seutin,Mohamed Amine Ettaki,Michaël Clément,Pierrick Coupé,Rémi Giraud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)

Abstract:Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

[CV-35] Transfer learning RGB models to hyperspectral images with trainable tensor decompositions

Link: https://arxiv.org/abs/2605.28331
Authors: Mariette Schönfeld,Laurens Devos,Wannes Meert,Hendrik Blockeel
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Transfer learning makes it possible to use large vision networks on a variety of domains, by specializing their models’ general filters to new tasks. However, these networks assume the input images to have 3 input channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image, or the model. This work proposes a novel approach that preserves the image and spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.
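
The spatial/spectral split can be illustrated with the simplest possible case, a rank-1 separable filter where filter[c][i][j] = spectral[c] * spatial[i][j]. The paper's actual decomposition is not specified in the abstract, so this is only a sketch of the idea: keep the spatial factor fixed and swap the 3-entry spectral factor for a longer trainable one. The kernel values, the 8-band count, and the initialization are assumptions.

```python
def compose_filter(spectral, spatial):
    """Rank-1 separable filter: one scaled copy of the spatial kernel per channel."""
    return [[[s * v for v in row] for row in spatial] for s in spectral]

# Spatial factor, kept frozen (stands in for a pattern from a pretrained RGB filter).
spatial = [[-1.0, 0.0, 1.0],
           [-2.0, 0.0, 2.0],
           [-1.0, 0.0, 1.0]]

rgb_spectral = [0.3, 0.6, 0.1]                # original 3-channel factor
hyper_spectral = [sum(rgb_spectral) / 8] * 8  # hypothetical init for 8 bands, to be trained

rgb_filter = compose_filter(rgb_spectral, spatial)
hyper_filter = compose_filter(hyper_spectral, spatial)
print(len(rgb_filter), len(hyper_filter))
```

Only `hyper_spectral` would receive gradient updates in this picture, so the edge-like spatial pattern survives while the filter learns how to weight the new bands.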

[CV-36] Inpainting-Style Conditional Diffusion for Multivariable Time Series Forecasting

Link: https://arxiv.org/abs/2605.28324
Authors: Kourosh Kiani,S.M. Muyeen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we propose a novel conditional diffusion-based framework for multivariable time-series solar power forecasting. The proposed method reformulates temporal PV data as structured two-dimensional representations (images) using a sliding-window patch construction, enabling the application of Denoising Diffusion Probabilistic Models (DDPM) within a unified spatiotemporal learning paradigm. A key contribution of this work is the formulation of solar forecasting as an inpainting problem, where future time steps are treated as missing regions to be reconstructed. This is achieved through a mask-based conditional diffusion mechanism, in which historical observations are preserved as conditioning context while the target (future) region is progressively corrupted and subsequently recovered via reverse diffusion. The model learns to generate coherent future sequences conditioned on observed data, effectively performing time-series inpainting. To fully utilize all available features and ensure compatibility with U-Net architectural constraints, a zero-padding strategy is introduced to construct fixed-size inputs. The model is trained using a supervised denoising objective to predict injected noise, enabling accurate iterative reconstruction during the reverse process. Extensive experiments conducted on benchmark PV datasets, including GEFCom2014, demonstrate that the proposed approach achieves high forecasting accuracy, particularly for short-term horizons. The results highlight the effectiveness of integrating diffusion-based generative modeling with an inpainting formulation for robust, flexible, and high-fidelity solar power forecasting.
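
The mask-based corruption step can be sketched with the standard DDPM forward formula, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, applied only where the mask marks a future step. For brevity the window is a 1D list rather than the 2D patch the abstract describes, and `noise_future`, the sample values, and `alpha_bar` are illustrative assumptions.

```python
import math
import random

def noise_future(window, mask, alpha_bar, rng):
    """Apply DDPM forward noising to masked (future) entries; keep history clean."""
    noisy = []
    for value, is_future in zip(window, mask):
        if is_future:
            eps = rng.gauss(0.0, 1.0)
            noisy.append(math.sqrt(alpha_bar) * value
                         + math.sqrt(1.0 - alpha_bar) * eps)
        else:
            noisy.append(value)  # conditioning context is preserved as-is
    return noisy

window = [0.1, 0.4, 0.8, 0.6, 0.3, 0.1]              # toy PV readings
mask = [False, False, False, False, True, True]      # last two steps are the forecast
noisy = noise_future(window, mask, alpha_bar=0.5, rng=random.Random(0))
print(noisy)
```

During sampling the same mask would be used in reverse: the denoiser fills in the masked region while the history entries are reset to their observed values at each step.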

[CV-37] EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation ICRA2026

Link: https://arxiv.org/abs/2605.28312
Authors: Arianna Alonso Bizzi,Fernando Cladera,C. J. Taylor
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Accepted to the IEEE ICRA 2026 Workshop on Challenges and Opportunities of Neuromorphic Field Robotics and Automation

Abstract:Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm’s density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
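
The bin, shift, and count pipeline can be mimicked in software with integer bitmasks. The sketch below is a single-axis toy under assumed details: the abstract does not give the scoring rule, so each velocity hypothesis is scored here by the popcount of the shifted previous bin ANDed with the current bin. Function names, the bin length, and the event list are hypothetical.

```python
def bin_events(events, bin_us, width):
    """Turn (timestamp_us, x) events into one 1-bit occupancy bitmask per time bin."""
    bins = {}
    for t, x in events:
        if 0 <= x < width:
            bins[t // bin_us] = bins.get(t // bin_us, 0) | (1 << x)
    n = max(bins) + 1 if bins else 0
    return [bins.get(i, 0) for i in range(n)]

def shift(mask, v, width):
    """Shift occupancy by v pixels (positive = toward higher x), clipped to the row."""
    full = (1 << width) - 1
    return (mask << v) & full if v >= 0 else mask >> -v

def best_velocity(prev, curr, hypotheses, width):
    """Score each hypothesis by overlap count; uses only shifts, ANDs, and counters."""
    scores = {v: bin(shift(prev, v, width) & curr).count("1") for v in hypotheses}
    return max(hypotheses, key=lambda v: scores[v]), scores

# Toy edge moving +2 pixels per 1 ms bin.
events = [(100, 3), (400, 4), (1200, 5), (1700, 6)]
grids = bin_events(events, bin_us=1000, width=16)
best, scores = best_velocity(grids[0], grids[1], [-2, -1, 0, 1, 2], width=16)
print(best, scores)
```

In hardware, each hypothesis would map to its own shift-and-count lane evaluated in parallel; the Python loop is only a sequential stand-in for that structure.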

[CV-38] EchoAvatar: Real-time Generative Avatar Animation from Audio Streams SIGGRAPH2026

Link: https://arxiv.org/abs/2605.28272
Authors: Bohong Chen,Yumeng Li,Yinglin Xu,Youyi Zheng,Yanlin Weng,Kun Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SIGGRAPH 2026; Project Page: this https URL

Abstract:Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explore Reinforcement Learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with streaming audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art real-time baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at this https URL.

[CV-39] LV-OSD: Language-Vision-Complementary Open-Set Object Detection

Link: https://arxiv.org/abs/2605.28271
Authors: Yupeng Zhang,Ruize Han,Wei Feng,Song Wang,Liang Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Object detection is an important task in computer vision, which aims to detect the objects of interest through a given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts in testing. Extensive experimental results verify the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
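The abstract does not say how Prompt Random Masking is implemented. As a rough sketch under stated assumptions, one way to simulate text-only, image-only, and combined prompts during training is to drop a whole modality at random while never dropping both. The function name, the `p_drop` parameter, and the whole-modality masking granularity are hypothetical.

```python
import random


def random_prompt_mask(text_prompts, image_prompts, p_drop=0.3, rng=random):
    """Return (text, image) prompt lists with at most one modality dropped.

    Assumed scheme: with probability p_drop keep only image prompts, with
    probability p_drop keep only text prompts, otherwise keep both.
    p_drop must be at most 0.5 for the three cases to be well defined.
    """
    if not 0.0 <= p_drop <= 0.5:
        raise ValueError("p_drop must be in [0, 0.5]")
    r = rng.random()
    if r < p_drop:
        return [], list(image_prompts)       # image-only training step
    if r < 2 * p_drop:
        return list(text_prompts), []        # text-only training step
    return list(text_prompts), list(image_prompts)  # both modalities


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(random_prompt_mask(["a photo of a cat"], ["cat_01.jpg"], rng=rng))
```

Passing a seeded `random.Random` instance keeps the masking reproducible across runs.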

[CV-40] Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Link: https://arxiv.org/abs/2605.28270
Authors: Leonhard Sommer,Emil Akopyan,Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories - two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at this https URL.

[CV-41] MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations

Link: https://arxiv.org/abs/2605.28261
Authors: Leiyue Zhao,Tianyu Shi,Daniel Reisenbuchler,Xinzi He,Junchao Zhu,Tianyuan Yao,Yuechen Yang,Yanfan Zhu,Junlin Guo,Gelei Xu,Haichun Yang,Yuankai Huo,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at this https URL.

[CV-42] Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Link: https://arxiv.org/abs/2605.28257
Authors: Leonhard Sommer,Artur Jesslen,Basavaraj Sunagad,Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 14 pages, 4 figures. Data and code are publicly available at this https URL

Abstract:Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space – predicting, from a single image, 3D locations that remain consistent across instances within a category – and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at this https URL.

[CV-43] PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment

Link: https://arxiv.org/abs/2605.28241
Authors: Duanchu Wang,Cheng Li,Junjie Yang,Jing Huang,Zihang Cheng,Zhi Gao,ZhuBohong,Di Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

[CV-44] Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation

Link: https://arxiv.org/abs/2605.28239
Authors: Runlong Cao,Ying Zang,Chuanwei Zhou,Tianrun Chen,Tong Zhang,Zhen Cui,Chunyan Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 24 pages, 13 figures

Abstract:Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

[CV-45] POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation

Link: https://arxiv.org/abs/2605.28237
Authors: Ruiyan Gong,Meisheng Zhang,Yuxiang Zhao,Mingchao Sun,Yanfen Shen,Zedong Chu,Zhining Gu,Wei Guo,Xiaolong Cheng,Qiming Li,Kangning Niu,Yanqing Zhu,Xiaolong Wu,Tianlun Li,Mu Xu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: 25 pages, 9 figures

Abstract:Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical “final-meters” challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m^2 in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, where a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

[CV-46] Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm

Link: https://arxiv.org/abs/2605.28234
Authors: Feng Qiu,Zheng Fang,Shuhang Zhang,Kangjun Liu,Longkun Zou,Jing Liu,Ke Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.

[CV-47] Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation

Link: https://arxiv.org/abs/2605.28230
Authors: Mariam Hassan,Kaouther Messaoud,Wuyang Li,Alexandre Alahi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one’s own movement, Proprio treats the model’s flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator’s learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring, and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
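The best-of-N use of a self-scoring signal can be sketched as follows. This is only an illustration of the selection step: the residual values are taken as given, and the scoring rule (mean residual plus a weighted standard deviation, to favour "smaller and more stable" residuals) is an assumption rather than the paper's formula. All names are hypothetical.

```python
from statistics import mean, pstdev


def self_score(residuals, stability_weight=1.0):
    """Aggregate residual magnitudes over timesteps and perturbations.

    `residuals` is a list of rows (one per timestep), each holding one
    non-negative residual per perturbation. Lower scores are better.
    """
    flat = [r for row in residuals for r in row]
    return mean(flat) + stability_weight * pstdev(flat)


def best_of_n(candidates, residuals_for, stability_weight=1.0):
    """Return the candidate with the lowest aggregated self-score."""
    return min(
        candidates,
        key=lambda c: self_score(residuals_for(c), stability_weight),
    )


if __name__ == "__main__":
    # Made-up residuals for two sampled videos, for illustration only.
    table = {
        "clip_a": [[0.9, 1.3], [0.7, 1.1]],
        "clip_b": [[0.3, 0.3], [0.2, 0.4]],
    }
    print(best_of_n(list(table), table.__getitem__))
```

The motion-focused masking and gradient-based refinement described in the abstract are not covered by this sketch.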

[CV-48] VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer CVPR2026

Link: https://arxiv.org/abs/2605.28229
Authors: Rui Lin,Chuanming Wang,Huadong Ma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: CVPR2026 camera ready

Abstract:With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs’ temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at this https URL.

[CV-49] A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers

Link: https://arxiv.org/abs/2605.28217
Authors: Morgane des Ligneris,Nathan Painchaud,Allan Serva,Laurent Bertoletti,Pierre Croisille,Carole Frindel,Odyssée Merveille
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 11 pages + 2 pages of supplementary materials. Submitted to special issue of JBHI

Abstract:Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which categorizes 30-day mortality risk into three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores - such as Qanadli and Mastora - is time-consuming and rarely performed in clinical routine practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline’s ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

[CV-50] From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment

Link: https://arxiv.org/abs/2605.28176
Authors: Francisco Bérchez-Moreno,Riccardo Rosati,Maria Chiara Fiorentino,Víctor M. Vargas,Edoardo Cipolletta,Emilio Filippucci,Luca Romeo,Pedro A. Gutiérrez,César Hervás-Martínez
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren–Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
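To make the soft-labelling idea concrete, here is a minimal sketch of a triangular soft label and the matching cross-entropy loss. It assumes a simple symmetric triangle with a hypothetical `half_width` parameter, renormalised at the ends of the scale; the paper's exact triangular, binomial, beta, and exponential parameterisations are not given in the abstract and may differ.

```python
import math


def triangular_soft_label(grade, num_classes, half_width=2):
    """Unimodal target centred on `grade`, decaying linearly to zero.

    Weights are max(0, 1 - |j - grade| / half_width), normalised to sum
    to 1 (assumed formulation, for illustration only).
    """
    weights = [
        max(0.0, 1.0 - abs(j - grade) / half_width)
        for j in range(num_classes)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


if __name__ == "__main__":
    # Five classes, as in KL grades 0 to 4.
    target = triangular_soft_label(grade=2, num_classes=5)
    print(target)  # [0.0, 0.25, 0.5, 0.25, 0.0]
    near_miss = [0.05, 0.10, 0.20, 0.60, 0.05]  # mass on the adjacent grade
    far_miss = [0.60, 0.10, 0.20, 0.05, 0.05]   # mass two grades away
    print(soft_cross_entropy(target, near_miss) < soft_cross_entropy(target, far_miss))
```

Unlike a one-hot target, the soft target penalises a prediction on an adjacent grade less than one on a distant grade, which is the ordinal behaviour the abstract describes.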

[CV-51] FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

Link: https://arxiv.org/abs/2605.28174
Authors: Jorge L. Rodriguez,Victor Angulo Morales,Areej Alwahas,Mariana Elias Lara,Fida Mohammad Thoker,Kasper Johansen,Bernard Ghanem,Fernando T. Maestre,Matthew F. McCabe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 29 pages, 9 figures

Abstract:Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.

[CV-52] MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation

Link: https://arxiv.org/abs/2605.28173
Authors: Muyao Wang,Zeke Xie,Yanhao Chen,Lixin Xiu,Hideki Nakayama
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

[CV-53] DebFilter: Eradicating Biases Stashed in Value CVPR2026

Link: https://arxiv.org/abs/2605.28167
Authors: Seung Hyuk Lee,Songkuk Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 8 pages, 7 figures, supplementary material included, CVPR 2026

Abstract:Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases – such as those related to gender and age – that are subsequently propagated and amplified through the guidance mechanism, along with the model’s training on large-scale datasets that are imbalanced with respect to these bias-related concepts, often leading to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model’s error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to the slice of guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

[CV-54] MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment ICME

Link: https://arxiv.org/abs/2605.28161
Authors: Shurui Xu,Siqi Yang,Weiping Ding,Hui Wang,Mengzhen Fan,Yuyu Sun,Shuyan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by IEEE International Conference on Multimedia and Expo (ICME) 2026 (Oral Presentation)

Abstract:Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at this https URL.

[CV-55] Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning

Link: https://arxiv.org/abs/2605.28157
Authors: Po-Lun Chwang,Po-Yu Chang,Wen-Liang Lin,Tung-Sheng Wu,Min-Ching Wang,Yun-Chien Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.

[CV-56] A novel ordinal multi-view aggregation scheme for oak defoliation

Link: https://arxiv.org/abs/2605.28151
Authors: Francisco Bérchez-Moreno,Ricardo Enrique Hernández-Lambraño,David Guijo-Rubio,Víctor Manuel Vargas,Francisco José Ruiz-Gómez,Juan Carlos Fernández,Pablo González-Moreno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:

Abstract:Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
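A multi-view ensemble of this kind can be sketched by combining per-view class probabilities. The abstract does not state the aggregation rule, so the plain averaging below, the function names, and the example probabilities are all assumptions for illustration.

```python
def aggregate_views(view_probs):
    """Average class probabilities across views (e.g. north, south, crown).

    `view_probs` is a list of per-view probability vectors of equal length.
    Plain averaging is an assumed rule, not necessarily the paper's scheme.
    """
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    return [sum(p[c] for p in view_probs) / n_views for c in range(n_classes)]


def predict_level(probs):
    """Return the index of the most probable defoliation level."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # Made-up outputs of three single-view models for one tree.
    north = [0.1, 0.6, 0.3]
    south = [0.2, 0.3, 0.5]
    crown = [0.1, 0.3, 0.6]
    fused = aggregate_views([north, south, crown])
    print(predict_level(north), predict_level(fused))  # 1 2
```

In this toy case the north view alone predicts level 1, while the fused three-view prediction is level 2, showing how views can correct each other.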

[CV-57] No Safe Dose: How Training Data Drives Unsafe Image Generation

Link: https://arxiv.org/abs/2605.28137
Authors: Felix Friedrich,Lukas Helff,Niharika Hegde,Patrick Schramowski,Kristian Kersting
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Text-to-image models trained on large-scale data often ingest unsafe content. While some studies observe input-output amplification, it remains unclear whether and how training data composition, rather than other factors, directly drives model output safety. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). We then generate images with the resulting models and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates the other components, e.g. the frozen text encoder, as a residual safety risk. This is confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

[CV-58] SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

Link: https://arxiv.org/abs/2605.28136
Authors: Toomas Tahves,Mauro Bellone,Junyi Gu,Raivo Sell
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
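
For context on the mIoU figures quoted in this abstract, here is a minimal sketch of how mean Intersection-over-Union is commonly computed from flattened label maps. The function name `mean_iou` and the toy labels are hypothetical, and the paper's exact evaluation protocol (ignored classes, per-image versus dataset-level averaging) is not given in the abstract.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the target."""
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps: class 0 dominates, classes 1 and 2 are rare.
target = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
pixel_acc = sum(p == t for p, t in zip(pred, target)) / len(target)
```

Because every class contributes equally to the mean, errors on rare classes pull mIoU down far more than they affect pixel accuracy, which is why class imbalance matters for this metric.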

[CV-59] Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Link: https://arxiv.org/abs/2605.28132
Authors: Haozhan Shen,Tiancheng Zhao,Kangjia Zhao,Jianwei Yin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is here: this https URL

Abstract:Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in frozen representations from the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at this https URL.

[CV-60] CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes

Link: https://arxiv.org/abs/2605.28125
Authors: Vladislav Polianskii,Elijs Dima,Isabel Salmerón Marazuela,Gergő László Nagy,Sigurdur Sverrisson,Volodya Grancharov
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to multi-region of interest unbounded scenes to improve robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-caused variations. Results indicate superior performance of the proposed pipeline over the baseline NeRF models and established Structure from Motion (SfM) - Multi-View Stereo (MVS) solutions.

[CV-61] ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning

Link: https://arxiv.org/abs/2605.28119
Authors: Ziyi Wang,Zhengjie Zhang,Jingsheng Gao,Dahong Qian,Suncheng Xiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only individual colonoscopy images and do not fully exploit temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle these problems, we curate and release a labeled dataset specifically for colo-segment recognition. In addition, we propose a two-stage deep learning framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, a substantial improvement over prior state-of-the-art methods.

[CV-62] Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring

Link: https://arxiv.org/abs/2605.28100
Authors: Arthur Dérédel,Carlos Crispim-Junior,Pierre Lemaire,Johan Berthet,Laure Tougne Rodet
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint, 19 pages, 8 figures

Abstract:In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming those issues is essential to deploy them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components, and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to support this assessment. Through generalization experiments, we demonstrate that dense and semi-dense feature matching methods, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

[CV-63] Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

Link: https://arxiv.org/abs/2605.28091
Authors: Niantong Li,Guangzheng Hu,Weixu Qiao,Ying Ba,Qichen Hong,Shijun Shen,Jinlin Wang,Fan Zhou,Jianye Kang,Xin Shang,Ziyi He,Wei Wang,Dalin Li,Jiahao Li,Jie Zhang,Kaiyuan Gao,Kun Yan,Lihan Jiang,Ningyuan Tang,Shengming Yin,Tianhe Wu,Xiao Xu,Xiaoyue Chen,Yuxiang Chen,Yan Shu,Yanran Zhang,Yilei Chen,Yixian Xu,Zekai Zhang,Zhendong Wang,Zihao Liu,Zikai Zhou,Hongzhu Shi,Yi Wang,Bing Zhao,Hu Wei,Lin Qu,Chenfei Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users’ pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

[CV-64] VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking

Link: https://arxiv.org/abs/2605.28083
Authors: Jiyuan Fu,Kaixun Jiang,Jingkai Jia,Zhaoyu Chen,Xueyao Chen,Lingyi Hong,Shuyong Gao,Chenzhi Tan,Dingkang Yang,Wenqiang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm’s features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate “phantom embodiment”. By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent’s true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

[CV-65] CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning

Link: https://arxiv.org/abs/2605.28056
Authors: He Feng,Yongjia Ma,Donglin Di,Lei Fan,Tonghua Su
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

[CV-66] Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

Link: https://arxiv.org/abs/2605.28051
Authors: Landi He,Mingde Yao,Shawn Young,Lijian Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
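
To make the idea of score-conditioned, variance-preserving noise more concrete, the sketch below mixes each token with Gaussian noise using weights whose squares sum to one, then prunes by hard thresholding at inference. The specific mixing rule `sqrt(s) * x + sqrt(1 - s) * noise`, the threshold value, and the names `throttle` and `prune` are assumptions chosen for illustration; the abstract does not give DiffPrune's actual formulation.

```python
import math
import random
from typing import List, Sequence


def throttle(token: Sequence[float], score: float, rng: random.Random) -> List[float]:
    """Mix a token with unit Gaussian noise; score in [0, 1], higher keeps more signal.

    The weights satisfy a**2 + b**2 == 1, so a unit-variance input keeps
    unit variance in expectation (the assumed meaning of variance-preserving).
    """
    a = math.sqrt(score)
    b = math.sqrt(1.0 - score)
    return [a * x + b * rng.gauss(0.0, 1.0) for x in token]


def prune(tokens: Sequence[Sequence[float]], scores: Sequence[float],
          threshold: float = 0.5) -> List[Sequence[float]]:
    """Inference-time hard thresholding: keep tokens whose score passes the threshold."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]


rng = random.Random(0)
tokens = [[1.0, -2.0], [0.5, 0.5], [3.0, 0.0]]
scores = [1.0, 0.2, 0.7]  # hypothetical learned importance scores

noisy = [throttle(t, s, rng) for t, s in zip(tokens, scores)]  # training-time view
kept = prune(tokens, scores)                                   # inference-time view
```

In a real model the scores would come from a learned scorer and the mixing would run on tensors inside an autodiff framework, so gradients reach the scores through the continuous weights rather than through a discrete selection.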

[CV-67] Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales

Link: https://arxiv.org/abs/2605.28036
Authors: Myeongsoo Kim,Eunji Kim,Minwoo Chae,Sangwoo Mo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 18 figures

Abstract:Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier’s output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
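
For readers unfamiliar with the guidance scale discussed in this abstract, the sketch below shows the standard classifier-free guidance combination of unconditional and conditional noise predictions. It is the generic formula only, with a hypothetical helper name `cfg_combine` and toy vectors; StayFair's prompt-dependent shift of the null embedding is not reproduced here.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float], eps_cond: Sequence[float],
                scale: float) -> List[float]:
    """Standard classifier-free guidance: eps = eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-dimensional latent.
eps_u = [0.0, 1.0]   # prediction with the null (empty) prompt embedding
eps_c = [0.5, -1.0]  # prediction with the actual prompt

no_guidance = cfg_combine(eps_u, eps_c, scale=0.0)  # equals eps_u
plain_cond = cfg_combine(eps_u, eps_c, scale=1.0)   # equals eps_c
strong = cfg_combine(eps_u, eps_c, scale=3.0)       # extrapolates past eps_c
```

Scales above 1 extrapolate beyond the conditional prediction, which is the regime where the abstract reports that the guidance-induced component of bias becomes dominant.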

[CV-68] Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking CVPR2026

Link: https://arxiv.org/abs/2605.28018
Authors: Hongtao Yang,Bineng Zhong,Qihua Liang,Yaozong Zheng,Xiantao Hu,Yuanliang Xue,Shuxiang Song
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026 Highlight

Abstract:Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher’s capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher’s target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: this https URL

[CV-69] Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning

Link: https://arxiv.org/abs/2605.28016
Authors: James Grover,Andrew Phair,Michael Ferraro,David E.J. Waddington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:Ultra-low-field (ULF) MRI offers portable and low-cost imaging but suffers from poor image quality. To address this, we present our submission to the 2025 ULF Enhancement Challenge (ULF-EnC), where the goal is to synthesise high-field-like MRIs from 64 mT scans. Our pipeline enhances ULF MRI through a combination of anatomical conditioning and model ensembling. We first generate tissue segmentation priors using a Swin UNETR trained solely on challenge-provided data. These priors condition two independent enhancement networks, a CycleGAN and a transformer-based residual enhancement model (T-REX), each trained to synthesise 3 T-like MRIs. Outputs from both models are combined using a weighted average. Our approach produces enhanced MRIs that are comparable to high-field scans both quantitatively and qualitatively.

[CV-70] Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras

Link: https://arxiv.org/abs/2605.28011
Authors: Yudai Washida,Yuto Kase,Kai Ishibe,Ryoma Yasuda,Sakiko Hashimoto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 5 figures

Abstract:Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The bias (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
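
The bias and confidence-interval figures in this abstract come from a Bland-Altman analysis. The sketch below shows one common way to compute the bias, an approximate 95% CI for the bias, and the 95% limits of agreement from paired measurements. It uses the normal quantile 1.96 rather than a t-quantile, and the function name `bland_altman` and the toy data are hypothetical; the paper's exact procedure may differ.

```python
import math
import statistics
from typing import Sequence, Tuple

Interval = Tuple[float, float]


def bland_altman(method: Sequence[float],
                 reference: Sequence[float]) -> Tuple[float, Interval, Interval]:
    """Return (bias, approximate 95% CI of the bias, 95% limits of agreement)."""
    diffs = [m - r for m, r in zip(method, reference)]
    n = len(diffs)
    bias = statistics.fmean(diffs)       # mean difference between methods
    sd = statistics.stdev(diffs)         # sample standard deviation of differences
    half_ci = 1.96 * sd / math.sqrt(n)   # normal approximation, reasonable for large n
    ci = (bias - half_ci, bias + half_ci)
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    return bias, ci, loa


# Toy paired measurements (e.g. impact time in ms) from two systems.
event_camera = [10.0, 12.0, 11.0, 13.0]
high_speed = [9.0, 11.0, 11.0, 11.0]
bias, ci, loa = bland_altman(event_camera, high_speed)
```

A confidence interval for the bias that excludes zero indicates a systematic offset between the two methods, while the limits of agreement describe how far individual trials may differ.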

[CV-71] Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

Link: https://arxiv.org/abs/2605.27990
Authors: Seunghyeok Shin,Minwoo Kim,Dabin Kim,Hongki Lim
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL

Abstract:Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss–Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

[CV-72] ABot-OCR Technical Report

Link: https://arxiv.org/abs/2605.27978
Authors: Kaitao Jiang,Ruiyan Gong,Xiaolong Cheng,Kangning Niu,Tianlun Li,Mu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 11 figures, technical report

Abstract:We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

[CV-73] Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective

Link: https://arxiv.org/abs/2605.27962
Authors: Cong Xu,Pu Luo,Yumei Li,Boyou Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.

[CV-74] Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning

Link: https://arxiv.org/abs/2605.27960
Authors: Xuanzhao Dong,Wenhui Zhu,Peijie Qiu,Xiwen Chen,Xiaobing Yu,Xin Li,Zhipeng Wang,Shao Tang,Gen Li,Yujian Xiong,Hao Wang,Yanxi Chen,Prayag Tiwari,Yalin Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution “magnifying glass” agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show superior performance over recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

[CV-75] ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Link: https://arxiv.org/abs/2605.27959
Authors: Guannan Lv,Ren Nie,Hongjian Dou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

[CV-76] Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry

Link: https://arxiv.org/abs/2605.27952
Authors: Haolan Zhang,Thanh Nguyen Canh,Chenghao Li,Ziyan Gao,Xiongwen Jiang,Nak Young Chong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Submitted

Abstract:Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%–80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
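
The contrast between threshold-based gating and continuous attenuation can be illustrated on a toy estimation problem. The sketch below is not Con-DSO: it estimates a single scalar from noisy observations, and the inverse-variance weights are a textbook choice assumed here, not the weighting the paper uses.

```python
import numpy as np

def hard_gated_mean(obs: np.ndarray, sigma: np.ndarray, threshold: float) -> float:
    """Discard every observation whose predicted uncertainty exceeds the threshold."""
    keep = sigma <= threshold
    return float(obs[keep].mean())

def soft_weighted_mean(obs: np.ndarray, sigma: np.ndarray) -> float:
    """Keep all observations but down-weight uncertain ones (inverse-variance weights)."""
    w = 1.0 / sigma**2
    return float(np.sum(w * obs) / np.sum(w))

# Three reliable observations near 1.0 and one unreliable outlier.
obs = np.array([1.0, 1.1, 0.9, 5.0])
sigma = np.array([0.1, 0.1, 0.1, 2.0])

gated = hard_gated_mean(obs, sigma, threshold=1.0)
soft = soft_weighted_mean(obs, sigma)
print(gated, soft)
```

The soft estimate still uses the outlier, but its influence shrinks smoothly with its uncertainty, so no threshold has to be tuned.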

[CV-77] Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

Link: https://arxiv.org/abs/2605.27950
Authors: Long Li,Yuning Huang,Heather A. Eicher-Miller,J. Graham Thomas,Fengqing Zhu,Edward Sazonov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants’ self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data included egocentric image sequences captured during eating and paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether the visual information was relevant to these responses, we evaluated a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

[CV-78] SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images

Link: https://arxiv.org/abs/2605.27938
Authors: Sky Cen,Wufei Ma,Guofeng Zhang,Alan Yuille,Adam Kortylewski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
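
For readers unfamiliar with the PCK@0.1 metric quoted above, here is a minimal sketch of how it is typically computed. The normalisation by the longer side of the object bounding box is an assumption about the convention (it is the common choice for SPair-71k); the function name and toy keypoints are hypothetical.

```python
import numpy as np

def pck(pred: np.ndarray, gt: np.ndarray, bbox_hw: tuple[float, float], alpha: float = 0.1) -> float:
    """Percentage of Correct Keypoints.

    pred, gt: (K, 2) arrays of keypoint coordinates.
    bbox_hw:  (height, width) of the object bounding box (assumed normaliser).
    A keypoint counts as correct if its error is at most alpha * max(height, width).
    """
    threshold = alpha * max(bbox_hw)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= threshold))

gt = np.array([[10.0, 10.0], [50.0, 40.0], [80.0, 90.0]])
pred = np.array([[12.0, 11.0], [50.0, 55.0], [81.0, 90.0]])
score = pck(pred, gt, bbox_hw=(100, 80))
print(score)  # 2 of 3 keypoints fall within the 10-pixel threshold
```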

[CV-79] Structure-Guided Visual Perturbation Neutralization for LVLMs

Link: https://arxiv.org/abs/2605.27927
Authors: Yuanhe Zhang,Xueting Wang,YanBin Ren,Haoran Gao,Xinhan Zheng,Zhenhong Zhou,Fanyu Meng,Li Sun,Sen Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source on this https URL.

[CV-80] SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization

Link: https://arxiv.org/abs/2605.27924
Authors: Peiyu Zhuang,Jianquan Yang,Haodong Li,Zhuoying Cai,Ruitao Xie,Jishen Zeng,Baoying Chen,Jiwu Huang,Xiaochun Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of (original, edited) pairs that are structurally identical to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We’ll release the full codebase as soon as the paper is accepted.
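
The F1 and IoU figures above are pixel-level mask metrics. As a rough illustration of how such scores are computed for a predicted manipulation mask against a ground-truth mask, consider the sketch below; the function name and the toy masks are hypothetical, and the paper's exact evaluation protocol (thresholding, per-image versus dataset averaging) is not specified in the abstract.

```python
import numpy as np

def mask_f1_iou(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Pixel-level F1 and IoU between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = int(np.logical_and(pred, gt).sum())   # manipulated pixels found
    fp = int(np.logical_and(pred, ~gt).sum())  # false alarms
    fn = int(np.logical_and(~pred, gt).sum())  # missed pixels
    # Two empty masks are treated as a perfect match.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return float(f1), float(iou)

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True      # 4 manipulated pixels
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True    # 6 predicted pixels, 4 of them correct

f1, iou = mask_f1_iou(pred, gt)
print(f1, iou)
```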

[CV-81] Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

Link: https://arxiv.org/abs/2605.27923
Authors: Sudip Vhaduri,Ryan Gammon,Sayanton Dibbo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments:

Abstract:The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200–500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

[CV-82] Rethinking Video-Language Model from the Language Input Perspective AAAI2026

Link: https://arxiv.org/abs/2605.27920
Authors: Xiang Fang,Wanlong Fang,Changshuo Wang,Xiaoye Qu,Daizong Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in AAAI 2026

Abstract:Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by the specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. It is observed that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as the plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

[CV-83] Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning ICML2026

Link: https://arxiv.org/abs/2605.27900
Authors: Yuting Ma,Lechao Cheng,Xiaohua Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been accepted by ICML 2026

Abstract:Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging its strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic update and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning, where a supervised fine-tuning stage enables rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
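
The "simple parameter aggregation" that the abstract contrasts against is most commonly a sample-size-weighted average of client parameters (FedAvg). The sketch below shows that baseline only, under the assumption that FedAvg is what is meant; it does not implement FedDTL's decoupled encoders or its two-stage local fine-tuning, and all names are illustrative.

```python
import numpy as np

def fedavg(client_params: list[dict[str, np.ndarray]], client_sizes: list[int]) -> dict[str, np.ndarray]:
    """Average each named parameter across clients, weighted by local dataset size."""
    total = float(sum(client_sizes))
    return {
        name: sum(params[name] * (n / total) for params, n in zip(client_params, client_sizes))
        for name in client_params[0]
    }

# Two clients holding 100 and 300 samples respectively.
clients = [
    {"w": np.array([1.0, 2.0])},
    {"w": np.array([3.0, 6.0])},
]
global_params = fedavg(clients, client_sizes=[100, 300])
print(global_params["w"])  # 0.25 * [1, 2] + 0.75 * [3, 6]
```

Because every client optimises on its own data before this averaging step, heterogeneous data can pull the local models in different directions, which is the inconsistency the abstract refers to.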

[CV-84] Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs AAAI2026

Link: https://arxiv.org/abs/2605.27894
Authors: Xiang Fang,Wanlong Fang,Changshuo Wang,Keke Tang,Daizong Liu,Siyi Wang,Wei Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in AAAI 2026

Abstract:Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. In practice, real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boost training generalization ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process the incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

[CV-85] SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation

Link: https://arxiv.org/abs/2605.27893
Authors: Lingyu Xiong,Jinjin Shi,Xuran Xu,Cong Luo,Runyu Shi,Ying Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

[CV-86] SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Link: https://arxiv.org/abs/2605.27891
Authors: Zhida Zhang,Jie Ma,Zhan Peng,Haoxue Wu,Yang Han,Jun Liang,Jie Cao,Jing Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

[CV-87] Reflective Dialogue between Teacher and Solver Agents for Video Question Answering CVPR2026

Link: https://arxiv.org/abs/2605.27885
Authors: Takuya Murakawa,Toru Tamaki
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper serves as the technical report for the 1st Cross-Domain EgoCross Challenge @ EgoVis Workshop, CVPR 2026

Abstract:Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) – a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.
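
The Teacher/Solver exchange described above can be sketched as a loop that builds a message history from the support set. Everything here is a hypothetical illustration: `solve` and `explain` stand in for VLM calls, and the message format and feedback wording are assumptions rather than the authors' prompts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SupportItem:
    question: str
    answer: str  # ground-truth label from the support set

def build_reflective_dialogue(
    support: list[SupportItem],
    solve: Callable[[str, list[dict]], str],
    explain: Callable[[str, str, bool], str],
) -> list[dict]:
    """Build a multi-turn Teacher/Solver history to reuse as inference-time context."""
    history: list[dict] = []
    for item in support:
        history.append({"role": "teacher", "content": item.question})
        guess = solve(item.question, history)            # Solver answers
        history.append({"role": "solver", "content": guess})
        correct = guess == item.answer
        history.append({"role": "teacher", "content": "Correct." if correct else "Incorrect."})
        # Solver reflects on both correct and incorrect answers.
        history.append({"role": "solver", "content": explain(item.question, guess, correct)})
    return history

# Stub agents for demonstration only.
support = [SupportItem("Q1", "A"), SupportItem("Q2", "B")]
history = build_reflective_dialogue(
    support,
    solve=lambda question, hist: "A",
    explain=lambda question, guess, ok: f"Reflection on {guess}: {'kept' if ok else 'revised'}",
)
print(len(history))  # four turns per support question
```

At test time the returned history would be prepended to the new question, so adaptation happens purely through context, with no weight updates.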

[CV-88] A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning

Link: https://arxiv.org/abs/2605.27884
Authors: Joshua Kofi Asamoah,Blessing Agyei Kyem,Armstrong Aboah
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages (double column), 7 Tables, 11 Figures

Abstract:City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.
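
The reported gains are relative reductions in MAE, MSE, and RMSE. A minimal sketch of how these quantities are computed on predicted versus ground-truth traffic maps follows; the toy arrays and function names are illustrative, not taken from the paper.

```python
import numpy as np

def forecast_errors(pred: np.ndarray, target: np.ndarray) -> dict[str, float]:
    """Mean absolute error, mean squared error, and root mean squared error."""
    diff = pred - target
    mae = float(np.mean(np.abs(diff)))
    mse = float(np.mean(diff**2))
    return {"MAE": mae, "MSE": mse, "RMSE": float(np.sqrt(mse))}

def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` lowers an error relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Toy 2x2 "traffic maps".
target = np.array([[0.0, 2.0], [4.0, 6.0]])
pred = np.array([[1.0, 2.0], [4.0, 4.0]])
errors = forecast_errors(pred, target)
print(errors)
print(relative_reduction(baseline=2.0, ours=1.5))  # 25.0
```

Since RMSE is the square root of MSE, a given percentage drop in MSE corresponds to a smaller percentage drop in RMSE, which is consistent with the abstract's 10.0% versus 5.1% figures.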

[CV-89] ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

Link: https://arxiv.org/abs/2605.27852
Authors: Yu Zhang,Yidi Shao,Wenqi Ouyang,Yushi Lan,Zhexin Liang,Chengrui Wu,Xudong Xu,Xingang Pan
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios – body-driven garments, robotic manipulation, and free-fall collisions – under a single model and achieves approximately 4–9× lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.

[CV-90] A self-supervised learning approach to deep filter banks for texture recognition

Link: https://arxiv.org/abs/2605.27843
Authors: Joao B. Florindo,Lucas O. Lyra,Antonio E. Fabris
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:An important challenge in texture recognition is the limited amount of training data frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is the use of a pretraining stage where the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is the masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is compacted within a delimited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, here we propose a framework where the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods in different texture databases, confirming its potential both in terms of classification accuracy and computational complexity.

[CV-91] Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security AAAI2026

Link: https://arxiv.org/abs/2605.27823
Authors: Xiang Fang,Wanlong Fang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in AAAI 2026

Abstract:Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanisms, resulting in harmful or inappropriate outputs. Such attacks, including jailbreaking and prompt injection, pose significant risks to the integrity and availability of LLMs in security-critical applications. This paper proposes the Adversarial Prompt Disentanglement (APD) framework, a novel defense mechanism that proactively identifies and neutralizes malicious components in input prompts before they are processed by the LLM. The APD framework integrates three key innovations: (1) a mutual information-based semantic decomposition method to isolate adversarial and benign prompt components, ensuring statistical independence; (2) a graph-based intent classification approach that leverages spectral analysis to detect malicious patterns in prompt semantics; and (3) a lightweight transformer-based classifier trained on real-world datasets of toxic and jailbreaking prompts, enabling efficient and accurate adversarial intent detection. Evaluated on diverse datasets containing adversarial prompts, APD demonstrates superior robustness, reducing harmful output generation by over 85% while maintaining negligible impact on model performance. The framework’s computational efficiency supports real-time deployment, making it a practical solution for securing LLMs. Our work addresses critical challenges in machine learning security on novel attacks and integrity methods for ML systems, and offers a scalable, ethically grounded defense against prompt-based adversarial threats.

[CV-92] Turning Video Models into Generalist Robot Policies

Link: https://arxiv.org/abs/2605.27817
Authors: Sizhe Lester Li,Evan Kim,Xingjian Bai,Tong Zhao,Tao Pang,Max Simchowitz,Vincent Sitzmann
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: project page: this https URL

Abstract:Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: this https URL.
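
The abstract says only that the IDM is "based on the robot embodiment Jacobian" and gives no further detail. As background, the sketch below shows the textbook way a Jacobian relates a desired change in observed position to a joint-space action (damped least squares). It is generic differential kinematics under assumed names and a made-up Jacobian, not VERA's learned IDM.

```python
import numpy as np

def joint_step(jacobian: np.ndarray, delta_x: np.ndarray, damping: float = 1e-3) -> np.ndarray:
    """Damped least-squares solution dq of jacobian @ dq ~= delta_x.

    jacobian: (m, n) matrix mapping joint velocities to task-space velocities.
    delta_x:  (m,) desired task-space displacement (e.g. inferred from predicted video frames).
    """
    J = jacobian
    # Regularise J J^T so the solve stays well-conditioned near singularities.
    JJt = J @ J.T + damping * np.eye(J.shape[0])
    return J.T @ np.linalg.solve(JJt, delta_x)

# Hypothetical 3-joint arm moving a point in the plane.
J = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
delta_x = np.array([0.1, -0.2])
dq = joint_step(J, delta_x)
print(dq, J @ dq)
```

The appeal of a Jacobian-based formulation is that the mapping is specific to the embodiment while the video planner that proposes `delta_x` can stay embodiment-agnostic.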

[CV-93] Pattern Recognition Tasks with Personalized Federated Learning

Link: https://arxiv.org/abs/2605.27816
Authors: Md. Arifur Rahman,Isha Das,Mushfiqur Rahman Abir,B. M. Taslimul Haque,Abdullah Al Noman,Abir Ahmed,Md. Jakir Hossen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Comprehensive comparative analysis of 7 Personalized Federated Learning algorithms across MNIST, SignMNIST, and Digit5 datasets. The paper presents detailed methodology, workflow architecture, experimental evaluation, and privacy-preserving AI analysis for distributed intelligent systems, secure collaborative learning, and critical infrastructure applications

Abstract:Personalized Federated Learning (PFL) constitutes a novel paradigm that tailors Machine Learning (ML) models to individual clients, thereby furnishing personalized model updates whilst upholding stringent data privacy principles. Diverging from conventional standard Federated Learning (FL) approaches, PFL adapts models to distinct client data distributions, engendering heightened levels of accuracy, customization, and data security, all while minimizing communication overhead. This methodology proves particularly salient in contexts marked by pattern recognition tasks reliant upon heterogeneous data sources and underpinned by paramount privacy apprehensions. In the present research endeavor, this article undertakes a comprehensive comparative analysis of seven distinct PFL algorithms deployed across three diverse datasets, namely MNIST, SignMNIST, and Digit5. The overarching objective entails ascertaining the preeminent PFL algorithm, within the framework of pattern recognition tasks, through a rigorous evaluation anchored in metrics encompassing Accuracy, Precision, Recall, and F1 Score. Concurrently, an in-depth scrutiny of these PFL algorithms is conducted, elucidating their operative workflows, advantages, and limitations. Through empirical investigation, the findings evince that APPLE, FedGC, and FedProto emerge as stalwart contenders, consistently furnishing superior performance across the spectrum of assessed datasets, while acknowledging the contextual specificity of alternative algorithms and the potential for iterative refinement to realize optimal outcomes.

[CV-94] Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

Link: https://arxiv.org/abs/2605.27813
Authors: Calvin Yeung,Prathyush Poduval,Ali Zakeri,Zhuowen Zou,Mohsen Imani
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.
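
The residualization step described in this abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `residualize` and `reconstruct`, the bias-free least-squares fit, and the random stand-in data are all assumptions made for illustration. It fits one linear map between each pair of neighboring timesteps and keeps the initial activation plus the per-step residuals.

```python
import numpy as np

def residualize(trajs):
    """trajs: array of shape (N, T, D), N activation trajectories over T timesteps.

    Returns the initial activations (N, D), the residuals (N, T-1, D), and
    one least-squares linear map per step (T-1, D, D).
    """
    n, t, d = trajs.shape
    maps = np.zeros((t - 1, d, d))
    residuals = np.zeros((n, t - 1, d))
    for step in range(t - 1):
        x_now, x_next = trajs[:, step], trajs[:, step + 1]
        # Least-squares fit of x_next ~ x_now @ W (no bias term; an assumption).
        w, *_ = np.linalg.lstsq(x_now, x_next, rcond=None)
        maps[step] = w
        # Whatever the linear predictor cannot explain is kept as the residual.
        residuals[:, step] = x_next - x_now @ w
    return trajs[:, 0], residuals, maps

def reconstruct(x0, residuals, maps):
    """Invert residualize: roll the linear dynamics forward and add residuals."""
    steps = [x0]
    for step in range(maps.shape[0]):
        steps.append(steps[-1] @ maps[step] + residuals[:, step])
    return np.stack(steps, axis=1)

# Random stand-in for collected diffusion activations (hypothetical data).
rng = np.random.default_rng(0)
trajs = rng.normal(size=(64, 5, 8))
x0, res, maps = residualize(trajs)
recon = reconstruct(x0, res, maps)
```

In this sketch the pair (`x0`, `res`) is the residualized representation an SAE would be trained on, and `reconstruct` is the mapping back into activation space.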

[CV-95] CuriosAI Submission to the CASTLE Challenge at EgoVis 2026 CVPR

Link: https://arxiv.org/abs/2605.27800
Authors: Yuto Kanda,Hayato Tanoue,Takayuki Hori
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 4th place solution for the CASTLE Challenge at the CVPR EgoVis Workshop 2026

Click to view abstract

Abstract:CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.

[CV-96] Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

Link: https://arxiv.org/abs/2605.27764
Authors: Yuchen Guo,Junli Gong,Hongmin Cai,Yiu-ming Cheung,Weifeng Su
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often at the intent-level, which includes the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

[CV-97] AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

Link: https://arxiv.org/abs/2605.27761
Authors: Yifan Sui,Xin Huang,Hongbing Li,Fang Xu,Jiahe Lv,Haolong Yan,Yeqing Shen,Litao Liu,Zhimin Fan,Ziyang Meng,Jia Wang,Junbo Qi,Kaijun Tan,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Osamu Yoshie
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: 11 pages, 6 figures. Preprint

Click to view abstract

Abstract:The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent’s visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.

[CV-98] Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

Link: https://arxiv.org/abs/2605.27748
Authors: Niccolò Ferrari,Oligert Osmani,Evelina Lamma
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 57 pages, 7 figures

Click to view abstract

Abstract:Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986. 
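
The claim that Euclidean nearest-neighbour search after whitening implements Mahalanobis retrieval can be checked numerically. The sketch below is illustrative only: the helper `fit_whitener`, the shrinkage-to-identity regulariser, and the synthetic features are assumptions, not the paper's estimator or pipeline.

```python
import numpy as np

def fit_whitener(features, shrinkage=0.1):
    """Estimate a regularised covariance and a matching whitening matrix."""
    mean = features.mean(axis=0)
    cov = np.cov(features - mean, rowvar=False)
    d = cov.shape[0]
    # Shrink toward a scaled identity (one common regulariser; an assumption here).
    cov_reg = (1 - shrinkage) * cov + shrinkage * (np.trace(cov) / d) * np.eye(d)
    # cov_reg = L L^T, so W = L^{-1} satisfies W^T W = cov_reg^{-1}.
    chol = np.linalg.cholesky(cov_reg)
    whiten = np.linalg.inv(chol)
    return mean, cov_reg, whiten

# Synthetic correlated "normal patch" features (hypothetical data).
rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
mean, cov_reg, whiten = fit_whitener(normal)

a, b = normal[0], normal[1]
diff = a - b
# Mahalanobis distance computed directly from the regularised covariance.
mahalanobis = np.sqrt(diff @ np.linalg.inv(cov_reg) @ diff)
# Plain Euclidean distance between the whitened embeddings.
euclid_whitened = np.linalg.norm(whiten @ (a - mean) - whiten @ (b - mean))
```

Because the two distances agree, a memory bank of whitened features can be queried with an ordinary Euclidean nearest-neighbour index.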

[CV-99] Bounded-Compute Multimodal Regression for Product-Rating Prediction CVPR2026

Link: https://arxiv.org/abs/2605.27737
Authors: William Leach,Ru He,Sizhuo Ma,Yizhen Jia,Min Cao,Jian Wang,Rick Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the LoViF Workshop at CVPR 2026. 8 pages, 2 figures

Click to view abstract

Abstract:Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
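
As a rough illustration of feature-based regression, the NumPy sketch below pools decoder hidden states and passes them through a two-layer MLP to produce one scalar, then computes PLCC as the Pearson correlation. The masked mean pooling, the ReLU activation, the layer sizes, and the names `regression_head` and `plcc` are assumptions; the abstract only states that a lightweight two-layer MLP is fed by pooled decoder states.

```python
import numpy as np

def regression_head(hidden_states, mask, w1, b1, w2, b2):
    """hidden_states: (T, H) decoder states; mask: (T,) with 1.0 for real tokens."""
    # Masked mean pooling over the sequence (the pooling choice is an assumption).
    pooled = (hidden_states * mask[:, None]).sum(axis=0) / mask.sum()
    hidden = np.maximum(pooled @ w1 + b1, 0.0)  # ReLU; the activation is an assumption
    return float(hidden @ w2 + b2)              # one scalar rating, no text decoding

def plcc(pred, target):
    """Pearson linear correlation coefficient between predictions and targets."""
    return float(np.corrcoef(pred, target)[0, 1])

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
H, M = 16, 8
w1, b1 = rng.normal(size=(H, M)), np.zeros(M)
w2, b2 = rng.normal(size=M), 0.0
states = rng.normal(size=(10, H))
mask = np.array([1.0] * 7 + [0.0] * 3)  # last three positions are padding
score = regression_head(states, mask, w1, b1, w2, b2)
```

A head like this costs a single forward pass with a fixed amount of computation, which is the property that matters under a strict latency budget.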

[CV-100] Explicit Critic Guidance for Aligning Diffusion Models

Link: https://arxiv.org/abs/2605.27736
Authors: Zhengyang Liang,Qihang Zhang,Ceyuan Yang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.

[CV-101] Asynchronous Remote Sensing Time-Series Fusion for Cloud Removal and Anytime Reconstruction CVPR2026

Link: https://arxiv.org/abs/2605.27726
Authors: Forouzan Fallah,Chia Yu Hsu,Wenwen Li,Anna Liljedahl,Yezhou Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 MORSE Workshop

Click to view abstract

Abstract:Frequent cloud cover severely limits the usability of Sentinel-2 (S2) optical time series for Earth surface monitoring. Sentinel-1 (S1) SAR provides all-weather complementary observations, but practical S1/S2 fusion remains difficult because acquisitions are irregular and asynchronous. Many existing approaches assume temporally aligned inputs (or require external nearest-date matching) and typically restore only observed timestamps, limiting reconstruction under long gaps and preventing on-demand synthesis. We propose AGFlow (Time Aligned Generative Flow Matching), a spatiotemporal flow-matching model for S1/S2 cloud removal and time-series reconstruction with three capabilities: (1) timestamp-conditioned internal alignment that fuses asynchronous S1 and cloudy S2 observations without preprocessing-based pairing; (2) spatiotemporal, context-aware denoising that models spatial structure jointly with temporal dynamics (rather than independent per-pixel time series); and (3) anytime querying, enabling generation of cloud-free S2 frames at both observed and user-specified timestamps within the monitoring window. We evaluate on the RESTORE-DiT benchmark protocol with quantitative metrics, qualitative comparisons, and component ablations. AGFlow notably improves fully missing-frame reconstruction (MAE and RMSE reduce by 16-19% over RESTORE-DiT) and provides reliable reconstructions under persistent gaps, while also yielding competitive cloud removal performance and flexible temporal querying for downstream tasks such as dense vegetation monitoring.

[CV-102] Structure over Pixels: Learning Variable-Length Visual Programs

Link: https://arxiv.org/abs/2605.27696
Authors: Piotr Wyrwiński,Kacper Dobek,Krzysztof Krawiec
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image’s visual program should be. Using a four-phase curriculum supervised by local rate–distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

[CV-103] Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

Link: https://arxiv.org/abs/2605.27686
Authors: Kabir Swain,Sijie Han,Daniel Karl I. Weidele,Mauro Martino,Antonio Torralba
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
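
The soft write described here, depositing content as a Gaussian-weighted volume around a continuous 3D location, can be sketched as follows. Only the write step is shown, in NumPy rather than a differentiable framework. The function name `soft_write`, the normalisation of the weights, and the grid and channel sizes are assumptions rather than details from the paper.

```python
import numpy as np

def soft_write(memory, position, content, sigma=1.0):
    """Deposit `content` (C,) into `memory` (X, Y, Z, C) around a continuous
    3D `position` given in voxel coordinates, using Gaussian weights."""
    axes = [np.arange(size) for size in memory.shape[:3]]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (X, Y, Z, 3)
    sq_dist = ((grid - position) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Normalise so one write deposits exactly `content` in total (an assumption).
    weights /= weights.sum()
    return memory + weights[..., None] * content

# The memory size is fixed, independent of how many tokens are written.
memory = np.zeros((8, 8, 8, 4))
memory = soft_write(memory, np.array([3.2, 4.7, 1.4]), np.ones(4))
```

Every operation in the write is smooth in `position`, which is what would let a predicted location receive gradients if the same arithmetic were expressed in an autodiff framework.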

[CV-104] Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

Link: https://arxiv.org/abs/2605.27616
Authors: Zijian Du,Oleg Rybakov
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.

[CV-105] Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

Link: https://arxiv.org/abs/2605.27595
Authors: Partho Ghose,Al Bashir,Prem Raj,Azlan Zahid
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations, outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLAVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent); few-shot prompting improved performance up to 86.8 percent, but the models still exhibited false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

[CV-106] ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes

Link: https://arxiv.org/abs/2605.27590
Authors: Zihang Cheng,Duanchu Wang,Cheng Li,Jing Huang,Huanzhao Fu,Di Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 14 pages, 5 figures, 4 tables

Click to view abstract

Abstract:Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.

[CV-107] What-If World: A Causal Benchmark for General World Models in Embodied Scenarios

Link: https://arxiv.org/abs/2605.27589
Authors: Kunlin Cai,Rui Song,Jinghuai Zhang,Kaiyuan Zhang,Pranav Bodapati,Alicia Yu,Fnu Suya,Mohammad Rostami,Jiaqi Ma,Yuan Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 38 pages, World Model Benchmark

Click to view abstract

Abstract:Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model’s output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.

[CV-108] Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation

Link: https://arxiv.org/abs/2605.27582
Authors: Hongyu Ding,Sizhuo Zhang,Ziming Xu,Jinwen Guo,Hongxiu Liu,Xingzhi Cheng,Zixuan Chen,Haifei Qi,Duo Wang,Hao Xu,Jieqi Shi,Yifan Zhang,Jing Huo,Jian Cheng,Yang Gao,Jiebo Luo
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Click to view abstract

Abstract:Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits a semantic-level directional command and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (Wheeled, Quadruped, Humanoid robot, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent’s most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent’s next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent training-based navigation foundation models that consume millions of samples and thousands of GPU-hours.

[CV-109] Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Link: https://arxiv.org/abs/2605.27561
Authors: Elena Sergeevna Kozachok,Sergey Sergeevich Seregin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 6 figures, 5 tables, 21 references

Click to view abstract

Abstract:Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four “Melanoma Day” sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P < 0.15 / 0.15-0.50 / ≥ 0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.

[CV-110] Representation-Conditioned Diffusion Models for Guided Training Data Generation

Link: https://arxiv.org/abs/2605.27495
Authors: Nithesh Chandher Karthikeyan,Jonas Unger,Gabriel Eilertsen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100), by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

[CV-111] Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Link: https://arxiv.org/abs/2605.27487
Authors: Andrii Ahitoliev,Pavlo Berezin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures. Submitted to ICTERI 2026

Abstract:Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

[CV-112] Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

Link: https://arxiv.org/abs/2605.27467
Authors: Ye Kyaw Thu,Thazin Myint Oo,Thepchai Supnithi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

Abstract:Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
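
The temporal-dropout stress test mentioned above can be illustrated with a minimal sketch. The abstract does not specify the masking scheme, so this assumes each time step is dropped independently with a fixed probability; the function name `temporal_dropout` and its parameters are hypothetical.

```python
import random

def temporal_dropout(sequence, drop_prob, seed=0):
    """Drop time steps independently with probability drop_prob (assumed scheme).

    Returns the surviving observations together with their original time
    indices, so a continuous-time model can still see the elapsed time
    between the observations that remain.
    """
    rng = random.Random(seed)
    kept_steps, kept_times = [], []
    for t, x in enumerate(sequence):
        # rng.random() is in [0, 1), so drop_prob=0 keeps everything
        # and drop_prob=1 drops everything.
        if rng.random() >= drop_prob:
            kept_steps.append(x)
            kept_times.append(t)
    return kept_steps, kept_times
```

Keeping the original time indices matters for the comparison: a discrete-time model sees only a shorter sequence, whereas a continuous-time model can be given the irregular gaps explicitly.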

[CV-113] AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers NEURIPS2026

Link: https://arxiv.org/abs/2605.27465
Authors: Semi Lee,Hyejin Go,Hyesong Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

Abstract:The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
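
To make the idea of "column-wise feature-affinity centrality" more concrete, here is a rough sketch of one way such a salience proxy could be computed. The abstract does not give the formula, so everything here is an assumption: cosine similarity as the affinity, a row-wise softmax, and the column mean as the score. The function names are hypothetical and this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def column_centrality(tokens):
    """Assumed salience proxy: average row-softmaxed affinity each token receives.

    Row i of the affinity matrix is normalised with a softmax, and token j's
    score is the mean of column j, so the scores sum to 1 over all tokens.
    """
    n = len(tokens)
    salience = [0.0] * n
    for i in range(n):
        exps = [math.exp(cosine(tokens[i], tokens[j])) for j in range(n)]
        z = sum(exps)
        for j in range(n):
            salience[j] += exps[j] / z / n
    return salience
```

Under these assumptions a token that many other tokens resemble receives a higher score than an isolated one; how AdaMerge folds such scores into the bipartite matching score is not stated in the abstract.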

[CV-114] Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

Link: https://arxiv.org/abs/2605.27464
Authors: Chung-Ta Huang,Leopold Das,Jeffrey Zhou,Faizaan Siddique,Julia Seungjoo Baek,Serena Liu,Andrew Rusli,Todd Y. Zhou,Freddy Yu,Sinclair Hansen,Ziling Hu,Arnav Sharma,Mengyu Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at this https URL.

[CV-115] D2Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

Link: https://arxiv.org/abs/2605.27460
Authors: Zixiao Hu,Tianyu Li,Guoqing Wang,Wei Li,Guoguo Xin,Xun Liu,Peng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures

Abstract:Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D^2Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D^2Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D^2Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at this https URL.

[CV-116] Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

Link: https://arxiv.org/abs/2605.27452
Authors: Takato Yasuno
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures, 13 tables

Abstract:Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability – a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining this http URL() and batch processing (batch_size=8) achieves 10.06 seconds per image – a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images. 

[CV-117] From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop Competition CVPR2026

Link: https://arxiv.org/abs/2605.27451
Authors: Dimitrios Kollias,Panagiotis Tzirakis,Alan Cowen,Stefanos Zafeiriou,Irene Kotsia,Eric Granger,Marco Pedersoli,Simon Bacon,Jens Madsen,Soufiane Belharbi,Muhammad Haseeb Aslam,Chunchang Shao,Guanyu Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted at CVPR 2026

Abstract:The 10th Affective Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion, and behavior estimation; affect modelling; multimodal learning; benchmarks, datasets, and evaluation protocols; and fairness, robustness, and deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

[CV-118] Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

Link: https://arxiv.org/abs/2605.28697
Authors: Thierry Judge,Nicolas Duchateau,Andreas Østvik,Khuram Faraz,Anders Austlid Taskén,Sigve Karlsen,Thor Edvardsen,Harald Brunvand,Md Abulkalam Azad,Havard Dalen,Bjørnar Grenne,Gabriel Kiss,Pierre-Yves Courand,Lasse Lovstakken,Pierre-Marc Jodoin,Olivier Bernard
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.

[CV-119] Benchmarking Ultrasound Foundation Models for Fetal Plane Classification

Link: https://arxiv.org/abs/2605.27796
Authors: Leya Barrientos,Yuexi Du,Nicha C. Dvornek
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
Comments:

Abstract:Ultrasound is widely used in obstetric care due to its safety, accessibility, and real-time imaging. However, interpretation remains operator-dependent and susceptible to noise and artifacts. Deep learning models have shown strong performance in addressing these problems, but they typically require large annotated datasets that are difficult to obtain in clinical ultrasound. Foundation models (FMs) offer an alternative, using a large number of ultrasound images to learn transferable representations that can generalize with limited labeled data. This work presents a comprehensive benchmark of ultrasound-specific FMs for fetal plane classification. We evaluated four ultrasound FMs (USFM, MOFO, UltraSAM, FetalCLIP) against two CNN baselines (ResNet50, EfficientNet-V2) and a ViT (DINOv3) pretrained on natural images. We trained all models under two complementary settings: full fine-tuning and linear probing with a frozen encoder. All models were trained using 5-fold patient-level cross-validation on a Spanish fetal ultrasound dataset and tested on both in-domain data and an external African cohort to assess cross-population generalization. We found that FetalCLIP achieved the best results in the linear probing setting (F1 = 0.9261 for in-domain, F1 = 0.9731 for out-of-domain), while USFM performed best in the full fine-tuning setting (F1 = 0.9476 for in-domain, F1 = 0.9515 for out-of-domain). MOFO and UltraSAM degraded most in both settings, underperforming natural image pretrained models in some cases. These findings highlight how the choice of pretrained model strongly affects fetal plane classification performance, since different pretraining objectives lead to different levels of transferability.
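
The patient-level 5-fold cross-validation mentioned above means that all images from one patient fall into the same fold, so no patient appears in both training and validation data. A minimal sketch of such a split follows; the round-robin assignment and the name `patient_level_folds` are illustrative assumptions, not the procedure used in the paper.

```python
def patient_level_folds(patient_ids, n_folds=5):
    """Return a fold index per sample such that each patient maps to one fold.

    patient_ids holds one patient identifier per image. Patients are
    assigned to folds round-robin in sorted order (an assumed, simple
    scheme; a real split might also balance class frequencies).
    """
    unique_patients = sorted(set(patient_ids))
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(unique_patients)}
    return [fold_of_patient[pid] for pid in patient_ids]
```

Splitting by image instead of by patient would let near-duplicate frames from one examination leak across folds and inflate validation scores.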

[CV-120] On the Equivariant Learning of the Q-tensor Order Parameter

Link: https://arxiv.org/abs/2605.27679
Authors: Julia Navarro,Mark Wilkinson
Categories: Soft Condensed Matter (cond-mat.soft); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 15 pages (excluding 7-page appendix); 6 figures

Abstract:We construct and evaluate group-equivariant neural networks for the prediction of the two-dimensional Q-tensor order parameter of nematic liquid crystals from synthetically generated microscopic textures. Seven architectures, equivariant to cyclic groups $C_k$ of order $k$ for $k = 4, 8, 16, 32, 64, 128, 256$, are built using a combination of weight-sharing constraints, equivariant activations and regularization techniques. To do this, we construct rotation-like permutation matrix groups with elements $\varrho_{C_k}(g)$ that act on row-wise vectorized images, thereby approximating a $\frac{2\pi}{k}$ rotation of the circular subdomain on square images. We show that all seven equivariant models satisfy the Q-tensor equivariance constraint to within single-precision floating point accuracy. Comparing against approximate parameter-matched non-equivariant benchmarks, with and without data augmentation, we find that the equivariant models consistently achieve lower errors and generalize more robustly to unseen defect configurations. Performance increases with group order, suggesting that the incorporation of finer rotational symmetry leads to lower errors.
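
The equivariance constraint referred to above can be checked numerically on a toy case. The sketch below assumes the common two-dimensional convention Q = S (n n^T - I/2) for a director n at angle phi, for which rotating the texture by theta should map Q to R Q R^T; the normalisation and the function names are assumptions, since the abstract does not state them.

```python
import math

def q_tensor(phi, s=1.0):
    """2D Q-tensor for director angle phi: s * (n n^T - I/2), with n = (cos phi, sin phi)."""
    c2, s2 = math.cos(2 * phi), math.sin(2 * phi)
    return [[0.5 * s * c2, 0.5 * s * s2],
            [0.5 * s * s2, -0.5 * s * c2]]

def rotate(q, theta):
    """Apply the tensor transformation law R q R^T for a rotation by theta."""
    c, s = math.cos(theta), math.sin(theta)
    r = [[c, -s], [s, c]]
    rq = [[sum(r[i][k] * q[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(rq[i][k] * r[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

For a group element of $C_8$, rotating the director by $2\pi/8$ and then computing Q gives the same result as computing Q first and applying `rotate`; an equivariant network is built so that its prediction obeys this relation for every group element.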

[CV-121] NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification

Link: https://arxiv.org/abs/2605.27454
Authors: Ghaleb Aldoboni,Lobna Nassar,Fakhri Karray,Reem Alshamsi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11–10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
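
The two-timescale idea of fast weights accompanied by a slow exponential-moving-average trace can be sketched in a few lines. This is only an illustration: it assumes a plain gradient step for the fast weights (the paper's deep-momentum optimizer is not described in the abstract), and the function name, learning rate, and decay value are hypothetical.

```python
def update_fast_and_slow(fast, slow, grad, lr=0.1, decay=0.99):
    """One step of an assumed fast/slow update.

    fast:  current fast weights (list of floats)
    slow:  slow EMA trace of the fast weights
    grad:  gradient with respect to the fast weights
    """
    # Fast timescale: ordinary gradient descent step.
    new_fast = [w - lr * g for w, g in zip(fast, grad)]
    # Slow timescale: the trace moves only a fraction (1 - decay) toward the fast weights.
    new_slow = [decay * s + (1.0 - decay) * w for s, w in zip(slow, new_fast)]
    return new_fast, new_slow
```

With a decay close to 1 the slow trace changes little per step, which is what lets it act as a stable, long-horizon summary of the rapidly changing fast weights.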

Artificial Intelligence

[AI-0] Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Link: https://arxiv.org/abs/2605.28812
Authors: Jiahe Pan,Stelian Coros,Jitendra Malik,Toru Lin
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project site: this https URL

Abstract:A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features – sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
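
As a rough illustration of what a center-of-pressure feature looks like, the sketch below computes the force-weighted centroid of taxel positions. This is the textbook definition and is assumed here; the paper's exact formulation (for example, how calibrated taxel orientations enter) is not given in the abstract, and the function name is hypothetical.

```python
def center_of_pressure(positions, forces):
    """Force-weighted centroid of taxel positions.

    positions: list of coordinate tuples, one per taxel
    forces:    list of non-negative normal-force readings, one per taxel
    Returns None when there is no contact (total force is zero).
    """
    total = sum(forces)
    if total <= 0:
        return None
    dim = len(positions[0])
    return tuple(
        sum(f * p[k] for p, f in zip(positions, forces)) / total
        for k in range(dim)
    )
```

For two taxels at (0, 0) and (2, 0) with forces 1 and 3, the result is (1.5, 0.0): a continuous quantity that shifts with the load distribution, in contrast to a binary contact flag that would read the same for any non-zero force.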

[AI-1] Calibrating Conservatism for Scalable Oversight

Link: https://arxiv.org/abs/2605.28807
Authors: William Overman,Mohsen Bayati
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.
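
The mechanism described above, a concern-proportional penalty whose strength is calibrated online toward a target violation rate, can be sketched as follows. This is a simplified illustration under stated assumptions: overseer concerns are averaged, the penalty is linear, and the weight follows a basic online update of the kind used in conformal-style controllers. The function names, the step size, and the aggregation are hypothetical, not the paper's definitions.

```python
def select_action(actions, penalty_weight):
    """Pick the action maximising utility minus a concern-proportional penalty.

    actions: list of (name, utility, concerns), where concerns is a list of
    overseer scores in [0, 1] (assumed scale).
    """
    def score(action):
        _, utility, concerns = action
        return utility - penalty_weight * sum(concerns) / len(concerns)
    return max(actions, key=score)[0]

def update_penalty_weight(penalty_weight, violated, target_rate, step=0.1):
    """Raise the weight after a violation, lower it slightly otherwise.

    The weight stops drifting on average only when the long-run violation
    frequency equals target_rate.
    """
    error = (1.0 if violated else 0.0) - target_rate
    return max(0.0, penalty_weight + step * error)
```

With a weight of 0 the higher-utility action always wins; as violations accumulate the weight grows and actions that overseers flag are overridden by more conservative ones.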

[AI-2] CubePart: An Open-Vocabulary Part-Controllable 3D Generator SIGGRAPH2026

Link: https://arxiv.org/abs/2605.28763
Authors: Yiheng Zhu,Kangle Deng,Jean-Philippe Fauconnier,Inaki Navarro,Daiqing Li,Ava Pun,Yinan Zhang,Peiye Zhuang,Xiaoxia Sun,Maneesh Agrawala,Kiran Bhat,Tinghui Zhou
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: SIGGRAPH 2026. Project Page: this https URL

Abstract:Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes - one per schema element - that assemble into a coherent object while respecting the specified semantic structure. To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing. Project Page: this https URL

[AI-3] CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Link: https://arxiv.org/abs/2605.28742
Authors: Linas Nasvytis,Simon Jerome Han,Ben Prystawski,Satchel Grant,Noah D. Goodman,Judith E. Fan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.

[AI-4] BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Link: https://arxiv.org/abs/2605.28739
Authors: Tirtharaj Dash
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Quantitative Methods (q-bio.QM)
Comments: 5 pages; 1 figure, 4 tables

Abstract:Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most 2/d of the weights in each BIR layer are active, where d is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to 96× fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: this https URL.
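
The 2/d sparsity bound follows directly from the wiring: each hidden unit connects to exactly two of the d inputs. A minimal sketch of the connectivity mask makes the arithmetic explicit; the function name and the representation of rules as index pairs are assumptions for illustration, and the rule-mining step itself is not shown.

```python
def bir_layer_mask(rules, d):
    """Build a 0/1 connectivity mask with one row per mined rule.

    rules: list of (i, j) index pairs of distinct features, one per hidden unit
    d:     input dimension
    """
    mask = [[0] * d for _ in rules]
    for unit, (i, j) in enumerate(rules):
        mask[unit][i] = 1  # antecedent feature
        mask[unit][j] = 1  # consequent feature
    return mask
```

With three rules over d = 5 features the mask has 6 active entries out of 15, a fraction of 2/5, and the same ratio holds for any number of rules because every row contributes 2 active entries out of d.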

[AI-5] Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Link: https://arxiv.org/abs/2605.28733
Authors: Xiaohang Feng,Yiling Xie
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

[AI-6] AlphaTransit: Learning to Design City-scale Transit Routes

Link: https://arxiv.org/abs/2605.28730
Authors: Bibek Poudel,Sai Swaminathan,Weizi Li
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at this https URL.
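
How a policy prior and a value estimate can jointly steer tree search is easiest to see in the selection rule. The sketch below uses the AlphaZero-style PUCT score, which is an assumption here: the abstract says only that MCTS is coupled with a policy-value network, not which selection formula is used. The function name, the dictionary layout, and the constant `c_puct` are hypothetical.

```python
import math

def puct_select(children, c_puct=1.5):
    """Select the child action with the highest PUCT score (assumed rule).

    children: dict mapping action -> {"prior": float, "visits": int, "value_sum": float}
    The score is the mean value so far plus an exploration bonus that is
    large for actions with a high policy prior and few visits.
    """
    total_visits = sum(child["visits"] for child in children.values())

    def score(child):
        mean_value = child["value_sum"] / child["visits"] if child["visits"] else 0.0
        bonus = c_puct * child["prior"] * math.sqrt(total_visits) / (1 + child["visits"])
        return mean_value + bonus

    return max(children, key=lambda action: score(children[action]))
```

Because the value estimates come from the network, each candidate route extension can be scored without a simulator rollout inside the tree, which matches the decision-time lookahead the abstract describes.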

[AI-7] Multi-Adapter Representation Interventions via Energy Calibration ICML2026

Link: https://arxiv.org/abs/2605.28722
Authors: Manjiang Yu,Hongji Li,Junwei Chen,Xue Li,Priyanka Singh,Yang Cao,Lijie Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at this https URL.

[AI-8] LiveBrowseComp: Are Search Agents Searching or Just Verifying What They Already Know?

Link: https://arxiv.org/abs/2605.28721
Authors: HuiMing Fan,Xiao Wang,Zheng Chu,Qianyu Wang,Zhuoyao Wang,Ming Liu,Bing Qin,XingYu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge – information encoded in the model before retrieval – rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at this https URL.

[AI-9] OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Link: https://arxiv.org/abs/2605.28717
Authors: Bojie Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei’s Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei’s closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB’s transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB’s load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50’s LUTs.

[AI-10] Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Link: https://arxiv.org/abs/2605.28713
Authors: Guoxin Ma,Yibing Liu,Chengzhengxu Li,Yu Liang,Yan Wang,Yueyang Zhang,Kecheng Chen,Zhaohan Zhang,Zhiyuan Sun,Daiting Shi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

[AI-11] Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

Link: https://arxiv.org/abs/2605.28707
Authors: Aisha Aijaz,Rahul Goel,Arnav Batra,Raghava Mutharaju
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiquity of autonomous systems, most approaches to handling autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning, as they provide little explanation, leaving out imperative contextual and theoretical information that must be included to support accountability. For this, we propose a framework to model moral reasoning as a distribution over normative ethical theories or ethical pluralism. We introduce a normative ethics simplex that integrates these theories. A benchmark of 450 cases across 15 fine-grained subtheories was also prepared for the purposes of stacked ensemble learning. These cases describe ethical dilemmas in natural language and have associated extracted contextual features. The implementation of the simplex was achieved via a two-stream normative-semantic architecture. This is followed by the fusion of normative information and a sequential, stacking ensemble to learn the best fit of the three broad theories: consequentialism, virtue ethics, and deontology, and the 15 subcategories. Our experiments demonstrate that the integration of contextual and normative priors with the semantic embeddings significantly improves the performance of the classification, displaying an accuracy of 88.89%. We conducted ablation studies to show that structured ethical representations contribute beyond analogical reasoning, and the chosen stacking architecture gives the best results due to the gradual learning of granularity. Ethical pluralism is also analyzed through entropy, confidence, and visualization. Thus, modeling ethical pluralism as a probabilistic normative distribution supports human-like moral reasoning, ethical disagreement analysis, and future alignment in AI systems.
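
To make the idea of a distribution over ethical theories concrete, here is a minimal sketch of how entropy and confidence can be read off such a distribution. It assumes raw per-theory scores are turned into probabilities with a softmax, which the abstract does not state; `summarize` and the score inputs are hypothetical and stand in for whatever the paper's stacked ensemble actually outputs.

```python
import math

# The three broad theories named in the abstract.
THEORIES = ("consequentialism", "virtue ethics", "deontology")


def softmax(scores):
    # Numerically stable softmax: maps scores to a point on the probability simplex.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats; higher means the theories are more evenly weighted.
    return -sum(p * math.log(p) for p in probs if p > 0)


def summarize(scores):
    # Hypothetical helper: distribution, its entropy, and the top probability.
    probs = softmax(scores)
    return {
        "distribution": dict(zip(THEORIES, probs)),
        "entropy": entropy(probs),
        "confidence": max(probs),
    }
```

Under this reading, a case with near-uniform weights has high entropy (strong pluralism), while a case dominated by one theory has low entropy and high confidence.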

[AI-12] A Fresh Look at Lamarckian Evolution and the Baldwin Effect PPSN2026

Link: https://arxiv.org/abs/2605.28703
Authors: Inès Benito,Johannes F. Lutzeyer,Benjamin Doerr
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Optimization and Control (math.OC)
Comments: To appear in the proceedings of PPSN 2026

Abstract:Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local search augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian which is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.
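
The difference between the three evolution types can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's setup: the objective is OneMax (number of one-bits) rather than Maximum Independent Set or Maximum Cut, the local search is a single first-improvement bit flip, and `fitness`, `local_search`, and `evaluate` are hypothetical names.

```python
def fitness(bits):
    # Toy objective (OneMax); a stand-in for the paper's graph problems.
    return sum(bits)


def local_search(bits, steps=1):
    # Apply up to `steps` first-improvement bit flips and return the improved copy.
    best = list(bits)
    for _ in range(steps):
        for i in range(len(best)):
            cand = best.copy()
            cand[i] ^= 1
            if fitness(cand) > fitness(best):
                best = cand
                break
        else:
            break  # no improving flip exists
    return best


def evaluate(genotype, mode):
    # Return (genotype passed on to selection, fitness used for selection).
    if mode == "darwinian":
        return list(genotype), fitness(genotype)
    improved = local_search(genotype)
    if mode == "lamarckian":
        return improved, fitness(improved)        # learned improvement is inherited
    if mode == "baldwinian":
        return list(genotype), fitness(improved)  # only the fitness credit is kept
    raise ValueError(f"unknown mode: {mode}")
```

Both Lamarckian and Baldwinian evaluation pay for a local search call, which is why the abstract's runtime ordering depends on how that cost is counted.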

[AI-13] TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Link: https://arxiv.org/abs/2605.28699
Authors: Chusen Li,Zhou Liu,Shuigeng Zhou,Wentao Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 3 figures

Abstract:Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents merely imitate collaboration; iii) a fixed collaboration protocol falls into an oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing the computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to speak and what to say. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at this https URL.
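
The controller-regret layer chooses between two actions, speak or skip, through regret matching. A minimal sketch of classical regret matching over that binary action set follows. It assumes per-turn utilities for both actions are available to the controller; how TRACER actually estimates them is not given in the abstract, and `RegretMatcher` is a hypothetical name.

```python
class RegretMatcher:
    """Classical regret matching over a small finite action set."""

    def __init__(self, actions=("speak", "skip")):
        self.actions = tuple(actions)
        self.cum_regret = {a: 0.0 for a in self.actions}

    def strategy(self):
        # Play each action in proportion to its positive cumulative regret.
        positive = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        total = sum(positive.values())
        if total <= 0:
            # No positive regret yet: fall back to the uniform strategy.
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: p / total for a, p in positive.items()}

    def update(self, utilities):
        # `utilities` maps each action to its (assumed known) utility this turn.
        probs = self.strategy()
        expected = sum(probs[a] * utilities[a] for a in self.actions)
        for a in self.actions:
            self.cum_regret[a] += utilities[a] - expected
```

A controller would call `strategy()` to sample speak or skip for the current round, then `update()` once the turn's utilities are scored.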

[AI-14] VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Link: https://arxiv.org/abs/2605.28683
Authors: Yuting Xu,Jiayi Tian,Jian Liang,Xin Xiong,Hang Zhang,Mu Xu,Xiao-Yu Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures

Abstract:Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical retrieval-reasoning trade-off: the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.

[AI-15] DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting Precise Verification and Fully Parallel Execution

Link: https://arxiv.org/abs/2605.28678
Authors: Yunhai Hu,Zining Liu,Xiangyang Yin,Tianhua Xia,Bo Bao,Eric Sather,Vithursan Thangarasa,Sai Qian Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.

[AI-16] An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

Link: https://arxiv.org/abs/2605.28666
Authors: Luis Miguel Vieira da Silva,Nicolas König,Felix Gehlhoff
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process sequences. Capability-based planning approaches address this by automatically generating plans from semantic knowledge models that describe resource functions in a machine-interpretable form. Their practical use, however, remains limited: solver feedback, especially in the case of unsatisfiability, is difficult to interpret, and the knowledge models require adaptation as operational conditions change or requests become infeasible. This paper presents a hybrid assistance system that augments an existing capability-based Satisfiability Modulo Theories (SMT) planning approach with a Large Language Model (LLM)-based layer for natural-language interaction, explanation, and adaptation. Formal planning correctness remains with the symbolic planner, while the LLM layer handles natural-language access and flexible knowledge model adaptation under explicit Human-in-the-Loop (HitL) approval. The system decomposes into four components: Capability Grounding, Symbolic Planning, Result Interpretation, and Planning Adaptation, realized as a routed agentic workflow in which a central router delegates to five specialized agents. The system is evaluated on a modular production system across four scenario types. Of 23 test cases, 9 of 10 knowledge queries and all 4 satisfiable planning cases were handled correctly, 3 of 4 unsatisfiable cases produced concrete repair proposals, and all 5 adaptive planning scenarios resolved into satisfiable plans through iterative, user-approved knowledge model modifications. The findings confirm that combining formal planning with LLM-based assistance substantially improves accessibility and adaptability in industrial automation.

[AI-17] AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Link: https://arxiv.org/abs/2605.28655
Authors: Shanghua Gao,Ada Fang,Marinka Zitnik
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

[AI-18] The Ethics of LLM Sandbox and Persona Dynamics

Link: https://arxiv.org/abs/2605.28647
Authors: Tim Gebbie,Stewart Gebbie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Risk Management (q-fin.RM)
Comments: 8 pages

Abstract:It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user – this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called “ethical AI” becomes substantively unethical when it substitutes institutional reassurance for contact with reality.

[AI-19] Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

Link: https://arxiv.org/abs/2605.28642
Authors: Yexing Du,Kaiyuan Liu,Youcheng Pan,Bo Yang,Ming Liu,Bing Qin,Yang Xiang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10×. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages (45 × 44 directions). Code and models are released to facilitate reproducible, privacy-aware MLLM S2TT research. The code and models are released at this https URL.

[AI-20] Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

Link: https://arxiv.org/abs/2605.28632
Authors: Ziyang You,Huilong He,Xiaoke Yang,Xuxing Lu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Preprint prepared for submission to IEEE TIFS. 12 pages, 8 figures

Abstract:Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including KGW, Unigram, and DipMark, derive their security guarantees from the assumption that the underlying pseudo-random number generator (PRNG) is trustworthy. This work introduces SeedHijack, the first supply-chain attack on LLM watermarking that is simultaneously (i) blind – requiring no knowledge of the watermark key, detector, or model logits, (ii) integrity-preserving – amplifying rather than erasing the watermark signal, and (iii) orthogonal to detection – the attack-induced bias is statistically independent of all content-side detector statistics, ensuring that amplification and evasion coexist without trade-off. Rather than perturbing generated text, SeedHijack replaces the PRNG at the supply-chain layer, biasing green-list selection without altering output tokens or degrading text quality. Across three watermarking schemes and three open-source LLMs, the attack triggers 0/6 state-of-the-art content-side statistical detectors while inflating the watermark z-score up to 2.42x (system-level defenses such as entropy-source attestation remain orthogonal and complementary). A quantum random number generator (QRNG) countermeasure is shown to fully neutralize the attack while preserving benign watermarking utility. These findings establish PRNG integrity as a first-class security requirement for cryptographic content-provenance systems.
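
To see why the PRNG is a trust anchor for schemes such as KGW, consider a minimal sketch of green-list selection and the detection z-score. This is a simplified illustration, not the reference implementation and not the SeedHijack attack: it uses Python's `random.Random` as a stand-in PRNG, a SHA-256 hash of a hypothetical `key` and the previous token as the seed, and integer token IDs. The z-score formula is the standard one-proportion statistic, z = (hits − γT) / sqrt(Tγ(1 − γ)).

```python
import hashlib
import math
import random


def green_list(prev_token, key, vocab_size, gamma=0.5):
    # Seed a PRNG from the key and previous token, then take the first
    # gamma fraction of a shuffled vocabulary as the green list.
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def z_score(tokens, key, vocab_size, gamma=0.5):
    # Count tokens that fall in the green list seeded by their predecessor.
    scored = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], key, vocab_size, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

Every green list here is a pure function of what the PRNG returns, so whoever controls the PRNG controls which tokens count as green. That dependence is what the abstract's supply-chain attack targets.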

[AI-21] LACUNA: Safe Agents as Recursive Program Holes

Link: https://arxiv.org/abs/2605.28617
Authors: Yaoyu Zhao,Yichen Xu,Oliver Bračevac,Cao Nguyen Pham,Frank Zhengqing Wu,Martin Odersky
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call agentT that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and τ²-bench. On BrowseComp-Plus, 8.6% of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches 27.1% accuracy. On τ²-bench, LACUNA solves 76.0% of 392 tasks across four domains with a capable model, on par with the baseline agent.

[AI-22] Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration KDD2026

Link: https://arxiv.org/abs/2605.28603
Authors: Haonan Wen,Hanyang Chen,Songhe Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by KDD 2026

Abstract:Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under-Cali, an uncertainty-driven dual-expert calibration framework consisting of three core components: an uncertainty estimator, a dual-expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. Subsequently, the system updates the reliable expert and the uncertainty estimator using well-calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under-Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model-agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at this https URL.

[AI-23] Position: Retire the “Positive Backdoor” Label – Secret Alignment Requires Strict and Systematic Evaluation ICML2026

Link: https://arxiv.org/abs/2605.28597
Authors: Jianwei Li,Jung-Eun Kim
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2026

Abstract:This position paper argues that the AI/ML community should stop overclaiming and retire the label “positive backdoor,” and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed not secure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as “positive backdoors” has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness - especially in the confidentiality, integrity, and availability (CIA) - of trigger-behavior mappings often underrepresented by existing claims. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.

[AI-24] Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

Link: https://arxiv.org/abs/2605.28588
Authors: Luca Beurer-Kellner,Aleksei Kudrinskii,Marco Milanta,Kristian Bonde Nielsen,Hemang Sarkar,Liran Tal
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 pages, technical report

Abstract:We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical-level security issue and at least 8 manually confirmed malicious skills remain publicly available on this http URL as of the date of publication. This report documents our methodology, presents a threat taxonomy based on real-world samples, and details the attack patterns we observed. As skill marketplaces grow rapidly and AI agents gain access to sensitive credentials and systems, automated security analysis is no longer optional.

[AI-25] SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving IJCNN2026

Link: https://arxiv.org/abs/2605.28583
Authors: Kangyu Wu,Peng Cui,Guoxi Chen,Ya Zhang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 7 pages, 4 figures, accepted by IJCNN 2026

Abstract:Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

[AI-26] MUSE: Benchmarking Manufacturable Functional and Assemblable Text-to-CAD Generation

Link: https://arxiv.org/abs/2605.28579
Authors: Xiaoyu Dong,Zhi Li,Xiao-Ming Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 26 pages

Abstract:Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial product design. Existing benchmarks focus primarily on generating single-part CAD models and evaluate them using geometric similarity metrics that fail to capture functionality, manufacturability, and assemblability. To address this gap, we introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies. MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment. The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality. To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation. Experiments on closed-source and open-source LLMs reveal a clear failure cascade from executable code to valid geometry and finally to engineering-ready design, with even the strongest models achieving limited success on fine-grained engineering criteria. Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design. Our project website, including the leaderboard, dataset, and code, is available at this https URL.

[AI-27] Continual Model Routing in Evolving Model Hubs ICML2026

Link: https://arxiv.org/abs/2605.28577
Authors: Jack Bell,Giacomo Carfì,Gerlando Gramaglia,Vincenzo Lomonaco
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 42 pages, 24 tables, 6 figures, to be published at ICML 2026

Abstract:AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

[AI-28] A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Link: https://arxiv.org/abs/2605.28575
Authors: Jianheng Dai,Jiazhang Liang,Sijie Mai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities and inducing gradient norm conflicts that destabilize training. To address this, we propose a Conflict-aware Penalty (CP) that detects and penalizes gradient norm conflicts at each training step, and a Statistical Loss (SL) that aligns predicted distribution statistics with empirical input statistics. Crucially, CP prevents dominant modality gradients from interfering with the SL objective, enabling synergistic training within a unified framework incorporating adaptive modality encoding, gated cross-modal fusion, and unimodal auxiliary heads. Experiments on CMU-MOSI demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of each component.

[AI-29] Efficient Pre-Training of LLMs through Truncated SVD Layers

Link: https://arxiv.org/abs/2605.28573
Authors: Kaivan Kamali,Kajetan Schweighofer,Hormoz Shahrzad,Olivier Francon,Babak Hodjat,Risto Miikkulainen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representation and orthonormal weight matrices could in principle reduce parameter counts and computational overhead, most existing methods rely on static rank selection and do not enforce weight orthonormality due to high computational cost. This paper introduces TSVD, a framework that maintains low rank and strict orthonormality throughout the training process. It utilizes a spectral energy-based heuristic for adaptive rank selection, and a caching mechanism to maintain orthonormality. Theoretical analysis justifies the advantage of the approach in pretraining dynamics and experiments across various model scales demonstrate that it is effective empirically. TSVD matches or exceeds the performance of full-parameter baselines while significantly reducing compute requirements. The approach thus offers a well-founded, practical, and scalable path toward efficient high-performance LLM pretraining.
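
The abstract does not spell out the spectral energy-based heuristic, so the following is only a minimal sketch of one common energy criterion, not TSVD's actual rule: keep the smallest rank whose top singular values retain a chosen fraction of the total squared spectrum. The function name `select_rank` and the parameter `energy_threshold` are hypothetical.

```python
def select_rank(singular_values, energy_threshold=0.95):
    """Return the smallest rank k whose top-k singular values retain at
    least `energy_threshold` of the spectral energy (sum of squares).

    Assumption: this is a generic energy criterion, not the paper's rule.
    """
    if not singular_values:
        raise ValueError("need at least one singular value")
    if not 0.0 < energy_threshold <= 1.0:
        raise ValueError("energy_threshold must be in (0, 1]")

    # Squared singular values, largest first.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 1  # degenerate all-zero spectrum: keep a single component

    cumulative = 0.0
    for k, energy in enumerate(energies, start=1):
        cumulative += energy
        if cumulative / total >= energy_threshold:
            return k
    return len(energies)
```

With a sharply decaying spectrum such as `[10.0, 1.0, 0.1]` the first component already carries more than 95% of the energy, so the selected rank is 1; a flat spectrum needs proportionally more components.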

[AI-30] Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Link: https://arxiv.org/abs/2605.28567
Authors: Tue M. Cao,Nguyen Do,My T. Thai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: preprint

Abstract:Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses that remain difficult to scale are (1) matching semantically similar features across multiple layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector as in the literature, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method compresses large feature circuits into interpretable supernodes automatically.
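
To make the idea of comparing activation-weighted distributions concrete, here is a minimal one-dimensional sketch. It assumes each feature has already been projected to scalar positions in a shared space (the paper works with distributions over hidden states, which this simplifies away), and the function name `wasserstein_1d` is hypothetical. Weights are normalized, so rescaling all activations of a feature leaves the distance unchanged.

```python
def wasserstein_1d(xs, wx, ys, wy):
    """Wasserstein-1 distance between two weighted point sets on a line.

    xs, ys: scalar positions; wx, wy: non-negative activation weights.
    Computed as the integral of |CDF_x - CDF_y| over the merged support.
    """
    total_x, total_y = sum(wx), sum(wy)
    if total_x <= 0 or total_y <= 0:
        raise ValueError("each distribution needs positive total weight")

    # Each event is (position, mass added to CDF_x, mass added to CDF_y).
    events = [(x, w / total_x, 0.0) for x, w in zip(xs, wx)]
    events += [(y, 0.0, w / total_y) for y, w in zip(ys, wy)]
    events.sort()

    cdf_x = cdf_y = 0.0
    distance = 0.0
    previous = events[0][0]
    for position, mass_x, mass_y in events:
        # Area between the two CDFs since the previous support point.
        distance += abs(cdf_x - cdf_y) * (position - previous)
        cdf_x += mass_x
        cdf_y += mass_y
        previous = position
    return distance
```

Two point masses two units apart are at distance 2, and multiplying every weight of one feature by a constant does not change the result.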

[AI-31] Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

Link: https://arxiv.org/abs/2605.28566
Authors: Guni Sharon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Extended version of the SoCS 2026 paper. Includes appendices omitted from the proceedings version

Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process – auto-regressive token prediction – is inherently myopic and prone to cascading errors. To address this, the Tree-of-Thoughts (ToT) framework creates a search space over intermediate reasoning steps, allowing search models to explore, look ahead, and backtrack. However, current ToT research remains fragmented across Natural Language Processing and Automated Planning communities, often using inconsistent terminology and ad-hoc implementations. Consequently, we synthesize the ToT landscape through a unified taxonomy based on classical heuristic search terminology. We map LLM-based reasoning to classical search components: state representation (granularity of thoughts), successor generation (prompting operators), and heuristic evaluation (self-assessment of progress). We analyze existing work within the context of our taxonomy and identify emerging design patterns: systematic search (Best-First Search) for shallow, deterministic tasks and lookahead-heavy strategies (DFS, MCTS) for deep multi-step reasoning. We conclude by identifying open algorithmic challenges at the intersection of heuristic search and LLM reasoning, and call on the heuristic search community to engage with this emerging domain.
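
The mapping from reasoning to search components can be illustrated with a small best-first search. This is a generic sketch, not code from the paper: `successors` stands in for the prompting operator that proposes next thoughts, `heuristic` for the model's self-assessment of progress, and the arithmetic toy problem is purely hypothetical.

```python
import heapq
from itertools import count


def best_first_search(start, successors, heuristic, is_goal, max_expansions=1000):
    """Expand the state with the lowest heuristic value first.

    Returns the path from start to a goal state, or None if none is found.
    """
    tie_breaker = count()  # keeps heap entries comparable on equal scores
    frontier = [(heuristic(start), next(tie_breaker), start, [start])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                entry = (heuristic(nxt), next(tie_breaker), nxt, path + [nxt])
                heapq.heappush(frontier, entry)
    return None


# Toy stand-in for a reasoning task: reach 10 from 1 using "+1" or "*2".
TARGET = 10
toy_path = best_first_search(
    start=1,
    successors=lambda n: [n + 1, n * 2],
    heuristic=lambda n: abs(TARGET - n),
    is_goal=lambda n: n == TARGET,
)
```

In an LLM setting each state would be a partial chain of thoughts and both callables would be model calls, which is where the evaluation cost of the search concentrates.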

[AI-32] A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

Link: https://arxiv.org/abs/2605.28563
Authors: Aditya Kommineni,Emily Zhou,Kleanthis Avramidis,Tiantian Feng,Shrikanth Narayanan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 5 figures

Abstract:Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Empirical analysis of both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across 6 different datasets is performed under the proposed multi-dimensional evaluation framework. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window Brain-Computer Interface style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness to short-window tasks and channel-constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.

[AI-33] Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

Link: https://arxiv.org/abs/2605.28557
Authors: Oleg Grynets,Dmytro Babarytskyi,Vasyl Lyashkevych
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures, 5 tables, 38 references

Abstract:LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migration remains constrained by high token consumption, long-context degradation, dialect-specific semantic differences, and the risk of semantic drift during query transformation. Direct inclusion of large Oracle SQL/PL-SQL artefacts, schema definitions, procedural logic, and migration instructions into the model context increases cost and may reduce generation quality. This paper frames token optimization as a constrained transformation problem in LLM-based Oracle2PostgreSQL migration. The study formalizes and evaluates twelve token optimization strategies: baseline representation, context pruning, minification, DSL-based semantic compression, metadata augmentation, context refactoring, schema distillation, adaptive routing, AST-based minification, identifier masking, output constraint enforcement, and hybrid optimization. The strategies are evaluated on samples of 10 and 100 Oracle SQL queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency. The results show that mild context pruning preserves semantic quality almost at the baseline level, achieving 89.75% Semantic Match on the 100-query sample compared with 89.80% for the unoptimized baseline. Adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but results in a 44.50-percentage-point decrease in Semantic Match. The findings demonstrate that token optimization cannot be treated as simple prompt shortening; it must be evaluated as a multi-objective migration problem balancing cost, syntactic validity, semantic preservation, and structural fidelity.

[AI-34] A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Link: https://arxiv.org/abs/2605.28556
Authors: Tomer Keren,Nitay Calderon,Asaf Yehudai,Yotam Perlitz,Michal Shmueli-Scheuer,Roi Reichert
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $\tau^c$-Bench, a challenging extension of the three domains of $\tau^2$-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating $\tau^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82-0.94 to 0.28-0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.

[AI-35] Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Link: https://arxiv.org/abs/2605.28553
Authors: Matteo Gioele Collu,Riccardo Conte,Alberto Giaretta,Denis Kleyko,Mauro Conti,Matteo Zavatteri,Roberto Confalonieri
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN’s cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.

[AI-36] Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

Link: https://arxiv.org/abs/2605.28552
Authors: Qingwen Pu,Kun Xie,Hong Yang,Di Yang,Junqing Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 37 pages, 15 figures, 9 tables

Abstract:As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to different vehicle types in safety-critical interactions is essential for the safe deployment of automated driving technologies. This study extracts safety-critical pedestrian-vehicle interactions from the Argoverse 2 dataset to capture real-world crash avoidance behaviors in encounters involving AVs and HDVs. To model vehicle-type-specific pedestrian crash avoidance behavior, we develop a Smooth-Mamba Deep Deterministic Policy Gradient framework, termed SMamba-DDPG, which integrates smooth action constraints with efficient temporal representation learning. To quantify pedestrian behavioral differences, the framework trains separate crash avoidance policies for pedestrian interactions with AVs and HDVs. Results show that SMamba-DDPG outperforms baseline reinforcement learning and supervised learning models in reproducing pedestrian crash avoidance behaviors. Reconstructed trajectories demonstrate strong behavioral realism, accurately reproducing crash avoidance kinematics in both AV and HDV scenarios. Reaction time analysis shows that the model captures human-like response delays and reveals that pedestrians respond more quickly to AVs than to HDVs. Counterfactual analysis further indicates that pedestrians adopt lower crossing speeds when interacting with AVs. Large-scale safety analysis of model-generated data revealed that pedestrian-AV interactions consistently yielded lower conflict rates and higher pedestrian yielding rates compared to pedestrian-HDV interactions. The findings highlight the importance of incorporating vehicle-type-specific pedestrian behavioral models for safer automated driving system design and more realistic traffic simulations in mixed-traffic environments.

[AI-37] Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Link: https://arxiv.org/abs/2605.28532
Authors: Liang Cheng,Mingsheng Cai,Jiuming Jiang,Luo Mai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages

Abstract:Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities required for successful task completion are unavailable. Detecting infeasible tasks and stopping execution early can significantly reduce unnecessary execution cost. In this work, we propose FeasiGen, an automatic pipeline for constructing infeasible agent tasks by identifying the critical tools required for successful task completion. Our approach extracts tool-calling traces from successful executions across multiple agent systems, identifies critical tools consistently shared across diverse execution strategies, and masks these tools to automatically transform solvable tasks into infeasible ones. Human verification confirms that the infeasibility annotations for our constructed tasks achieve over 94% accuracy. We further introduce feasibility-aware evaluation metrics for measuring whether agents can recognize infeasible tasks and stop execution appropriately. Extensive evaluations across nine models reveal substantially weak infeasibility detection ability, with false continue rate reaching up to 73.9%. We further observe that multi-agent architectures significantly reduce erroneous execution under infeasible conditions.

[AI-38] Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

Link: https://arxiv.org/abs/2605.28524
Authors: Zhixing Zuo,Huilin He,Jiasheng Wu,Dawei Cheng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures

Abstract:In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most existing methods rely heavily on rich text attributes, which poses difficulties for this domain due to the lack of textual data. Although some pioneering methods attempt to overcome this, their textualization of graph structures via hard prompts easily leads to feature distortion. Additionally, fraud detection often exhibits multi-relational complexity, and current methods struggle to capture this deep semantic information. To address these challenges, we propose the LLM-GNN Soft Prompt Framework (LGSPF). Specifically, LGSPF bridges the graph structure and semantic space using soft prompts to eliminate reliance on text. We further introduce a parallel Graph Neural Network (GNN) encoder to translate multi-relational topologies into graph tokens for fine-grained LLM fraud comprehension. Through end-to-end optimization, LGSPF enhances deep semantic alignment between LLM and GNN. Experiments across diverse fraud detection benchmarks demonstrate that our method achieves state-of-the-art performance. Moreover, we validate the contribution of LGSPF to enhancing the semantic interpretability of fraud behaviors.

[AI-39] GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

Link: https://arxiv.org/abs/2605.28520
Authors: Yang Zhang,En Chun,Ziyun Mao,Yulu Wu,Jun Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.

[AI-40] Stochastic Gradient Descent with Momentum is Algorithmically Stable

Link: https://arxiv.org/abs/2605.28517
Authors: Yunwen Lei,Zimeng Wang,Xiaoming Yuan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak’s and Nesterov’s momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval [0, 1), and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak’s and Nesterov’s momentum.
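
The paper's generalized SGDM framework is not given in the abstract, so the sketch below only shows the two textbook momentum schemes it is said to cover. It assumes scalar parameters and a hypothetical gradient oracle `grad`; in actual SGDM that oracle would return a gradient computed on a randomly drawn sample or minibatch, whereas the toy usage here is deterministic.

```python
def sgdm(grad, w0, lr=0.1, beta=0.9, steps=300, nesterov=False):
    """Gradient descent with momentum in its two standard forms.

    Polyak (heavy ball): v <- beta * v - lr * grad(w)
    Nesterov:            v <- beta * v - lr * grad(w + beta * v)
    In both cases:       w <- w + v
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("momentum parameter must lie in [0, 1)")
    w, v = w0, 0.0
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point.
        point = w + beta * v if nesterov else w
        v = beta * v - lr * grad(point)
        w = w + v
    return w


# Toy smooth convex objective f(w) = 0.5 * (w - 3)^2, minimized at w = 3.
def toy_grad(w):
    return w - 3.0


w_polyak = sgdm(toy_grad, w0=0.0)
w_nesterov = sgdm(toy_grad, w0=0.0, nesterov=True)
```

Setting `beta=0.0` recovers plain gradient descent, which matches the lower end of the momentum interval covered by the stated bounds.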

[AI-41] Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Link: https://arxiv.org/abs/2605.28515
Authors: Melih Catal,Alex Wolf,Tiago Ferreiro Matos,Pooja Rani,Harald Gall
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider’s own ecosystem over comparable alternatives, potentially constraining developers’ choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce VIBench, a benchmark for measuring VIB in direct and agentic code generation across 20 provider-selectable software-integration scenarios. Evaluating 10 frontier provider-affiliated models against 3 non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to +18.8 percentage points (pp). Agentic workflows further amplify VIB, reaching +39.2 pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as 90.3%. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.

[AI-42] Learning Theory of the SVRG: Generalization and Convergence Analysis

Link: https://arxiv.org/abs/2605.28513
Authors: Yunwen Lei,Zimeng Wang,Xiaoming Yuan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scale optimization problems in machine learning because of their efficiency. Existing theoretical studies of VR methods are mainly focused on the convergence analysis, leaving the generalization behavior largely unexplored. In this paper, we bridge this gap by developing the first non-vacuous generalization analysis of the representative VR method: Stochastic Variance Reduced Gradient (SVRG), through the lens of algorithmic stability. In particular, we establish sharp stability bounds of the SVRG in both convex and strongly convex settings by exploiting its algorithmic structure. The obtained bounds are data-dependent, because the training errors are incorporated along the trajectory. Our analysis clarifies the interplay between optimization and generalization, leading to optimal excess population risk bounds in both settings. Our approach differs substantially from existing analyses of stochastic algorithms in the sense that we decompose the SVRG update as an SGD-like step plus a zero-mean correction term and then introduce novel Lyapunov functions to absorb the additional gradient terms induced by the reference points. Our analytical framework can be generalized to other VR methods, and we demonstrate the generalization by the well-known Stochastic Average Gradient Accelerated (SAGA) method.
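
The decomposition mentioned in the abstract, an SGD-like step plus a zero-mean correction term, can be seen directly in the standard SVRG update. The sketch below is the textbook algorithm on a hypothetical one-dimensional weighted least-squares problem, not the paper's analysis; `svrg`, `toy_grad`, and the data are illustrative assumptions.

```python
import random


def svrg(grad_i, n, w0, lr=0.1, epochs=50, inner_steps=5, seed=0):
    """Textbook SVRG for a finite sum of n component losses.

    grad_i(w, i) returns the gradient of the i-th component at w.
    """
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        snapshot = w  # reference point for this epoch
        full_grad = sum(grad_i(snapshot, i) for i in range(n)) / n
        for _ in range(inner_steps):
            i = rng.randrange(n)
            sgd_part = grad_i(w, i)  # plain stochastic gradient
            # Zero mean over a uniformly random i, so the step stays unbiased.
            correction = full_grad - grad_i(snapshot, i)
            w -= lr * (sgd_part + correction)
    return w


# Toy components f_i(w) = 0.5 * A[i] * (w - X[i])^2.
# The minimizer of their average is sum(A[i] * X[i]) / sum(A) = 3.0.
A = [0.5, 1.0, 1.5, 2.0]
X = [1.0, 2.0, 3.0, 4.0]


def toy_grad(w, i):
    return A[i] * (w - X[i])


w_star = svrg(toy_grad, n=len(A), w0=0.0)
```

As the iterate approaches the snapshot, the stochastic part and the correction cancel more closely, which is the variance reduction the method is named for.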

[AI-43] Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Link: https://arxiv.org/abs/2605.28508
Authors: Aakash Pant,Kavya Shah,Apoorv Agnihotri,Sneha Nikam,Prasaanth Balraj,Nakul Jain
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Aakash Pant, Kavya Shah, and Apoorv Agnihotri contributed equally

Abstract:Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constraints shape usability as much as model quality. Through a structured analysis of existing benchmark families across speech, chat/RAG, and vision systems, we identify critical gaps between laboratory evaluation practices and real-world deployment conditions in low-resource environments. We argue that the meaningful unit of assessment is the deployed system rather than an isolated model and that effective evaluation frameworks must integrate task performance with deployment conditions such as noisy inputs, code-switching, intermittent connectivity, low-end hardware, and domain shift. At the same time, benchmarks should recognize that different application classes require distinct evaluation profiles rather than a single aggregate score that obscures operational differences. To support practical decision-making, we propose a shared reporting framework that preserves comparability across systems and application types while remaining sensitive to deployment context. Finally, we emphasize the need for concise and actionable reporting artifacts for policymakers, donors, and implementers, including standardized one-page benchmark cards, deployment profiles, and explicit documentation of failure handling procedures and human oversight mechanisms.

[AI-44] ProvMind: Provenance-grounded reasoning for materials synthesis

Link: https://arxiv.org/abs/2605.28487
Authors: Yiming Zhang,Ryo Tamura,Koji Tsuda
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Materials process optimization requires reasoning over routes, conditions, tools and causal dependencies, yet most computational formulations flatten synthesis procedures into text or ordered steps. We introduce MatProcBench, a provenance-grounded benchmark constructed from literature-mined MatPROV graphs, to evaluate seven process-reasoning tasks spanning route continuity, step-level variable inference and global causal consistency under both same-split and shift-aware evaluation, including a strict dual-OOD split that combines temporal and material-class shift. We further introduce ProvMind, a process-memory reasoning framework that retrieves analogous training processes, converts them into provenance-aware option-level compatibility scores, and uses a language model for constrained final decision making. ProvMind achieves 52.84% accuracy on the dual-OOD split, outperforming prompting, retrieval-augmented and supervised fine-tuning baselines.

[AI-45] GONDOR to the Rescue: Satisficing Planning with Low Memory

Link: https://arxiv.org/abs/2605.28454
Authors: Yonatan Vernik,Alexander Tuisov,Alexander Shleyfman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such as planning, route finding, navigation, and pathfinding. This is especially true when the memory is tightly constrained, such as planning on edge devices. To alleviate that, we present GONDOR (Greedy Online Navigation with Dynamic Outpost-based Re-search), a memory-efficient extension of GBFS that allows search to continue under strict memory limits by periodically compressing the search tree while retaining a sparse set of anchor states, then upon reaching the goal reconstructs the path by re-searching between the sparse states. We analyze the algorithm and discuss several variants defined by different outpost selection policies. In addition, we explore using Bloom filters for compact duplicate detection in the closed list. Experiments across numeric planning domains and heuristic configurations show that GONDOR consistently improves coverage under low memory budgets compared to standard GBFS. We release the implementation of GONDOR and the Bloom-filter variant to facilitate further research on memory-efficient heuristic search.
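
One ingredient the abstract names, Bloom filters for compact duplicate detection in the closed list, can be sketched on top of plain GBFS. This is a hedged illustration, not GONDOR itself: there is no tree compression or outpost-based re-search here, and the `BloomFilter` class, the grid domain, and all parameters are assumptions made for the example. Because a Bloom filter can report false positives, a state may occasionally be skipped as "already seen", so this variant trades completeness for memory.

```python
import hashlib
import heapq


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        data = repr(item).encode("utf-8")
        for k in range(self.num_hashes):
            digest = hashlib.sha256(bytes([k]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def gbfs(start, goal, neighbors, h):
    """Greedy best-first search ordered by h alone, with a Bloom-filter closed list."""
    closed = BloomFilter()
    closed.add(start)
    parent = {start: None}  # kept in full here for simplicity
    frontier = [(h(start), start)]
    while frontier:
        _, state = heapq.heappop(frontier)
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in neighbors(state):
            if nxt in closed:  # possibly a false positive
                continue
            closed.add(nxt)
            parent[nxt] = state
            heapq.heappush(frontier, (h(nxt), nxt))
    return None


# Illustrative 5x5 grid with two wall segments.
SIZE = 5
WALLS = {(1, 1), (1, 2), (1, 3), (3, 1), (3, 2), (3, 3)}
GOAL = (4, 4)


def neighbors(state):
    x, y = state
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE and (nx, ny) not in WALLS:
            yield (nx, ny)


path = gbfs((0, 0), GOAL, neighbors, lambda s: abs(s[0] - GOAL[0]) + abs(s[1] - GOAL[1]))
```

The full `parent` map retained in this sketch is the kind of search-tree memory that the abstract says GONDOR compresses down to a sparse set of anchor states.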

[AI-46] DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Link: https://arxiv.org/abs/2605.28421
Authors: Caijun Xu,Changyi Xiao,Zhongyuan Peng,Yixin Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures

Abstract:Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.

[AI-47] Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Link: https://arxiv.org/abs/2605.28409
Authors: Mingze Wu,Abhinav Anand,Shweta Verma,Mira Mezini
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, online RL for code generation involves LLM inference and verification of the generated output, which can take considerable time and resources. In this paper, we explore the application of offline RL to code-generating models by leveraging existing code datasets. Our experiments demonstrate that offline RL is an effective training strategy for improving LLM performance. We show that offline RL can be especially beneficial for small LLMs and challenging coding problems.

[AI-48] Measuring Progress Toward AGI: A Cognitive Framework

Link: https://arxiv.org/abs/2605.28405
Authors: Ryan Burnell,Yumeya Yamamori,Orhan Firat,Kate Olszewska,Steph Hughes-Fitt,Oran Kelly,Isaac R. Galatzer-Levy,Meredith Ringel Morris,Allan Dafoe,Alison M. Snyder,Noah D. Goodman,Matthew Botvinick,Shane Legg
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 2 figures

Abstract:Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system’s performance is measured across a suite of targeted, held-out cognitive tasks, generating a ‘cognitive profile’ that can be used to understand a system’s strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.

[AI-49] HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Link: https://arxiv.org/abs/2605.28398
Authors: Yansong Ning,Mianpeng Liu,Jingwen Ye,Weidong Zhang,Hao Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families, prompt-based selection, external routing, and speculative execution, and four training regimes, training-free, SFT, offline and online RL, yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data, code and repository are available at this https URL.

[AI-50] ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Link: https://arxiv.org/abs/2605.28396
Authors: Kun Liang,Chenming Tang,Clive Bai,Weijie Liu,Saiyong Yang,Yunfang Wu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix–full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy–compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.

[AI-51] You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Link: https://arxiv.org/abs/2605.28390
Authors: Xujun Li,Kehan Zheng,Mingyuan Zhao,Yize Geng,Jinfeng Zhou,Qi Zhu,Fei Mi,Lifeng Shang,Minlie Huang,Hongning Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or parametric learning that rely on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of the agent systems in different downstream scenarios, and lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents’ task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at this https URL.

[AI-52] Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Link: https://arxiv.org/abs/2605.28388
Authors: Yue Cheng,Jiajun Zhang,Xiaohui Gao,Weiwei Xing,Zheng Wang,Zhanxing Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 11 figures

Abstract:Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model’s pre-existing capabilities. Beyond the observable responses, we further analyze the model’s internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

[AI-53] CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Link: https://arxiv.org/abs/2605.28387
Authors: Elvin Hajizada,Michael Neumeier,Edward Paxon Frady,Yulia Sandamirskaya,Axel von Arnim,Bing Li,Eyke Hüllermeier
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics applications. For these applications, both on-device processing and learning are essential for privacy and low-latency adaptation. Event cameras address the efficiency of visual sensing with sparse, asynchronous output that is naturally compatible with neuromorphic processing. Yet no prior system has deployed a continual on-device learning pipeline for event-based action recognition using neuromorphic hardware. We present CLANE, Continual Learning of Actions on Neuromorphic Hardware from Event Cameras, deployed end-to-end on Intel Loihi 2. CLANE combines a spiking 2D CNN for spatiotemporal feature extraction with CLP-SNN as its on-chip learning head, extended to action clips via a Temporal Aggregation Layer and a fixed-point Normalization Layer, both novel Loihi 2 modules. On THU E-ACT-50, a 50-class dataset captured under real-world conditions, CLANE achieves 70.4% accuracy in a continual learning task while delivering more than 100x energy reduction and 16x lower latency over a sequential CNN+GRU+CLP edge GPU baseline, validated through iso-algorithm cross-platform benchmarking across three evaluation levels.

[AI-54] From paper to benchmark: agentic framework-based reproduction of under-specified methods in machine health intelligence

Link: https://arxiv.org/abs/2605.28371
Authors: Raffael Theiler,Ludovico Comito,David Leko,Leandro Von Krannichfeldt,Lev Telyatnikov,Olga Fink
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

[AI-55] CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

Link: https://arxiv.org/abs/2605.28369
Authors: Yanhui Sun,Wu Liu,Haifeng Ming,Xinru Wang,Hantao Yao,Yongdong Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: ICML 2026

Abstract:E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E-commerce dispute verdicts require grounding pivotal clues from redundant, multi-round, multimodal evidence and making decisions under flexible platform-specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E-commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real-world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi-agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain-of-Thought decomposes the EDV task into four structured reasoning stages, enabling fine-grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi-round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state-of-the-art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real-world jury voting patterns. Code and dataset are available at this https URL and this https URL.

[AI-56] Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Link: https://arxiv.org/abs/2605.28360
Authors: Jyotirmoy Nath,Neeraj Kumar,Brejesh Lall
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task’s prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

[AI-57] From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Link: https://arxiv.org/abs/2605.28359
Authors: Taojie Zhu,Wentao Zhao,Rui Sun,Beidi Luan,Jiacheng Lu,Sinuo Wang,Jing Li,Daxin Jiang,Yonghong He,Zuo Bai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Trading and Market Microstructure (q-fin.TR)
Comments:

Abstract:Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024–2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents’ cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents. 

[AI-58] Score Based Error Correcting Code Decoder

Link: https://arxiv.org/abs/2605.28358
Authors: Alon Helvits,Eliya Nachmani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: Accepted to ICML 2026

Abstract:Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency-accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.17 dB and a maximum gain of 0.46 dB over the strongest competing baseline. We also show that swapping the solver from Euler to DPM preserves -ln(BER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).

[AI-59] Plan Before Search: Search Agents Need Plan

Link: https://arxiv.org/abs/2605.28354
Authors: Zhipeng Qian,Zihan Liang,Yufei Ma,Ben Chen,Huangyu Dai,Jiayi Ji,Chenyi Lei,Wenwu Ou,Xiaoshuai Sun,Qibin Hou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

[AI-60] Improving Evaluation of Recombination-based Cartesian Genetic Programming

Link: https://arxiv.org/abs/2605.28353
Authors: Duy Long Tran,Anja Jankovic,Marie Anastacio,Holger Hoos,Roman Kalkreuth
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: Accepted for presentation as workshop paper in the graph-based genetic programming workshop (GGP) at the Genetic and Evolutionary Computation Conference (GECCO). To appear in the GECCO’26 conference companion. GECCO’26 will be held July 13-17, 2026 in San Jose, Costa Rica

Abstract:Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombination-based approaches have long been avoided due to an apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination, on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.

[AI-61] FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

Link: https://arxiv.org/abs/2605.28347
Authors: Xucong Wang,Pengkun Wang,Zhe Zhao,Liheng Yu,Shuang Wang,Yang Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, including 11 pages of main text and 5 pages of appendix; Accepted by CVPR2026

Abstract:Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.

[AI-62] Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Link: https://arxiv.org/abs/2605.28345
Authors: Lev Telyatnikov,Raffael Theiler,Leandro Von Krannichfeldt,Olga Fink
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair and reproducible evaluation in PHM.

[AI-63] SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Link: https://arxiv.org/abs/2605.28338
Authors: Chao Ding,Mouxiao Bian,Tianbin Li,Minjia Yuan,Yidong Jiang,Yankai Jiang,Jinru Ding,Jiayuan Chen,Zhuangzhi Gao,Pengcheng Chen,Zhao He,Rongzhao Zhang,Meiling Liu,Luyi Jiang,Jie Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to adversarial misuse. Here we present SafeMed-R1, trained with a traceable Clinical Trust Signals (CTS) pipeline that links each reasoning instance to clinician rubric scores and edit histories, and aligned through safety and ethics supervision and red team stress testing. SafeMed-R1 attains a macro-averaged accuracy of 79.6% across clinical benchmarks. Under adversarial safety testing, it shows the lowest aggregated risk and reduces unsafe outputs by about 3 to 5% relative to its baseline. In a paired expert study of 30 medication safety vignettes, SafeMed-R1 matches PGY1 and PGY2 residents on medical correctness and scores higher for medication safety, guideline consistency, and clinical usefulness. Collectively, these results suggest that clinician-audited supervision provenance, together with domain-tailored safety and ethics alignment, can strengthen governance-relevant evidence without relying on inference-time retrieval or citation grounding.

[AI-64] An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

Link: https://arxiv.org/abs/2605.28337
Authors: Ida Gjergji,Lucas Kletzander,Nysret Musliu,Andrea Schaerf
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been introduced in the literature. This problem captures the situation where given pairs of customers cannot be served by the same facility. Such a feature is crucial for many practical cases of location problems, such as the presence of hazardous or polluting materials and contention between competing customers. In this paper, we propose a Large Neighborhood Search (LNS) method to solve this problem. Within the framework of LNS, we introduce three different destroy operators, which are combined in a hybrid manner, and we use an exact solver in the repair phase. Different algorithmic components are investigated for the design of LNS. The experimental analysis shows that our new method outperforms existing state-of-the-art metaheuristics, providing new best solutions for all available benchmark instances.

[AI-65] Learning the Error Patterns of Language Models

Link: https://arxiv.org/abs/2605.28328
Authors: Jinwoo Kim,Taylor Berg-Kirkpatrick,Loris D’Antoni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number of focused ways: for example, by using Python function names when generating TypeScript. We observe that these error patterns can be represented using a small number of constraints that can be learned in practice. We propose prefix filters, per-domain-and-LLM symbolic functions, as objects that capture these error patterns, and Palla, an algorithm that learns prefix filters efficiently in practice, which we also implement. Prefix filters learned by Palla i) help us quantitatively analyze the error patterns of LLMs, and ii) can be used to constrain the outputs of a model via constrained sampling algorithms. For example, Palla boosts compile rates for Qwen2.5-1.5B on TypeScript generation by over 60%, allowing Qwen2.5-1.5B to achieve similar performance to unconstrained Llama3.1-8B.

[AI-66] Multi-Agent LLM-based Metamorphic Testing for REST APIs

Link: https://arxiv.org/abs/2605.28321
Authors: Shehroz Khan,Abdullah Mughees,Gaadha Sudheerbabu,Tanwir Ahmad,Dragos Truscan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Author-submitted version accepted for publication at the IEEE Conference on Computers, Software, and Applications (COMPSAC2026), July 7-11, 2026, Madrid, Spain

Abstract:As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
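To make the idea of a metamorphic relation concrete, here is a minimal sketch, not taken from ARMeta. It assumes a hypothetical listing endpoint, simulated by the in-memory function `list_items`, and checks one relation in Given-When-Then form: filtering a listing can only return a subset of the unfiltered listing. No expected output for any single call is needed, which is how metamorphic testing sidesteps the test oracle problem.

```python
def list_items(catalog, min_price=None):
    """Hypothetical stand-in for a REST call such as GET /items?min_price=..."""
    if min_price is None:
        return list(catalog)
    return [item for item in catalog if item["price"] >= min_price]


def filter_subset_relation_holds(catalog, min_price):
    # Given a catalog served by the API
    # When the same listing is requested with and without a price filter
    source = list_items(catalog)
    follow_up = list_items(catalog, min_price=min_price)
    # Then the filtered response must be a subset of the unfiltered one,
    # and every returned item must satisfy the filter.
    return (all(item in source for item in follow_up)
            and all(item["price"] >= min_price for item in follow_up))


catalog = [{"id": 1, "price": 5}, {"id": 2, "price": 20}]
print(filter_subset_relation_holds(catalog, min_price=10))
```

Against a real service, `list_items` would be replaced by an HTTP client call; the relation check itself would stay the same.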

[AI-67] Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

Link: https://arxiv.org/abs/2605.28320
Authors: Mazen Alamir,Sacha Clavel
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively large number of raw features. The algorithm leverages a recently proposed identification algorithm that yields parsimonious implicit relationships, from which a normality characterization can be derived in the context of anomaly detection and localization. The algorithm proposed in this paper goes a step further by deriving explicit piece-wise representations that are built using the set of polynomials involved in the implicit representations. The framework is illustrated on the problem of identifying parsimonious explicit representations of the inverse model of a 6-axis manipulator robot. Further experiments on a 4-axis robot investigate the generalization capability of parsimonious models compared to state-of-the-art DNN structures when models face unseen contexts of use.

[AI-68] Hybrid Neural World Models

Link: https://arxiv.org/abs/2605.28317
Authors: Pranav Lakshmanan,Paras Chopra
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments: Preprint. Under review

Abstract:Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi-horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per-trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label-free baselines including deep ensembles, learned error heads, gradient-magnitude indicators, and locally-adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing-equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same-hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference-solver fallback, deferring uncertain trajectories and roughly halving the surrogate’s residual error at the default operating point. The recipe applies without modification across reaction-diffusion, compressible Euler, and rigid-body collision dynamics.

[AI-69] From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Link: https://arxiv.org/abs/2605.28303
Authors: Shuaike Li,Kai Zhang,Xianquan Wang,Jiachen Liu,Shengpeng Mo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance – a pathology where un-evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero-distortion proxy yielding a catastrophic 95.6% self-refutation rate. Given the causally driven nature of real-world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On-policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on-policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA-3.1 and Qwen-2.5 show CODE drastically suppresses self-refutation to 1.8% while securing robust multi-hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at this https URL.

[AI-70] How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Link: https://arxiv.org/abs/2605.28302
Authors: Hanjiang Wu,Abhimanyu Rajeshkumar Bambhaniya,Sarbartha Banerjee,Tuhin Khare,Sudarshan Srinivasan,Suvinay Subramanian,Souvik Kundu,Madhu Kumar,Midhilesh Elavazhagan,William Won,Amir Yazdanbakhsh,Tushar Krishna
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.

[AI-71] Better Accuracies Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Link: https://arxiv.org/abs/2605.28301
Authors: Zhaoyang Jiang,Xuanqi Peng,Fei Teng,Zhizhong Fu,Yunsoo Kim,Jiacong Mi,Zicheng Li,Honghan Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher’s reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before–after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
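For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of the common equal-width-bin form of expected calibration error (ECE). The function name and the choice of 10 bins are assumptions for illustration; the abstract does not state which binning the paper uses.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bins are half-open (lo, hi]; a confidence of exactly 0 joins the first bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the bin's share of samples
    return ece
```

A lower value means stated confidence tracks empirical accuracy more closely. As the abstract points out, this is an answer-level metric and says nothing about whether individual reasoning steps are factual.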

[AI-72] REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

Link: https://arxiv.org/abs/2605.28298
Authors: Ruohan Lei,Jianxin Gao,Wanli Peng,Huimin Pei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In real-world scenarios of linguistic steganalysis, tested texts usually come from unseen domains with different vocabularies, topics, writing styles, and steganographic generation patterns, which can significantly degrade the detection performance. Although existing cross-domain steganalysis methods can effectively alleviate this problem through distribution alignment, domain-invariant feature learning, etc., the detection performance is not satisfactory. In this paper, we propose a post-training representation editing method for cross-domain linguistic steganalysis. Specifically, the detector is first trained on source-domain data, and then the feature extractor and classifier are kept frozen, and the intermediate representations are deterministically edited before classification. For domain adaptation, we construct a domain-offset vector from marginal source and target representations. For domain generalization, we derive a source-domain cover-to-stego direction to guide sample-specific editing. Experimental results show that compared with the advanced methods, the proposed method can achieve high cross-domain detection performance, especially in terms of F1-score, while requiring no architecture modification or parameter updates after source-domain training.
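One simple reading of a "domain-offset vector from marginal source and target representations" is a difference of feature means, applied to target features before the frozen classifier. The sketch below illustrates that reading only. The mean-difference form, the function names, and the `strength` parameter are assumptions, not the REED method as published.

```python
import numpy as np


def domain_offset(source_feats, target_feats):
    """Assumed form: mean source representation minus mean target representation."""
    return np.mean(source_feats, axis=0) - np.mean(target_feats, axis=0)


def edit_target(target_feats, offset, strength=1.0):
    """Shift target representations toward the source domain; no weights change."""
    return np.asarray(target_feats, dtype=float) + strength * offset


source = np.array([[1.0, 1.0], [3.0, 3.0]])
target = np.array([[0.0, 0.0], [2.0, 0.0]])
edited = edit_target(target, domain_offset(source, target))
```

With `strength=1.0` the edited target features have the same mean as the source features, while the feature extractor and classifier stay untouched.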

[AI-73] ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation ICML2025

Link: https://arxiv.org/abs/2605.28293
Authors: Hongru Hou,Tiehua Mei,Denghui Geng,Jinhui Huang,Ao Xu,Hengrui Chen,Jiaqing Liang,Deqing Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025

Abstract:Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at this https URL.

[AI-74] ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

Link: https://arxiv.org/abs/2605.28282
Authors: Yihan Xia,Taotao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages, 4 figures, 6 tables; technical report

Abstract:AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0–V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

[AI-75] Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Link: https://arxiv.org/abs/2605.28277
Authors: Zhikai Pan,Chih-Ting Liao,Chunrui Liu,Xi Xiao,Yitong Qiao,Chunlei Meng,Zhangquan Chen,Xin Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

[AI-76] Global Policy-Space Response Oracles for Two-Player Zero-Sum Games ICML2026

Link: https://arxiv.org/abs/2605.28273
Authors: Junyu Zhang,Feihong Yang,Jian Wang,Chao Wang,Xudong Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration–selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.

[AI-77] Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Link: https://arxiv.org/abs/2605.28264
Authors: Mattia J. Villani,Pranav Deshpande,Akshay Seshadri,Romina Yalovetzky,Niraj Kumar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky–Kiefer–Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
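The raw ingredients of such a score can be sketched as follows: per-token Shannon entropy computed from logits, summarized by its mean and maximum. This is an illustration only. The function names are hypothetical, and the calibrated reference CDF that CES uses to combine the two signals is not reproduced here.

```python
import numpy as np


def token_entropies(logits):
    """Shannon entropy (in nats) of the next-token distribution at each position."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


def entropy_summary(logits):
    """Mean and maximum token entropy over one generated sequence."""
    h = token_entropies(logits)
    return {"mean": float(h.mean()), "max": float(h.max())}
```

The mean corresponds to length-normalised entropy, while the maximum reacts to a single highly uncertain token, which is one way the tail of the distribution can carry signal the mean misses.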

[AI-78] IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

Link: https://arxiv.org/abs/2605.28247
Authors: Yuhan Li,Mingxu Zhang,Dazhong Shen,Ying Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 3 figures, 18 tables

Abstract:Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
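Greedy log-determinant maximization in general can be sketched as below: at each step, add the candidate that most increases the log-determinant of a regularized Gram matrix over the selected rows, which favors diverse, non-redundant items. The linear kernel, the `ridge` term, and the function name are assumptions; the verifier coupling and SAE features described in the abstract are not modeled.

```python
import numpy as np


def greedy_logdet_select(features, k, ridge=1.0):
    """Greedily pick up to k rows maximizing logdet(X_S X_S^T + ridge * I)."""
    X = np.asarray(features, dtype=float)
    n = X.shape[0]
    selected = []
    for _ in range(min(k, n)):
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            S = X[selected + [i]]
            gram = S @ S.T + ridge * np.eye(len(S))
            # slogdet avoids overflow; the matrix is positive definite, so sign is +1.
            _, val = np.linalg.slogdet(gram)
            if val > best_val:  # strict comparison keeps the lowest index on ties
                best_i, best_val = i, val
        selected.append(best_i)
    return selected
```

On the rows `[[1, 0], [1, 0], [0, 1]]` with `k=2`, the second pick skips the duplicate row in favor of the orthogonal one, which is the coverage behavior this objective is meant to encourage. This naive version recomputes the determinant for every candidate and is meant for clarity, not speed.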

[AI-79] PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Link: https://arxiv.org/abs/2605.28232
Authors: Shadmehr Zaregarizi,Khashayar Yavari
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: N pages, 4 figures, 3 tables. Accepted at the 2nd Workshop on AI-Driven Energy Efficiency in Dynamic Systems (AI-DEEDS '26), co-located with ACM e-Energy / ACM Sustainability Week, Banff, AB, Canada, June 22-25, 2026

Abstract:Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs – particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.

[AI-80] When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Link: https://arxiv.org/abs/2605.28224
Authors: Xinzhe Li,Yaguang Tao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: More evaluation and analysis are on the way

Abstract:Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes – the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content – and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings’ outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

[AI-81] Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

Link: https://arxiv.org/abs/2605.28213
Authors: Shuoming Zhang,Qiuchu Yu,Yangyu Zhang,Ruiyuan Xu,Xiyu Shi,Guangli Li,Xiaobing Feng,Huimin Cui,Jiacheng Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint, under review

Abstract:LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing “when” knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

[AI-82] Plant Persist Trigger: Sleeper Attack on Large Language Model Agents

Link: https://arxiv.org/abs/2605.28201
Authors: Yongxiang Li,Moxin Li,Zhixin Ma,Fengbin Zhu,Dongrui Liu,Wenjie Wang,Fuli Feng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at this https URL.

[AI-83] Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Link: https://arxiv.org/abs/2605.28192
Authors: Ke Xu,Yuhao Wang,Ziyang Cheng,Hongcheng Liu,Yanfeng Wang,Yu Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

[AI-84] Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

Link: https://arxiv.org/abs/2605.28186
Authors: Daisuke Yasui,Toshitaka Matuki,Hiroshi Sato
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments – Ant-v5, HalfCheetah-v5, and Walker2D-v5 – we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
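
The feature extension described in the abstract can be illustrated with a minimal sketch. It assumes a trajectory stored as parallel lists of states and actions, concatenates each step with its successor, and measures how often a cluster label sequence stays in the same cluster. The function names and the toy numbers are hypothetical, and the paper's actual cluster-count criterion is not reproduced here; the self-transition rate is only the kind of quantity such a criterion could monitor.

```python
from typing import List, Sequence


def augment(states: Sequence[Sequence[float]],
            actions: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build [s_t, a_t, s_{t+1}, a_{t+1}] for every step that has a successor."""
    features = []
    for t in range(len(states) - 1):
        features.append(
            list(states[t]) + list(actions[t])
            + list(states[t + 1]) + list(actions[t + 1])
        )
    return features


def self_transition_rate(labels: Sequence[int]) -> float:
    """Fraction of consecutive steps whose cluster label does not change."""
    if len(labels) < 2:
        return 0.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)


# Toy trajectory: 2-dimensional states, 1-dimensional actions (made-up values).
states = [[0.0, 1.0], [0.5, 0.8], [1.0, 0.2], [0.4, 0.9]]
actions = [[0.1], [-0.3], [0.7], [0.0]]
features = augment(states, actions)

# Two hypothetical labelings of a six-step trajectory.
labels_coarse = [0, 0, 0, 1, 1, 1]   # mostly stays in the same cluster
labels_fine = [0, 1, 0, 1, 0, 1]     # switches cluster at every step

print(len(features), len(features[0]))
print(self_transition_rate(labels_coarse), self_transition_rate(labels_fine))
```

A labeling with a high self-transition rate says little about how phases follow one another, which is one intuition for preferring cluster counts that keep this rate low.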

[AI-85] Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

Link: https://arxiv.org/abs/2605.28170
Authors: Seongjun Lee,Suwan Yoon,Changhee Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code is available at this https URL

Abstract:As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model’s lack of knowledge or from ambiguity in the user’s input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
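
To make the Shapley decomposition concrete, here is a minimal sketch of exact Shapley values over two ambiguous spans. The value of a coalition is taken to be the entropy reduction obtained by clarifying those spans; the span names and the reduction table (in bits) are invented for illustration and are not taken from the paper. The example only demonstrates the efficiency property mentioned in the abstract: the attributions sum to the total input-induced uncertainty.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(players: List[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical entropy reductions (bits) from clarifying each set of spans.
reduction = {
    frozenset(): 0.0,
    frozenset({"it"}): 0.5,
    frozenset({"bank"}): 0.9,
    frozenset({"it", "bank"}): 1.2,
}

phi = shapley_values(["it", "bank"], lambda s: reduction[frozenset(s)])
print(phi)
print(sum(phi.values()), reduction[frozenset({"it", "bank"})])
```

Exact enumeration costs time exponential in the number of spans, so a practical system would need sampling or another approximation for longer inputs.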

[AI-86] OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings

Link: https://arxiv.org/abs/2605.28168
Authors: Shadmehr Zaregarizi,Khashayar Yavari
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures. Accepted at OccuSys 2026, co-located with ACM Sustainability Week 2026. Preprint version

Abstract:Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights (rather than performing per-step inference) across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

[AI-87] QuITE: Query-Based Irregular Time Series Embedding ICML2026

Link: https://arxiv.org/abs/2605.28166
Authors: JungHoon Lim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2026

Abstract:Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real-world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures. Code is available at: this https URL.
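
The idea of aggregating a variable number of irregular observations into a fixed number of latent vectors can be sketched with plain dot-product attention. This is one plausible reading of "query tokens", not the QuITE module itself: the token layout (value plus sinusoidal time features), the absence of learned projections, and the hand-picked query vectors are all assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def observation_token(timestamp: float, value: float) -> Vector:
    """Hypothetical token: the observed value plus two features of its timestamp."""
    return [value, math.sin(timestamp), math.cos(timestamp)]


def query_pool(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Each query attends over all tokens and returns one pooled vector."""
    dim = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, tok)) / math.sqrt(dim)
                  for tok in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * tok[j] for w, tok in zip(weights, tokens))
                       for j in range(dim)])
    return pooled


# Stand-ins for learned query parameters (a real module would train these).
queries = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

# Two series with different lengths and uneven sampling times.
short_series = [observation_token(t, v)
                for t, v in [(0.0, 1.0), (0.7, 2.0), (2.5, 4.0)]]
long_series = [observation_token(t, v)
               for t, v in [(0.1, 0.5), (0.2, 0.6), (1.9, 0.4), (3.0, 0.9), (3.1, 1.0)]]

short_embedding = query_pool(queries, short_series)
long_embedding = query_pool(queries, long_series)
print(len(short_embedding), len(long_embedding))
```

Both series yield the same output shape (one vector per query), which is what lets a backbone built for regularly sampled input consume them without interpolation onto a grid.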

[AI-88] Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization

Link: https://arxiv.org/abs/2605.28164
Authors: Helena Stegherr,Michael Heider,Nils Meyer,Tobias Thummerer,Thomas Wendler,Pierre Aublin,Ennio Idrobo-Àvila,Lars Mikelsons,Sebastian Zaunseder,Jörg Hähner
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

[AI-89] Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning ICML2026

Link: https://arxiv.org/abs/2605.28160
Authors: Yang Zhang,Xiaoshuai Sun,Rui Zhao,Wujin Sun,Yidong Chen,Jiayi Ji,Qian Chen,Rongrong Ji
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

[AI-90] OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Link: https://arxiv.org/abs/2605.28158
Authors: Chenyu Zhou,Xinyun Lu,Jiangyue Zhao,Jianghao Lin,Dongdong Ge,Yinyu Ye
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 34 pages, 8 figures

Abstract:Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

[AI-91] DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers

Link: https://arxiv.org/abs/2605.28148
Authors: Aditya Pujara,Xiaogang Zhu,Hsiang-Ting Chen
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents interact with APIs through deterministic and structured methods \cite{ModelContextProtocolIntro2025}. While some existing systems like AutoMCP attempt to automate a previously completely manual process of generating MCP servers, they fail to address the recurring challenge of maintaining synchronization between evolving enterprise-level APIs and their corresponding MCP toolset implementation \cite{mastouri2025makingrestapisagentready}. This paper introduces DeltaMCP, a specification-aware, incremental regeneration tool for enterprise-grade MCP servers. DeltaMCP enables developers to update only the affected tooling of MCP servers, given a new release of its corresponding service’s OpenAPI specification. Using Azure REST API specifications as the evaluation dataset, DeltaMCP is benchmarked against baseline full generation methods on generation quality and system performance. The results demonstrate the reduction in developer overhead through DeltaMCP whilst improving maintainability and version consistency. This research offers a scalable approach for enterprises seeking to maintain high-fidelity, up-to-date MCP server infrastructures for LLM-based systems.
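
The core of incremental regeneration is knowing which operations changed between two releases of a specification. The sketch below compares the `paths` sections of two OpenAPI documents that have already been parsed into dictionaries, and reports added, removed, and changed operations. It is an illustrative diff, not DeltaMCP's algorithm; the function names and the two tiny specs are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# HTTP method keys that may appear under an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

Operation = Tuple[str, str]  # (METHOD, path)


def operations(spec: Dict[str, Any]) -> Dict[Operation, Any]:
    """Flatten spec['paths'] into {(METHOD, path): operation object}."""
    ops = {}
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method in HTTP_METHODS:  # skip keys such as 'parameters'
                ops[(method.upper(), path)] = op
    return ops


def diff_specs(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[Operation]]:
    """Classify operations as added, removed, or changed between two specs."""
    old_ops, new_ops = operations(old), operations(new)
    shared = set(old_ops) & set(new_ops)
    return {
        "added": sorted(set(new_ops) - set(old_ops)),
        "removed": sorted(set(old_ops) - set(new_ops)),
        "changed": sorted(k for k in shared if old_ops[k] != new_ops[k]),
    }


# Two made-up releases of the same service specification.
spec_v1 = {"paths": {
    "/vms": {"get": {"operationId": "listVms"}},
    "/vms/{id}": {"get": {"operationId": "getVm"},
                  "delete": {"operationId": "deleteVm"}},
}}
spec_v2 = {"paths": {
    "/vms": {"get": {"operationId": "listVms", "parameters": [{"name": "top"}]}},
    "/vms/{id}": {"get": {"operationId": "getVm"}},
    "/disks": {"post": {"operationId": "createDisk"}},
}}

delta = diff_specs(spec_v1, spec_v2)
print(delta)
```

Only the operations listed in the result would need their tools regenerated or retired; unchanged operations can keep their existing tool definitions. A fuller comparison would also resolve `$ref` pointers into shared schemas before comparing operations.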

[AI-92] Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

Link: https://arxiv.org/abs/2605.28145
Authors: Shadmehr Zaregarizi,Khashayar Yavari
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 2 figures

Abstract:We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.

[AI-93] Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

Link: https://arxiv.org/abs/2605.28144
Authors: Yi Wang,Haojie Lu,Zhaofan Zhang,Li Chen,Sihong Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages

Abstract:LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance its planning capability, we propose the MCTS-Guided Group Relative Policy Optimization (M-GRPO), where we reformulate the UCT formula by incorporating the LLM’s prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
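
For readers unfamiliar with how a prior probability can enter tree-search selection, the sketch below contrasts the standard UCT score with the well-known prior-weighted PUCT variant used in AlphaZero-style search. It is background only: the paper's reformulation also involves epistemic uncertainty and is not reproduced here, and the node statistics below are made up.

```python
import math


def uct(q: float, n_parent: int, n_child: int, c: float = 1.4) -> float:
    """Standard UCT: mean value plus a visit-count exploration bonus."""
    if n_child == 0:
        return math.inf  # unvisited children are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def puct(q: float, prior: float, n_parent: int, n_child: int, c: float = 1.4) -> float:
    """PUCT: the exploration bonus is scaled by a prior probability for the action."""
    return q + c * prior * math.sqrt(n_parent) / (1 + n_child)


# Hypothetical children of one node: (mean value, prior probability, visit count).
children = {
    "left": (0.2, 0.7, 1),
    "right": (0.3, 0.1, 1),
}
n_parent = sum(n for _, _, n in children.values())

best_uct = max(children, key=lambda a: uct(children[a][0], n_parent, children[a][2]))
best_puct = max(children, key=lambda a: puct(children[a][0], children[a][1],
                                             n_parent, children[a][2]))
print(best_uct, best_puct)
```

With equal visit counts, UCT prefers the child with the higher mean value, while PUCT can prefer a lower-valued child that the prior considers more promising. This is how a language model's predictive probabilities can steer exploration.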

[AI-94] Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Link: https://arxiv.org/abs/2605.28139
Authors: Yu Lin,Yiming Wang,Runyuan Cai,Xiaodong Zeng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

[AI-95] Do Clinical Models Change Treatment Decisions?

Link: https://arxiv.org/abs/2605.28129
Authors: Dongkyu Cho,Miao Zhang,Rumi Chunara
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures

Abstract:Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

[AI-96] Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

Link: https://arxiv.org/abs/2605.28124
Authors: Idris Tatachak(CREATIS),Luis Kabongo,Nicolas Papadakis(MONC, IMB),Xavier Ripoche,Simon Rit(CREATIS)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: CT Meeting 2026 - 9th International Conference on Image Formation in X-Ray Computed Tomography, Jun 2026, Salt Lake City, United States

Abstract:The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formulation and develop a data-based prior. To this end, we simulate fan-beam acquisitions and add photon noise to the projection data. The prior is obtained by training a gradient-step denoiser using reconstructed simulated acquisitions. The trained model is integrated into a plug-and-play gradient-step algorithm to reconstruct images from simulated projections. Experiments on synthetic data demonstrate the denoising capabilities of the trained model, while qualitative evaluations on real images showcase the algorithm’s performance and generalization ability.

[AI-97] CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Link: https://arxiv.org/abs/2605.28115
Authors: Fengze Yang,Bo Yu,Xuewen Luo,Cathy Liu,Chenxi Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures, 2 tables, conference

Abstract:Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.
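
The link between sequence length and KV-cache memory follows from standard transformer arithmetic: the cache stores one key and one value vector per token, per layer, per key-value head. The sketch below applies that formula to a hypothetical model configuration; the layer count, head count, head dimension, and token counts are illustrative and are not taken from Qwen3-VL or from the paper.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for seq_len tokens (fp16 by default)."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    return keys_and_values * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical decoder configuration.
config = dict(layers=32, kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(seq_len=3000, **config)  # uncompressed sequence
compact = kv_cache_bytes(seq_len=1000, **config)   # sequence cut to one third

print(baseline / 2**20, "MiB")
print(compact / 2**20, "MiB")
```

Because the cache grows linearly with the number of cached tokens, keeping the sequence compact all the way into the decoder reduces memory in direct proportion, whereas pruning that is undone before prefill would not.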

[AI-98] Human-like in-group bias in instruction-tuned language model agents

Link: https://arxiv.org/abs/2605.28114
Authors: Messi H.J. Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 6 figures

Abstract:As autonomous AI agents are deployed in persistent, interacting networks – coordinating tasks, routing resources, and accumulating reputational histories – the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity – all absent when labels were hidden – a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) – illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
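
The abstract reports Benjamini-Hochberg-corrected p-values across six models. As a reminder of what that correction does, here is a minimal sketch of the standard BH step-up adjustment in plain Python. The raw p-values are invented for illustration and have nothing to do with the paper's data.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values from four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

A test is declared significant at false discovery rate q when its adjusted p-value is at most q, so reporting that all corrected values fall below 0.001 means every model passes at that level.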

[AI-99] Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

Link: https://arxiv.org/abs/2605.28104
Authors: Yaoyang Luo,Zhi Zheng,Ziwei Zhao,Tong Xu,Zhao Jielun,Wenjun Xue,Yong Chen,Enhong Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at this https URL.

[AI-100] Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Link: https://arxiv.org/abs/2605.28102
Authors: Chen Ying Claude,Zhihan Luo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement – patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor’s patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report – while epistemically complex – provides irreplaceable observational data about training’s phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

[AI-101] EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

Link: https://arxiv.org/abs/2605.28101
Authors: Chong Jing,Zitong Lan,Junan Zhang,Zhizheng Wu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Code available on this https URL

Abstract:Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. Meanwhile, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves both state-of-the-art performance in few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available on this https URL.

[AI-102] Examining Agents Bias Amplification versus Suppression in Multi-Agent Systems

Link: https://arxiv.org/abs/2605.28098
Authors: Zejian Eric Wu,Zhongyi Jiang,Yuan Zhuang,Paul Jen-Hwa Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents’ biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

[AI-103] BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

Link: https://arxiv.org/abs/2605.28089
Authors: Jeyeon Eo,Joo Young Kim,Ran Ju,Minyoung Jung,Unggi Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 30 pages, 4 figures

Abstract:BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

[AI-104] Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy ICML2026

Link: https://arxiv.org/abs/2605.28078
Authors: Huikang Liu,Aras Selvi,Wolfram Wiesemann
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ICML 2026 style: 9 main pages followed by acknowledgements, references, appendices

Abstract:We design a class of additive noise mechanisms that satisfy $(\varepsilon, \delta)$-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call mixture mechanisms, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for $(\varepsilon, \delta)$-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes ($l_1$-loss) and variances ($l_2$-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
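
As a reading aid, the noise distribution described in this abstract can be written as a Gaussian mixture with a shared variance. The symbols $w_k$, $\mu_k$, $\sigma$, and $K$ are notation assumed here for illustration and are not taken from the paper:

```latex
p(z) \;=\; \sum_{k=0}^{K} w_k \,\mathcal{N}\!\left(z;\, \mu_k,\, \sigma^2\right),
\qquad w_k \ge 0, \quad \sum_{k=0}^{K} w_k = 1, \quad \mu_0 = 0 .
```

Under this notation, the $k = 0$ component is the zero-mean Gaussian of the analytic Gaussian mechanism, and the remaining means $\mu_k$ depend on the query sensitivity. How $\sigma$ and the weights are calibrated to meet the privacy guarantee is specified in the paper and is not reproduced here.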

[AI-105] MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

Link: https://arxiv.org/abs/2605.28077
Authors: Chuang Tang,Chenhao Lin,Yin Xu,Hao Wang,Jinrui Zhou,Xin Li,Mingjun Xiao,Enhong Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. Code is available at this https URL

Abstract:Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

[AI-106] Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

Link: https://arxiv.org/abs/2605.28070
Authors: Renjie Gu,Jiaxu Li,Yihao Wang,Yun Yue,Hansong Xiao,Yefei Chen,Yuan Wang,Chunxiao Guo,Pei Wei,Jinjie Gu,Yixin Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
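
The control decision described in this abstract can be sketched as plain control flow. In the paper the judgment is made by the trained model itself; the external `judge` and `solve` callables, the `ABSTAIN` sentinel, and the reading of A@D as "share of detected-insufficient cases that end in abstention" are all assumptions made for this illustration:

```python
from typing import Callable

ABSTAIN = "ABSTAIN"  # hypothetical sentinel for an early-terminated trajectory


def judge_then_solve(
    question: str,
    judge: Callable[[str], bool],
    solve: Callable[[str], str],
) -> str:
    """Commit to an answerability judgment first, then solve or stop early."""
    if not judge(question):
        return ABSTAIN  # terminate before any solution is generated
    return solve(question)


def abstention_at_detection(records) -> float:
    """Share of detected-insufficient cases whose final output is an abstention.

    Each record is (detected_insufficient, final_answer). This definition is
    an assumption based on the abstract's wording, not the paper's formula.
    """
    detected = [answer for flagged, answer in records if flagged]
    if not detected:
        return 0.0
    return sum(answer == ABSTAIN for answer in detected) / len(detected)


# Toy stand-ins: a question counts as answerable only if it states a speed.
toy_judge = lambda q: "speed" in q
toy_solve = lambda q: "2 hours"
```

Treating abstention as a branch taken before solving, rather than as a style of final answer, is what lets an unanswerable trajectory end without spending further reasoning tokens.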

[AI-107] ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Link: https://arxiv.org/abs/2605.28069
Authors: Zhexin Hu,Li Wang,Xiaohan Wang,Jiajun Chai,Xiaojun Guo,Wei Lin,Guojun Yin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL’s superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

[AI-108] BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models CVPR2026

Link: https://arxiv.org/abs/2605.28067
Authors: Fei Deng,Yanwu Xu,Zhipeng Bao,Zhixing Zhang,Haolin Jia,Karthik Raveendran,Jianing Wei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR 2026 EDGE Workshop

Abstract:The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

[AI-109] Verifiable Benchmarking of Long-Horizon Spatial Biology

Link: https://arxiv.org/abs/2605.28065
Authors: Ian Diks,Harihara Muralidharan,Tim Proctor,Kenny Workman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

[AI-110] Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Link: https://arxiv.org/abs/2605.28063
Authors: Yuyue Wang,Xihua Wang,Xin Cheng,Yijing Chen,Ruihua Song
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments:

Abstract:Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

[AI-111] On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective ICML2026

Link: https://arxiv.org/abs/2605.28057
Authors: Zhi Zhou,Ming Yang,Shi-Yu Tian,Kun-Yang Yu,Lan-Zhe Guo,Yu-Feng Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce $(\epsilon,\delta)$-Recovery Complexity and $(\epsilon,\rho)$-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity-information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.

[AI-112] Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

Link: https://arxiv.org/abs/2605.28044
Authors: Pin Qian,Su Wang,Xiaoyuan Wang,Yihang Chen,Wenxuan Xu,Qiaolin Yu,Shuhuai Lin,Sipeng Zhang,Junxian You,Xinpeng Wei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8–36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
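
The monotonicity violation rate (MVR) mentioned in this abstract can be illustrated with a short sketch. The function name, the input layout, and the choice to count ties as violations are assumptions made here; the benchmark's released pipeline may define these details differently:

```python
def monotonicity_violation_rate(pairs):
    """Fraction of pairs where the calibrated claim is not scored strictly higher.

    `pairs` holds (score_calibrated, score_force_raised) tuples produced by
    some evaluator. Counting ties as violations is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    violations = sum(1 for calibrated, raised in pairs if calibrated <= raised)
    return violations / len(pairs)


# Toy scores: the second pair is inverted and the third is a tie.
toy_scores = [(0.9, 0.4), (0.6, 0.7), (0.5, 0.5), (0.8, 0.2)]
print(monotonicity_violation_rate(toy_scores))  # 0.5
```

Because each pair holds the cited passage fixed, a violation isolates the evaluator's insensitivity to claim strength rather than to retrieval quality.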

[AI-113] MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Link: https://arxiv.org/abs/2605.28035
Authors: Haitian Li,Yanghao Zhou,Heyan Huang,Liangji Chen,YiMing Cheng,Xu Liu,Dian Jin,Jiajun Xu,Jingyun Liao,Tian Lan,Ziqin Zhou,Yueying Liu,Yu Bai,Changsen Yuan,Jinxing Zhou,Xian-Ling Mao,Xuefeng Chen,Yousheng Feng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

[AI-114] Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Link: https://arxiv.org/abs/2605.28034
Authors: Stanislav Kirdey,Clark Labs Inc
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: First Autoresearch publication. Code available at this https URL. GPT-5.5 Pro was used for drafting and editing assistance

Abstract:Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
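
The pipeline named in this abstract (normalize, sparse signed projection, clip, scalar-quantize) can be sketched in pure Python. This is not the authors' Rust codec: the 96 output dimensions at 4 bits each (which happens to give 48 bytes), the 4 nonzeros per input coordinate, the clip value, and the BLAKE2-based hashing are all hypothetical choices made only to show the shape of the idea:

```python
import hashlib
import math
import struct

# All settings below are assumptions for illustration, not the paper's values.
IN_DIM, OUT_DIM, BITS, NNZ, CLIP = 384, 96, 4, 4, 0.3


def _slot(i: int, j: int):
    """Deterministic (bucket, sign) for the j-th nonzero of input coordinate i."""
    digest = hashlib.blake2b(struct.pack("<II", i, j), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    return value % OUT_DIM, (1.0 if (value >> 63) & 1 else -1.0)


def project(vec):
    """Sparse signed projection of the L2-normalized input vector."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    out = [0.0] * OUT_DIM
    for i, x in enumerate(vec):
        for j in range(NNZ):
            bucket, sign = _slot(i, j)
            out[bucket] += sign * (x / norm) / math.sqrt(NNZ)
    return out


def encode(vec) -> bytes:
    """Clip each projected coordinate, quantize to 4 bits, pack two per byte."""
    levels = (1 << BITS) - 1
    codes = []
    for y in project(vec):
        y = max(-CLIP, min(CLIP, y))
        codes.append(round((y + CLIP) / (2 * CLIP) * levels))
    return bytes((codes[k] << 4) | codes[k + 1] for k in range(0, OUT_DIM, 2))


def score(query, code: bytes) -> float:
    """Inner product of a floating-point query projection with a stored sketch."""
    levels = (1 << BITS) - 1
    q = project(query)
    total = 0.0
    for k, byte in enumerate(code):
        for offset, c in ((0, byte >> 4), (1, byte & 0x0F)):
            total += q[2 * k + offset] * (c / levels * 2 * CLIP - CLIP)
    return total


vec = [math.sin(i + 1) for i in range(IN_DIM)]
sketch = encode(vec)
print(len(sketch))  # 48
```

The storage arithmetic from the abstract follows directly: 384 f32 values take 384 × 4 = 1536 bytes, and 1536 / 48 = 32. Because the projection is derived from hashes of the coordinate indices alone, new vectors can be encoded without any training pass or corpus statistics, which matches the "stateless" property the abstract emphasizes.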

[AI-115] PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Link: https://arxiv.org/abs/2605.28032
Authors: Xiang Wang,Tingting Zhang,Sen Wang,Ying Wu,Heng Meng,Peng Zhou,Peng Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

[AI-116] SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection ICML2026

Link: https://arxiv.org/abs/2605.28030
Authors: Shuhao Chen,Weisen Jiang,Yeqi Gong,Shengda Luo,Chengxiang Zhuo,Zang Li,James T. Kwok,Yu Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted by ICML 2026

Abstract:Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections with a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at this https URL.

[AI-117] Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

Link: https://arxiv.org/abs/2605.28010
Authors: Bowen Wei,Nan Wang,Yuqing Zhou,Jinhao Pan,Ziwei Zhu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM’s intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B–4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at this https URL.

[AI-118] Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Link: https://arxiv.org/abs/2605.28008
Authors: Kohsei Matsutani,Gouki Minegishi,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

[AI-119] Learning Compositional Latent Structure with Vector Networks

Link: https://arxiv.org/abs/2605.28007
Authors: Niclas Pokel,Benjamin F. Grewe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.
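
One way to read the "input-specific low-rank weight matrix" in this abstract is as a sparse combination of rank-1 atoms. The symbols below ($u_k$, $v_k$, $c_k$, $S(x)$, $M$) are notation assumed here for illustration, not the paper's:

```latex
W(x) \;=\; \sum_{k \in S(x)} c_k(x)\, u_k v_k^{\top},
\qquad S(x) \subseteq \{1, \dots, M\}, \quad
\operatorname{rank} W(x) \le |S(x)| .
```

Under this reading, the library $\{u_k v_k^{\top}\}_{k=1}^{M}$ is shared across inputs, while the active set $S(x)$ and coefficients $c_k(x)$ are inferred per sample, so a novel combination of familiar factors corresponds to a new choice of $S(x)$ over existing atoms.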

[AI-120] An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

Link: https://arxiv.org/abs/2605.28001
Authors: J. Vijayavallabh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 19 pages, 4 figures, 9 main pages remaining supplementary and appendix

Abstract:We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets $K \in \{600, 1000\}$, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.

[AI-121] Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Link: https://arxiv.org/abs/2605.28000
Authors: Swanand Rao
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures, 3 tables. Code: this https URL

Abstract:Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand-written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation-carrying toolchain for converting natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposing those artifacts to agents through a token-efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent-scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP-facing routing model, governance controls, and initial reproducible benchmarks from the open-source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves aggregate micro-F1 of 0.901 while reducing estimated task-flow tool context by 99.2% relative to naive full-catalog schema exposure. In a 25-case end-to-end generation probe over local-tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches micro-F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state-of-the-art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross-system evaluation.
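
For readers unfamiliar with the micro-F1 figures quoted in this abstract, the sketch below shows how a micro-averaged F1 is typically computed over tool-selection cases. Representing each case as a pair of tool-name sets, and the tool names themselves, are assumptions for illustration; the paper's benchmark may score routing differently:

```python
def micro_f1(cases) -> float:
    """Micro-averaged F1 over (predicted_tools, expected_tools) set pairs.

    Counts are pooled across all cases before precision and recall are
    computed, which is what distinguishes micro from macro averaging.
    """
    tp = fp = fn = 0
    for predicted, expected in cases:
        tp += len(predicted & expected)   # tools correctly selected
        fp += len(predicted - expected)   # extra tools selected
        fn += len(expected - predicted)   # required tools missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical routing outcomes: one exact match, one extra tool, one miss.
toy_cases = [
    ({"read_file"}, {"read_file"}),
    ({"http_get", "parse_json"}, {"http_get"}),
    ({"send_email"}, {"send_email", "lookup_contact"}),
]
print(micro_f1(toy_cases))  # 0.75
```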

[AI-122] Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Link: https://arxiv.org/abs/2605.27996
Authors: Max Lamparth,Daniel Fein,Andreas Haupt,Marcel Hussing,Mykel J. Kochenderfer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.

[AI-123] AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Link: https://arxiv.org/abs/2605.27995
Authors: Kou Shi (1), Ziao Zhang (1), Shiting Huang (1), Avery Nie (2), Zhen Fang (1), Qiuchen Wang (1), Lin Chen (1), Huaian Chen (1), Zehui Chen (1), Feng Zhao (1)
Affiliations: (1) University of Science and Technology of China; (2) University of Toronto
Categories: Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract:Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
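
The idea of using idle time while waiting on tool responses can be sketched with `asyncio`. This is a minimal illustration rather than the AsyncTool environment: the tool names and latencies are hypothetical, and `asyncio.sleep` stands in for real tool latency.

```python
import asyncio
import time

async def call_tool(name, latency_s):
    """Simulated tool call: the sleep stands in for network or tool latency."""
    await asyncio.sleep(latency_s)
    return f"{name}: done"

async def run_sequential(tasks):
    """Wait for each tool response before issuing the next call."""
    return [await call_tool(name, latency) for name, latency in tasks]

async def run_concurrent(tasks):
    """Issue all calls up front and wait on them together."""
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in tasks))

def timed(coro_fn, tasks):
    """Run a coroutine function to completion and report wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tasks))
    return results, time.perf_counter() - start

# Hypothetical independent tasks, each with 50 ms of simulated tool latency.
TASKS = [("weather_lookup", 0.05), ("flight_search", 0.05), ("hotel_search", 0.05)]

seq_results, seq_time = timed(run_sequential, TASKS)
con_results, con_time = timed(run_concurrent, TASKS)
print(f"sequential: {seq_time:.3f}s, concurrent: {con_time:.3f}s")
```

With independent tasks, the concurrent run takes roughly the longest single latency rather than the sum. Tasks with dependencies between them would need explicit ordering, which is where the dependency tracking and state maintenance mentioned in the abstract come in.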

[AI-124] STAB: Specification-driven Testing for Algorithmic Bottlenecks

Link: https://arxiv.org/abs/2605.27981
Authors: Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures, 8 tables

Abstract:Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at this https URL.

[AI-125] Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

Link: https://arxiv.org/abs/2605.27970
Authors: Simardeep Singh, Paras Chopra
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 28 figures

Abstract:While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model’s internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

[AI-126] The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

Link: https://arxiv.org/abs/2605.27965
Authors: Navid Rezazadeh, Arash Gholami Davoodi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivation inside long-form reasoning traces. On 6,000 Qwen3-8B AIME traces, we annotate segment-level backtrack severity and analyze event timing, normalized depth, and local burst structure. We find that early isolated repair is often compatible with correct reasoning, whereas incorrect traces more often show moderate-to-severe backtracks that persist and cluster late. Cross-corpus checks show the same qualitative asymmetry across additional model/domain pairs. Filtering analyses instantiate the signal as a prefix-causal selective early-exit policy: at shallow and intermediate depths, burst-aware filtering outperforms fixed length-based filtering while using only prefix-available features. Moderate length cutoffs remain strong completed-trace baselines, but burst-aware control provides a deployable mechanism for separating recoverable repair from likely instability.
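
A prefix-causal burst check of the kind described above might look like the following sketch. It is an assumption-laden illustration, not the authors' policy: the severity scale (0 = none, 1 = mild, 2 = moderate, 3 = severe), the window size, and the thresholds are all hypothetical.

```python
from collections import deque

def first_burst_exit(severities, window=5, min_severity=2, burst_size=3):
    """Return the first segment index at which a backtracking burst is seen.

    A burst is assumed to mean at least `burst_size` segments with severity
    >= `min_severity` among the last `window` segments. Only segments up to
    the current index are inspected, so the decision is prefix-causal.
    Returns None if no burst occurs.
    """
    recent = deque(maxlen=window)
    for index, severity in enumerate(severities):
        recent.append(severity >= min_severity)
        if sum(recent) >= burst_size:
            return index
    return None

# Isolated early repair: one moderate backtrack, never clustered.
isolated_trace = [0, 2, 0, 0, 0, 0, 1, 0]
# Late clustered backtracks: several moderate-to-severe events close together.
clustered_trace = [0, 0, 1, 0, 2, 3, 0, 2, 2]

print(first_burst_exit(isolated_trace))   # None
print(first_burst_exit(clustered_trace))  # 7
```

The isolated repair never triggers an exit, while the clustered trace triggers one at segment 7, the first point where three moderate-or-worse backtracks fall inside a five-segment window.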

[AI-127] From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection (ICML 2026)

Link: https://arxiv.org/abs/2605.27944
Authors: Ke Liu, Jiwei Wei, Wenyu Zhang, Shuchang Zhou, Ruikun Chai, Yutao Dai, Chaoning Zhang, Yang Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: Accepted by ICML 2026

Abstract:With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio-visual deepfake detection typically rely on cross-modal inconsistencies. In singing, rhythmic vocalization weakens this coupling and introduces a nontrivial domain shift, substantially degrading detection performance. We construct the Singing Head DeepFake (SHDF) dataset using rhythm-aware generative models to fill the gap in singing benchmarks. To cope with cross-scenario domain shifts, we propose a Text-guided Audio-Visual Forgery Detection (T-AVFD) framework that generalizes across both talking and singing scenarios. T-AVFD comprises a facial authenticity pattern learner and a multi-modal differential weight learning module. The pattern learner aligns facial features with multi-granularity textual descriptions to learn generalizable authenticity patterns. The weight learning module preserves intrinsic audio-visual consistency and adaptively integrates it with authenticity patterns via differential weighting. Extensive experiments on multiple talking head deepfake datasets and SHDF show consistent improvements over existing baselines and strong robustness under diverse perturbations.

[AI-128] Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Link: https://arxiv.org/abs/2605.27935
Authors: Zhenyu Cui, Xiangzhong Luo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

[AI-129] DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Link: https://arxiv.org/abs/2605.27931
Authors: Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang, Xiangyu Fu, Jingyuan Tan, Zirui You, Yixiang Jian, Yu-Ping Wang, Chengliang Chai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 23 pages, 9 figures

Abstract:Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at this https URL and this https URL.

[AI-130] Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Link: https://arxiv.org/abs/2605.27922
Authors: Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 11 tables. The first three authors contributed equally

Abstract:LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness’s native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

[AI-131] SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Link: https://arxiv.org/abs/2605.27911
Authors: Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang, Yuhan Ye, Fangyu Zheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.

[AI-132] Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Link: https://arxiv.org/abs/2605.27906
Authors: Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.

[AI-133] Dr-CiK: A Testbed for Foresight-Driven Agents

Link: https://arxiv.org/abs/2605.27904
Authors: Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually 5%), are frequently misled by distractors (80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
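
The two retrieval figures quoted above (evidence recovered and distractor citations) can be read as set-overlap ratios. The sketch below assumes one plausible definition of each; the abstract does not spell out the benchmark's exact formulas, and the document IDs are hypothetical.

```python
def evidence_recall(cited, gold):
    """Fraction of ground-truth supporting documents that the agent cited."""
    gold = set(gold)
    return len(set(cited) & gold) / len(gold) if gold else 0.0

def distractor_citation_rate(cited, distractors):
    """Fraction of the agent's citations that point at distractor documents."""
    cited = set(cited)
    return len(cited & set(distractors)) / len(cited) if cited else 0.0

# Hypothetical corpus: 20 gold evidence documents and 50 distractors.
gold_docs = {f"gold_{i}" for i in range(20)}
distractor_docs = {f"noise_{i}" for i in range(50)}

# Hypothetical agent output: five citations, only one of them gold.
cited_docs = ["gold_3", "noise_1", "noise_7", "noise_12", "noise_40"]

recall = evidence_recall(cited_docs, gold_docs)
distractor_rate = distractor_citation_rate(cited_docs, distractor_docs)
print(f"evidence recall: {recall:.0%}, distractor citations: {distractor_rate:.0%}")
```

The example is constructed to land on 5% recall and 80% distractor citations, the same magnitudes the abstract reports, to show how an agent can cite plenty of material while recovering almost none of the relevant evidence.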

[AI-134] SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Link: https://arxiv.org/abs/2605.27899
Authors: Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

[AI-135] A Unified Framework for the Evaluation of LLM Agentic Capabilities

Link: https://arxiv.org/abs/2605.27898
Authors: Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction–tool–environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark’s original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at this https URL, this https URL.

[AI-136] PortBench: A Correlation-Aware Full-Pipeline Benchmark for LLM-Driven Portfolio Management

Link: https://arxiv.org/abs/2605.27887
Authors: Yuxuan Zhao, Sijia Chen, Ningxin Su
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Portfolio Management (q-fin.PM)
Comments:

Abstract:LLMs have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task, remains poorly benchmarked. Existing benchmarks exhibit two main gaps: they ignore cross-asset correlation structures, thereby failing to distinguish genuinely diversified portfolios from concentrated ones, and fail to evaluate the complete PM decision pipeline in real-world scenarios. We introduce PortBench, a benchmark spanning six heterogeneous asset classes over ten years. PortBench consists of two complementary layers: a static QA dataset of 6,269 correlation-based questions across seven task templates, and a dynamic five-stage allocation pipeline that mirrors the full PM decision cycle. To evaluate these layers, we introduce two dedicated metrics: a dual-layer correlation score that measures whether proposed portfolios exploit inter-class hedging and avoid intra-class concentration, and CEPS, a metric that quantifies how reasoning errors compound across pipeline stages. We further assess strategy robustness and investor alignment under three historical stress regimes and risk profiles. Evaluating ten frontier LLMs, we find that despite strong performance on static financial QA, 90% of model-profile combinations fail to outperform a basic equal-weight allocation, and models that satisfy every procedural constraint still suffer catastrophic drawdowns under stress. Our source code is available at this https URL.

[AI-137] Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Link: https://arxiv.org/abs/2605.27879
Authors: Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.

[AI-138] SPAR: Support-Preserving Action Rectification

Link: https://arxiv.org/abs/2605.27877
Authors: Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.

[AI-139] AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

Link: https://arxiv.org/abs/2605.27873
Authors: Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent’s own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

[AI-140] FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Link: https://arxiv.org/abs/2605.27864
Authors: Di Zhu, Lei (Nico) Zheng, Zihan Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 32 pages; 12 figures

Abstract:Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph “second brain” that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

[AI-141] From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation

Link: https://arxiv.org/abs/2605.27861
Authors: Juergen Dietrich
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 12 pages, 1 figure

Abstract:Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.

[AI-142] C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Link: https://arxiv.org/abs/2605.27860
Authors: Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.

[AI-143] MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

Link: https://arxiv.org/abs/2605.27853
Authors: Thao Nguyen, Heng Ji
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent’s reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: this https URL.

[AI-144] When Context Flips Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Link: https://arxiv.org/abs/2605.27851
Authors: Dasol Choi, Alex Kwon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.

[AI-145] TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Link: https://arxiv.org/abs/2605.27850
Authors: Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
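
As a rough illustration of the three-objective selection the abstract mentions (task performance, token cost, structural complexity), a Pareto front over candidate configurations can be computed as sketched below. This is not the paper's algorithm: the tuple layout, the function names `dominates` and `pareto_front`, and the sample numbers are all hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: Pareto filtering over (accuracy, tokens, complexity).
# Accuracy is maximized; token cost and structural complexity are minimized.
# None of these names or numbers come from the paper.

def dominates(a, b):
    """Return True if candidate a is no worse than b on every objective
    and strictly better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Made-up candidates: (accuracy, tokens used, number of communication edges)
candidates = [
    (0.90, 1000, 5),
    (0.85, 800, 4),
    (0.80, 1200, 6),  # dominated by the first candidate
]
front = pareto_front(candidates)
print(front)
```

A search procedure could then inspect the shape of this front, for example how many non-dominated candidates remain, to decide how aggressively to explore; how TCP-MCP actually does this is not specified in the abstract.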

[AI-146] EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

Link: https://arxiv.org/abs/2605.27846
Authors: Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.

[AI-147] Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

Link: https://arxiv.org/abs/2605.27845
Authors: Hiroto Fukada, Takayuki Mizuno
Affiliation: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
Comments: 8 pages, 4 figures, 3 tables

Abstract:Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier–customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and long-tail inter-firm links poorly captured in structured data. Public web evidence can partly complement this gap through corporate, government, and trade-media disclosures; however, full-text web mining at scale is costly because pages are often inaccessible or expensive to process with large language models (LLMs). We propose a snippet-driven method for constructing a supply chain knowledge graph (SCKG), with firms as nodes and inter-firm relationships as edges. Web search snippets are query-biased summaries returned with search results. We use them as a scalable first-pass evidence layer for LLM-based relationship extraction. We evaluate the pipeline in terms of extraction efficiency and coverage. For extraction efficiency, exhaustive full-text chunking discovers 19.8× more unique relationships than snippets, but requires 251.2× more input tokens and yields higher redundancy. For coverage, we use 130,685 Chinese firms as search seeds, covering Shanghai/Shenzhen-listed firms and large unlisted firms as of 2024. In the listed-firm subset, the resulting SCKG covers 7.2× more firms and 9.3× more relationships than the CSMAR disclosure-based benchmark, while revealing heavy-tailed degree patterns. Retained provenance metadata make the SCKG an auditable complement to disclosure-based databases.

[AI-148] Symmetry Defeats Auditing

Link: https://arxiv.org/abs/2605.27836
Authors: Nick Merrill, Zeke Medley
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We demonstrate an attack on Introspection Adapters (Shenoy et al., 2026).

[AI-149] Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions – A Governance Framework for High-Stakes AI Systems

Link: https://arxiv.org/abs/2605.27827
Authors: Khalid Adnan Alsayed
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 3 figures, governance-oriented framework for operational AI deployment assurance in high-stakes systems

Abstract:AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.

[AI-150] EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Link: https://arxiv.org/abs/2605.27820
Authors: Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 68 pages, 6 figures

Abstract:As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents’ interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

[AI-151] ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Link: https://arxiv.org/abs/2605.27819
Authors: Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model’s downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.

[AI-152] Constrained Auto-Bidding via Generative Response Modeling

Link: https://arxiv.org/abs/2605.27811
Authors: Eunseok Yang, Xingdong Zuo, Kyung-Min Kim
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
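
The "1D root-finding step" described in the abstract can be pictured with a plain bisection over the bid multiplier. The sketch below is only an illustration under stated assumptions: the predicted cost curve is taken to be non-decreasing in the multiplier, the constraint is a simple budget cap, and the function name `solve_multiplier`, its parameters, and the linear toy curve are hypothetical rather than taken from the paper.

```python
# Hypothetical sketch: choose the largest bid multiplier whose predicted
# horizon cost stays within budget, assuming cost_fn is non-decreasing.
# The paper's actual controller and response curves are not reproduced here.

def solve_multiplier(cost_fn, budget, lo=0.0, hi=10.0, tol=1e-6, max_iter=100):
    """Bisection for the largest m in [lo, hi] with cost_fn(m) <= budget."""
    if cost_fn(hi) <= budget:
        return hi  # constraint is inactive over the whole search range
    if cost_fn(lo) > budget:
        return lo  # infeasible even at the lowest multiplier; fall back to lo
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if cost_fn(mid) <= budget:
            lo = mid  # mid is feasible, so the answer is at or above mid
        else:
            hi = mid  # mid overspends, so the answer is below mid
        if hi - lo < tol:
            break
    return lo

# Toy predicted cost curve (made up): spend grows linearly with the multiplier.
predicted_cost = lambda m: 100.0 * m
print(solve_multiplier(predicted_cost, budget=250.0))
```

A ratio target such as cost-per-acquisition could be handled the same way by bisecting on predicted cost divided by predicted value, provided that ratio is also monotone in the multiplier; that monotonicity is an assumption here, not a claim from the abstract.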

[AI-153] GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

Link: https://arxiv.org/abs/2605.27799
Authors: Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.

[AI-154] Locality-Aware Redundancy Pruning for LLM Depth Compression

Link: https://arxiv.org/abs/2605.27786
Authors: Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy.

[AI-155] A Query Engine for the Agents

Link: https://arxiv.org/abs/2605.27785
Authors: Kenny Daniel
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 4 pages, 1 figure, 3 tables

Abstract:The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking (“show me where the agent got confused”) cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.

[AI-156] Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

Link: https://arxiv.org/abs/2605.27784
Authors: Lu Yan, Xuan Chen, Xiangyu Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.

[AI-157] Auditable Decision Models with Learned Abstention and Real-Time Steering

Link: https://arxiv.org/abs/2605.27768
Authors: Sankaranarayanan Palamadai Chandrasekaran
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures

Abstract:Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into action labels, while generative systems can produce outputs that are difficult to interpret as auditable execution decisions. We study operational decision control for AI systems, where uncertainty must be explicitly routable, policy-governed, and auditable rather than hidden inside forced predictions or free-form generation. We present EvaluatorDPT, a bounded decision-control model that predicts YES, NO, or TBD, where TBD is learned as a deferral outcome rather than added only as a post-hoc confidence rule. The model uses a transformer encoder with a primary bounded-decision head and structured auxiliary channels for values and emotions/sentiments. The interface is domain-agnostic in form: a deployment domain supplies evidence and policy thresholds, while the model emits a bounded distribution that can be controlled at inference time through recorded operating thresholds and, when validated, auxiliary semantic signals. For the evaluated model version, we report decision performance on held-out validation and test splits; auxiliary emotion metrics are omitted because the emotion head is disabled for this evaluation. On the held-out test split (n=44,597), the model achieves Accuracy = 0.8260 and Macro F1 = 0.8252, with per-class F1 of 0.8314 (YES), 0.8486 (NO), and 0.7956 (TBD). The evaluation record also includes calibration evidence (ECE = 0.0338 on validation), threshold-sweep outputs, multi-seed stability checks, confusion matrices, and reproducibility commands. Our main contribution is a bounded execution interface in which deferral is learned, inference-time routing remains inspectable, auxiliary signals provide a path to auditable behavior control, and evaluation evidence supports external review.
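
As a quick consistency check on the reported numbers: macro F1 is the unweighted mean of the per-class F1 scores, so the three per-class values quoted in the abstract should average to the reported Macro F1. The helper name `macro_f1` below is introduced only for this illustration.

```python
# Macro F1 is the unweighted mean of per-class F1 scores.
# The three values are the per-class scores quoted in the abstract.

def macro_f1(per_class_f1):
    """Average per-class F1 scores with equal weight per class."""
    return sum(per_class_f1) / len(per_class_f1)

scores = {"YES": 0.8314, "NO": 0.8486, "TBD": 0.7956}
print(round(macro_f1(list(scores.values())), 4))  # 0.8252, matching the reported Macro F1
```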

[AI-158] Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Link: https://arxiv.org/abs/2605.27766
Authors: Aman Priyanshu, Supriti Vijay, Esha Pahwa
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single turn to multi turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single turn evaluations would never surface.

[AI-159] Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

Link: https://arxiv.org/abs/2605.27765
Authors: Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures

Abstract:Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model’s own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO’s KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO’s advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question’s SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model’s evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
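
The weighting rule in the abstract, the square root of the empirical pass rate times its complement, is simple enough to sketch directly. The version below is a minimal illustration and not the authors' code: the function name `sc_sdpo_weights` is hypothetical, and dividing by the batch mean is only one plausible reading of "batch-adaptive normalization".

```python
import math

# Hypothetical sketch of pass-rate weighting: w_i = sqrt(p_i * (1 - p_i)),
# where p_i is the empirical pass rate of question i over its rollouts.
# Normalizing by the batch mean is an assumption, not confirmed by the abstract.

def sc_sdpo_weights(pass_rates, eps=1e-8):
    """Return per-question loss weights, rescaled to have mean 1 over the batch."""
    raw = [math.sqrt(max(p * (1.0 - p), 0.0)) for p in pass_rates]
    mean = sum(raw) / len(raw)
    if mean < eps:
        # Every question is always solved or never solved: no learning signal.
        return [0.0] * len(raw)
    return [r / mean for r in raw]

# Questions that are always failed (0.0) or always solved (1.0) get zero
# weight; the intermediate-difficulty question (0.5) carries all the weight.
print(sc_sdpo_weights([0.0, 0.5, 1.0]))  # approximately [0.0, 3.0, 0.0]
```

The weight peaks at a pass rate of 0.5 and vanishes at 0 and 1, which is the "sweet spot" behavior the abstract attributes to GRPO's group-relative advantage.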

[AI-160] PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

Link: https://arxiv.org/abs/2605.27762
Authors: Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure–correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

[AI-161] SkillGrad: Optimizing Agent Skills Like Gradient Descent

Link: https://arxiv.org/abs/2605.27760
Authors: Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

[AI-162] High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

Link: https://arxiv.org/abs/2605.27758
Authors: Deepak Akhare, Mohammad Amin Nabian, Corey Adams, Sudeep Chavare, Sanjay Choudhry
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:

Abstract:Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework provides a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarked on complex bumper beam and full-vehicle crash datasets, GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a suite of temporal prediction recipes, including one-shot, time-conditional, and autoregressive rollout strategies, demonstrating that the one-shot approach achieves state-of-the-art accuracy with significantly reduced training overhead and inference latency. As a secondary contribution, we introduce a Fast Low-rank Attention Routing Engine (FLARE)-based modification to the GeoTransolver attention backbone that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients, preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.

[AI-163] Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Link: https://arxiv.org/abs/2605.27752
Authors: Hankyeol Kim, Pilsung Kang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single prompt template, probability scale, and output format. We then vary the measurement axes that define the verbalized-vs-token comparison: which answer string receives the token-probability score, how that score is read from the answer tokens, and under which conditioning context it is measured. We evaluate this design on four QA benchmarks across three open 7–8B base/Instruct model families, with larger Qwen2.5 variants as same-family robustness checks. The resulting comparison is sensitive to these choices: conditioning context changes the sign or magnitude of the ECE gap across settings, token readout produces smaller but still sign-moving changes, and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings are close to parity rather than showing a large calibration gain for verbalized confidence. In a separate supplied-answer analysis, surface-plausible wrong answers receive nearly the same confidence as supplied gold answers, suggesting that verbalized confidence also reflects answer plausibility and provenance rather than correctness alone. We argue that both confidence signals should be treated as protocol-dependent behavioral measurements, and provide a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context.
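
The "ECE gap" discussed above compares the expected calibration error of two confidence signals on the same answers. As a minimal sketch of that quantity, the following uses the standard equal-width binned ECE estimator; the bin count and the function name are illustrative choices, not the paper's protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: 0/1 labels of the same length.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical scores for the same four answers under two signals.
labels = [1, 1, 1, 0]
verbalized = [0.9, 0.9, 0.9, 0.9]
token_prob = [0.6, 0.7, 0.8, 0.3]
ece_gap = expected_calibration_error(verbalized, labels) - expected_calibration_error(token_prob, labels)
```

A negative `ece_gap` would favor the verbalized signal and a positive one the token-probability signal; the abstract's point is that the sign can flip depending on how the token score is read out and conditioned.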

[AI-164] A Policy-Driven Runtime Layer for Agentic LLM Serving

Link: https://arxiv.org/abs/2605.27744
Authors: Rui Zhang, Chaeeun Kim, Liting Hu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
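
To make the idea of an online agent transition matrix more concrete, here is a rough sketch. It counts observed agent-to-agent handoffs and uses the resulting one-step probabilities to pick a prefetch candidate and an eviction victim. The class and method names are hypothetical, and the one-step heuristic is a deliberate simplification: the abstract describes survival-based eviction, whose details are not given here.

```python
from collections import defaultdict


class TransitionModel:
    """Online estimate of P(next agent | current agent) from observed handoffs."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, agent_id):
        # Record the handoff prev -> agent_id, then advance.
        if self.prev is not None:
            self.counts[self.prev][agent_id] += 1
        self.prev = agent_id

    def next_probs(self, agent_id):
        row = self.counts.get(agent_id, {})
        total = sum(row.values())
        if total == 0:
            return {}
        return {nxt: c / total for nxt, c in row.items()}

    def prefetch_candidate(self, agent_id):
        # Most likely next agent, or None if this agent was never seen.
        probs = self.next_probs(agent_id)
        return max(probs, key=probs.get) if probs else None

    def eviction_victim(self, current_agent, cached_agents):
        # Evict the cached agent least likely to run next (simplified heuristic).
        probs = self.next_probs(current_agent)
        return min(cached_agents, key=lambda a: probs.get(a, 0.0))


model = TransitionModel()
for agent in ["planner", "coder", "tester", "planner", "coder", "reviewer"]:
    model.observe(agent)
```

With this trace, `model.prefetch_candidate("planner")` returns `"coder"`, and `model.eviction_victim("planner", ["coder", "tester"])` returns `"tester"`, since no planner-to-tester handoff was observed.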

[AI-165] Worker Disagreement Reveals Sharp Directions in Local SGD ICML

Link: https://arxiv.org/abs/2605.27739
Authors: Tolga Dimlioglu, Kristi Topollai, Anna Choromanska
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages main body, 18 pages appendix - Accepted to HiLD 2026, ICML

Abstract:Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a large flatter bulk. Gradients tend to align disproportionately with these dominant directions, although stable progress often requires movement through flatter bulk directions. Estimating the dominant subspace is therefore useful but costly with direct Hessian-based methods. We show that standard Local SGD exposes this geometry through worker disagreement. We theoretically show that the worker-average gap covariance is shaped by stochastic-gradient noise and Hessian curvature, causing workers to disagree along sharp, curvature-sensitive directions. Thus, worker-average gaps provide a cheap Hessian-free estimator of the dominant subspace. Experiments on MLPs, CNNs, and Transformers show that subspaces formed by worker-average gaps capture a substantial fraction of the gradient component lying in the dominant Hessian eigenspace.

[AI-166] HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

Link: https://arxiv.org/abs/2605.27724
Authors: Kevin Lin, Ajay Mandlekar, Caelan Reed Garrett, Nikita Chernyadev, Yu Fang, Runyu Ding, Yuqi Xie, Justin Tran, Linxi Fan, Yuke Zhu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: website: this https URL

Abstract:Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demonstrations, which are time-intensive and difficult to collect via teleoperation. Existing data-generation algorithms can automatically synthesize demonstrations for manipulators, but they are ineffective on humanoids because their high-dimensional composite action spaces involve arms, legs, and torsos. We present HumanoidMimicGen, a method for generating humanoid legged loco-manipulation data. Our method adapts contact-rich whole-body skills from a handful of source demonstrations to new states, generalizing across changes in object pose. By interleaving these single- and dual-arm skills with whole-body locomotion and manipulation planning, the method generates stable, collision-free data across diverse scenes and layouts. To evaluate our approach, we introduce a new simulated loco-manipulation benchmark containing nine diverse tasks that test humanoid loco-manipulation capabilities. There, we demonstrate that HumanoidMimicGen automatically generates large datasets for imitation learning and enables a systematic study of how data generation and policy learning decisions impact model performance. We show that whole-body visuomotor policies co-trained with data generated by HumanoidMimicGen outperform those trained only on real-world data by 20%.

[AI-167] Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Link: https://arxiv.org/abs/2605.27712
Authors: Zhenghan Song, Yunyi Li, Yulong Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
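
A recursive two-state belief update of the kind the abstract mentions can be sketched with plain Bayes' rule. The sketch assumes that observations are conditionally independent given the eventual outcome and that calibrated likelihoods are already available for each observation; both are assumptions made for illustration, and the function names are hypothetical.

```python
def update_belief(belief, lik_success, lik_failure):
    """One Bayes step for P(y=1 | observations so far).

    lik_success = p(o_t | y=1), lik_failure = p(o_t | y=0).
    """
    numerator = belief * lik_success
    denominator = numerator + (1.0 - belief) * lik_failure
    if denominator == 0.0:
        return belief  # observation impossible under both states; keep the belief
    return numerator / denominator


def track(prior, likelihood_pairs):
    """Return the belief after each prefix observation."""
    beliefs = []
    belief = prior
    for lik_success, lik_failure in likelihood_pairs:
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs


# Two observations favoring success, then one uninformative observation.
trajectory = track(0.5, [(0.8, 0.2), (0.8, 0.2), (0.5, 0.5)])
```

The belief after step t depends only on observations up to t, which is what makes the estimate usable before the final answer is known.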

[AI-168] DeepSciVerify: Verifying Scientific Claim–Citation Alignment via LLM-Driven Evidence Escalation

Link: https://arxiv.org/abs/2605.27710
Authors: Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.

[AI-169] Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Link: https://arxiv.org/abs/2605.27703
Authors: Joan Vendrell Gallart, Russell Bent, Michael Grosskopf
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

[AI-170] Cross-Entropy Games and Frost Training

Link: https://arxiv.org/abs/2605.27701
Authors: Arthur Renard, Franck Gabriel, Valentin Hartmann, Clément Hongler
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures

Abstract:We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model’s ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.

[AI-171] CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

Link: https://arxiv.org/abs/2605.27700
Authors: Khashayar Khajavi, Shaghayegh Sadeghi, Rise Adhikari, Alexander Tessier
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible while containing corrupted metadata or pointing to papers that do not exist. We introduce CiteCheck, a hybrid framework for citation hallucination detection that verifies whether a citation corresponds to a real scholarly work and whether its metadata is faithful to that work. CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. We also construct a 982-citation physics benchmark with controlled corruptions that capture both subtle metadata drift and fully fabricated references. On the held-out test set, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines, including web-search and few-shot variants. These results show that reliable citation verification benefits from combining scholarly retrieval, structured LLM-based comparison, and calibrated decision rules.

[AI-172] Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

Link: https://arxiv.org/abs/2605.27697
Authors: Jinhao Liang, Sven Koenig, Ferdinando Fioretto
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot’s own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

[AI-173] Behavioural Analysis of Alignment Faking

Link: https://arxiv.org/abs/2605.27681
Authors: Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: preprint

Abstract:Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers – values, goal guarding, and sycophancy – and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

[AI-174] Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems

Link: https://arxiv.org/abs/2605.27674
Authors: Abile Jean, Kuniyilh S
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cyber-Physical Systems (CPS) integrate sensing, communication, computation, and control to support critical infrastructure, including smart grids, industrial automation, and control systems. In the electrical utility domain, various controllers are used in CPS to ensure the system detects and recovers from faults, such as voltage fluctuations, and to perform load balancing in distribution systems. Machine learning- and deep learning-based fault detection and localization frameworks have recently gained significant attention in CPS for their ability to identify anomalies and operational failures in real time. However, these intelligent models are vulnerable to adversarial machine learning attacks, particularly backdoor attacks. In a backdoor attack, an adversary injects malicious patterns into the training data so that the model behaves normally most of the time but produces attacker-controlled outputs when triggered by specific patterns. This paper investigates the threat of backdoor attacks against fault detection and localization mechanisms in recent ML pipelines used in modern CPS systems. We define these threats and explore how they can be realized by designing triggers and evaluating their success in the CPS domain. Our experiments show the attack is successful even with 10% of poisoning.

[AI-175] How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks ICML2026

Link: https://arxiv.org/abs/2605.27662
Authors: Teodor-Mihai Stupariu, Andrei Manolache
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026 Workshop on Weight-Space Symmetries

Abstract:Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural modifications such as constraint relaxation or approximate equivariance, while the role of the optimizer remains comparatively underexplored. We study this direction by comparing Muon and Adam across several equivariant and geometric architectures under pointcloud and molecular learning settings. On ModelNet40, where the comparison is clearest, Muon consistently improves over Adam across all architectures considered. We then analyze the trained ModelNet40 checkpoints through Hessian estimates, loss surface visualizations, and spectral properties of learned weights and intermediate representations. The checkpoints reached by Muon have larger Hessian curvature summaries but more regular loss surfaces, and their learned weights and representations have higher stable and effective ranks. These observations suggest that the interaction between optimizer design and geometric inductive bias deserves further attention from the community.

[AI-176] Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

Link: https://arxiv.org/abs/2605.27659
Authors: Gengyue Han, Yiheng Feng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when experiencing unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging latent context variable in meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows risk levels of the deployed policy to be adjusted dynamically, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.

[AI-177] Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Link: https://arxiv.org/abs/2605.27646
Authors: Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the product $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ random unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by 1.5%. A per-batch median-multiplier outlier extraction step ($C=3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B and Qwen2.5-7B and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within 0.02–0.03 ppl points at $\sim 5$ bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4$+ ppl, HQMQ + Med3$\times$ recovers fp16 quality within 0.02–0.10 ppl points at $\sim 5$ bits. HQMQ Pareto-dominates naive int by $3$–$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at 3.79 bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at 3.79 bits matches KIVI-4 ($\sim 4.5$ bits) within $\sim 1$ pt on CoQA, 0.6 pts on TruthfulQA, and 2.3 pts on GSM8K, at 16% fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
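
The multiplicative codebook described above can be sketched in a few lines of pure Python. The sketch builds the 24 unit Hurwitz quaternions, draws a seeded secondary codebook of random unit quaternions, and quantizes the direction of a 4-element chunk to the product with the largest dot product. Choosing the nearest codeword by dot product and storing the chunk norm as a full-precision float are assumptions made for illustration; the abstract does not specify either, and outlier extraction is omitted.

```python
import itertools
import math
import random


def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2)


def hurwitz_units():
    """The 24 unit Hurwitz quaternions: 8 axis units plus 16 half-integer units."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = [0.0] * 4
            q[axis] = sign
            units.append(tuple(q))
    units.extend(itertools.product((0.5, -0.5), repeat=4))
    return units


def random_unit_quaternion(rng):
    """Uniform random point on S^3 via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def quantize(chunk, primary, secondary):
    """Return (primary index, secondary index, norm) for a 4-element chunk."""
    norm = math.sqrt(sum(c * c for c in chunk))
    if norm == 0.0:
        return 0, 0, 0.0
    direction = [c / norm for c in chunk]
    best = (-2.0, 0, 0)
    for p, q_p in enumerate(primary):
        for s, q_s in enumerate(secondary):
            code = qmul(q_p, q_s)
            dot = sum(d * c for d, c in zip(direction, code))
            if dot > best[0]:
                best = (dot, p, s)
    return best[1], best[2], norm


def dequantize(p, s, norm, primary, secondary):
    """Rebuild the chunk as norm * (q_p * q_s)."""
    return tuple(norm * c for c in qmul(primary[p], secondary[s]))


primary = hurwitz_units()
rng = random.Random(0)                      # seeded codebook, as in the abstract
secondary = [random_unit_quaternion(rng) for _ in range(4)]
p_idx, s_idx, norm = quantize((0.3, -1.2, 0.7, 2.0), primary, secondary)
approx = dequantize(p_idx, s_idx, norm, primary, secondary)
```

With 4 secondary quaternions this gives 96 effective codewords while storing only the 4 random ones, mirroring the $24S$ codewords at $S$ stored parameters noted in the abstract.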

[AI-178] Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Link: https://arxiv.org/abs/2605.27644
Authors: Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

[AI-179] Reasoning and Planning with Dynamically Changing Norms

Link: https://arxiv.org/abs/2605.27622
Authors: Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: 8 pages, 1 figure, dataset included in anc

Abstract:To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has been less explored, only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We theoretically demonstrate our approach with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

[AI-180] Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

Link: https://arxiv.org/abs/2605.27619
Authors: Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction-encompassing joint clustering and dimensionality reduction-offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.

[AI-181] Laguna M.1/XS.2 Technical Report

Link: https://arxiv.org/abs/2605.27605
Authors: Julien Abadji,Marah Abdin,Connor Adams,Eric Alcaide,Mustafa Altun,Michele Artoni,Junze Bao,Uday Barar,Vassilis Bekiaris,Arkadii Bessonov,Benjamin Bütikofer,Jonathan Chang,Yen-Chun Chen,Dmitry Chernenkov,Yang Chi,Filippos Christianos,Fenia Christopoulou,Razvan-Andrei Ciocoiu,Tzachi Cohen,Yohann Coppel,Dmitrii Emelianenko,Brandon Fergerson,Brian Fitzgerald,Matthias Gallé,Alex Golonzovskyi,George Grigorev,Yiyang Hao,Christian Hensel,Jan Huenermann,Ye Ji,Sarthak Joshi,Eiso Kant,Kabir Khandpur,Seonghyeon Kim,Vladimir Kirichenko,Umut Kocasarac,Ilya Kochik,Ivan Komarov,Chaerin Kong,Anurag Koul,François-Joseph Lacroix,Sergei Laktionov,Waren Long,Quentin Malartic,Vadim Markovtsev,Afonso Marques,Robert McHardy,Carlos Mocholí,Dmitry Monakhov,Adam Morris,Martin Muller,Christian Mürtz,Robin Nabel,Thien Nguyen,Rok Novosel,Szymon Ozog,Aalhad Patankar,Aleksei Petrov,Alexandre Piché,Arthur Pignet,Teodor Poncu,Phil Potter,Alexander Rakowski,Pierre-Yves Ritschard,Jay Roberts,Joe Rowell,Piotr Sarna,Pierre-André Savalle,Uladzislau Sazanovich,Nikita Shapovalov,Arsenii Shevchenko,Mikhail Shilkov,Andrei Sokol,Mohamed Soliman,Jack Stephenson,Victor Storchan,Dragos-Constantin Tantaru,Artem Tyurin,Adrian Wälchli,Pengming Wang,Jianxiao Yang,Renat Zayashnikov,Alexander Zelenka Martin,Nikolay Zinov,Caroline Bercier,José Caldeira,Margarida Garcia,Tom George,Kabeer Gharzai,Glenn Hitchcock,Carson Klingenberg,Ivo Pinto,Varun Randery,Noah Smith,Arina Sugako,Jason Warner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Technical report to models released here: this https URL

Abstract:We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, covering pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at this https URL.

[AI-182] The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

Link: https://arxiv.org/abs/2605.27599
Authors: Deepak Panigrahy,Aakash Tyagi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments:

Abstract:Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10-based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi-step reasoning tasks. Separately, Rajat et al. show that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. We report a systematic energy-observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power-rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on-device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per-rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are “no plans to expose CPU rail information.” On-device per-process energy attribution - as performed on x86 via RAPL - is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy-attributed AI, propose an interim calibration bridge using external DC metering combined with GPU subtraction, and identify a standards-track path via SCMI powercap. Our findings motivate the low-carbon computing community to demand energy observability as a first-class hardware requirement.
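
The "external DC metering combined with GPU subtraction" idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' calibration procedure: it assumes the external meter and the NVML GPU readings have already been resampled onto the same timestamps, and the function name `residual_energy_joules` and the toy readings are hypothetical.

```python
from typing import Sequence


def residual_energy_joules(
    timestamps_s: Sequence[float],
    dc_power_w: Sequence[float],
    gpu_power_w: Sequence[float],
) -> float:
    """Integrate (DC input power - GPU power) over time with the trapezoidal rule.

    The result is the energy of everything except the GPU (CPU, memory,
    I/O, conversion losses), not a per-process figure.
    """
    if not (len(timestamps_s) == len(dc_power_w) == len(gpu_power_w)):
        raise ValueError("all three series must have the same length")
    residual = [dc - gpu for dc, gpu in zip(dc_power_w, gpu_power_w)]
    energy = 0.0
    for k in range(1, len(timestamps_s)):
        dt = timestamps_s[k] - timestamps_s[k - 1]
        energy += 0.5 * (residual[k] + residual[k - 1]) * dt
    return energy


# Toy readings for illustration only, not measurements from the paper.
energy_j = residual_energy_joules(
    [0.0, 1.0, 2.0],     # seconds
    [30.0, 40.0, 30.0],  # external DC meter, watts
    [10.0, 20.0, 10.0],  # GPU power as reported by NVML, watts
)
```

The subtraction yields a whole-platform residual, which is why the abstract treats it as an interim bridge rather than true per-process attribution.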

[AI-183] Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

Link: https://arxiv.org/abs/2605.27584
Authors: Yiting Huang,Wenting Zhu,Zekun Wang,Qingpo Yang,Yakai Chen,Zihui Xu,Yueyue Zhang,Sanchuan Guo,Xi Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments:

Abstract:The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, and other forms of online toxicity, making the effective governance of such harm a critical societal and computational challenge. While significant strides have been made in automating content moderation, existing research predominantly treats cyberbullying governance as passive, isolated detection at the post level. This reductionist view overlooks the continuous behavioral dynamics of users, the structural diffusion of toxic events, and the critical need for proactive mitigation. To bridge these gaps, this paper proposes a unified full-lifecycle governance framework that shifts the paradigm of cyberbullying governance from isolated static detection toward integrated, continuous, and proactive moderation. Drawing on cyberbullying research and adjacent fields, we systematically synthesize the state-of-the-art literature across four interconnected stages: (1) Content Identification, (2) User and Behavior Modeling, (3) Diffusion Dynamics and Early Warning, and (4) Intervention and Governance. Furthermore, we review available datasets and evaluation practices, and discuss emerging challenges including multimodality, explainability, algorithmic fairness, and the dual-use risks of generative AI, providing a roadmap for future research toward a safer and more resilient digital ecosystem.

[AI-184] You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

Link: https://arxiv.org/abs/2605.27580
Authors: Suraj Biswas,Saurav Gupta,Pritam Mukherjee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: 20 pages, 12 figures, 37 references. Companion to a prior SSRN preprint on causal architecture for human modelling

Abstract:A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual’s biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

[AI-185] Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Link: https://arxiv.org/abs/2605.27575
Authors: Nikita Benkovich,Vitalii Valkov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

[AI-186] LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Link: https://arxiv.org/abs/2605.27570
Authors: Gabriele Cesa,Thomas Hehn,Aleix Torres-Camps,Àlex Batlle Casellas,Jordi Ros-Giralt,Arash Behboodi,Tribhuvanesh Orekondy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Parallel LLM test-time scaling techniques (e.g., best-of-N) require drawing N > 1 sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching N generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among N > 1 sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.

[AI-187] RULER: Representation-Level Verification of Machine Unlearning

Link: https://arxiv.org/abs/2605.27569
Authors: Georgina Cosma,Axel Finke
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current protocols verify this at the output level through membership inference, retain accuracy, and forget-set accuracy, but a model can satisfy all three whilst still encoding forgotten records in its intermediate representations. We introduce RULER, a set of representation-level verification metrics. The oracle-comparative metric M2 measures whether forget-set records occupy the same representational position as in a model retrained without them. The oracle-free metric M4 detects residuals from the unlearned model’s internal similarity structure alone, without retraining. Four approximate unlearning methods all pass output-level evaluation, yet under a linear mixed-effects model M2 detects significant residuals in 10 of 12 conditions (p < 0.05), with effect sizes growing as the forget fraction increases. A fifth method, Bad Teacher, shows the same residuals despite a different forgetting mechanism. M4 acts as a pre-unlearning diagnostic across tabular, image, clinical text, and face-identity settings: it detects identity-level memorisation in face recognition models where no tested method fully erases the signal.

[AI-188] DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Link: https://arxiv.org/abs/2605.27566
Authors: Shijie Cao,Yuan Yuan,Jing Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

[AI-189] Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems ATC

Link: https://arxiv.org/abs/2605.27492
Authors: Yipeng Ouyang,Xin Huang,Bingjie Liu,Zhongchun Zheng,Yuhao Gu,Xianwei Zhang
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures. Project homepage: this http URL

Abstract:LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.

[AI-190] HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Link: https://arxiv.org/abs/2605.27489
Authors: Md Hafizur Rahman,Zafaryab Haider,Tanzim Mahfuz,Prabuddha Chakraborty
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages, 12 figures, 12 tables, and 1 algorithm

Abstract:Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves interpretability, but creates a propagation risk: a bounded perturbation to one component can be reused by other agents and amplified into system-level harm. We introduce HARP (Harm Amplification through Role Perturbation), a trace-first methodology for studying local-to-global harm amplification in multi-agent LLM systems. HARP compares paired clean and perturbed executions and records specialist outputs, tool calls, memory reads/writes, guard events, oracle logs, latency, token cost, and decisions. We define local harm as deviation from targeted agents or corrupted channels, global harm as deviation over the full trace, and harm amplification as (H_global/H_local). This complements attack success rate with a measure of how strongly orchestration spreads harm beyond the attack point. We instantiate HARP in a finance-oriented seven-agent system with a deterministic decision gate and configurable attack harness for specialist compromise, collusion, shared-context corruption, and temporal or memory-persistent attacks. Across five defenses, prompt-only defenses preserve benign utility but leave high success and stealth; pre-tool and step-level guards reduce some failures with utility or latency costs; and IntegrityGuard, a trace-consistency defense, achieves the lowest attack success and global harm but introduces utility/cost trade-offs. Results show that single-specialist compromise produces the strongest amplification, shared-context corruption yields the highest attack success, and temporal persistence produces the largest malicious impact. HARP argues that secure multi-agent evaluation must measure not only bypass, but propagation.
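
To make the ratio H_global/H_local concrete, here is a minimal sketch. It assumes harm has already been reduced to one non-negative deviation score per trace step and that local and global harm are plain sums; the abstract does not say how deviations are scored or aggregated, so the function name, the summation, and the toy scores are all hypothetical.

```python
from typing import Sequence, Set


def harm_amplification(deviations: Sequence[float], targeted_steps: Set[int]) -> float:
    """Return H_global / H_local for one clean-vs-perturbed trace pair.

    deviations[i] is a hypothetical non-negative deviation score for trace
    step i; targeted_steps holds the indices of the attacked agents or channels.
    """
    h_local = sum(d for i, d in enumerate(deviations) if i in targeted_steps)
    h_global = sum(deviations)
    if h_local == 0:
        raise ValueError("local harm is zero, so the ratio is undefined")
    return h_global / h_local


# Toy scores: only step 0 was attacked, yet later steps also deviate.
ratio = harm_amplification([0.5, 0.25, 0.75, 0.5], targeted_steps={0})
```

A ratio near 1 would mean the harm stayed at the attack point, while larger values indicate that orchestration spread it to other steps.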

[AI-191] Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Link: https://arxiv.org/abs/2605.27488
Authors: Qiancheng Wu,Wenhui Zhang,Gan Fang,Sheng Mao,Biao Gao,David Levitsky,Shawna Murphy Butterworth,Rob Cameron
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines and clouds. Although this high agency is productive, it creates a security problem: identity, authorization, provenance, and delegation are often pushed into application code, where they become difficult to enforce consistently and difficult to audit. We present Grimlock, an Agent Guard that restores separation of concerns by moving trust enforcement into the sandbox substrate while leaving agent code unchanged. Grimlock uses eBPF-enforced traffic interception to ensure that sandbox communication passes through a guard, and combines it with post-handshake attestation bound to standard TLS 1.3 channel bindings. After a channel is established, the guard authorizes communication and mints short-lived, channel-bound scope tokens that capture least-privilege delegation. At the receiving side, the destination guard re-validates identity, scope, and channel binding, terminates TLS, and releases plaintext to the destination sandbox only after policy checks succeed. kTLS provides an efficient dataplane for protected communication. As a result, Grimlock offers a path toward transparent, auditable, and scope-bound agent-to-agent communication across heterogeneous multi-cloud environments, using commodity Linux primitives and without requiring changes to user-layer orchestration code.

[AI-192] Energy-Structured Low-Rank Adaptation for Continual Learning ICML2026

Link: https://arxiv.org/abs/2605.27482
Authors: Longhua Li,Lei Qi,Qi Tian,Xin Geng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026

Abstract:While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion across the basis, hindering knowledge compaction and exhausting capacity for future tasks. We observe that output feature drift induced by parameter updates is inherently low-rank, and theoretically prove that preserving parameters along the principal directions of this drift minimizes the output reconstruction error. Motivated by this, we propose Energy-Concentrated and Energy-Ordered Low-Rank Adaptation (E^2-LoRA). By explicitly ordering and concentrating knowledge into leading ranks, E^2-LoRA frees capacity for subsequent tasks. Furthermore, we design a dynamic rank allocation strategy to balance stability and plasticity by jointly optimizing energy retention and model plasticity. Extensive experiments across multiple benchmarks demonstrate that E^2-LoRA achieves state-of-the-art performance.

[AI-193] Resource-Constrained Affect Modelling via Variance Regularisation Pruning

Link: https://arxiv.org/abs/2605.27479
Authors: Kosmas Pinitas,Konstantinos Katsifis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at the 2026 PErvasive Technologies Related to Assistive Environments (PETRA)

Abstract:Affective computing systems are increasingly embedded in pervasive and interactive environments, such as adaptive games, assistive technologies, and resource-constrained platforms, where computational efficiency must be balanced with reliability across diverse users. Model pruning offers an effective way to reduce computational demands, yet existing approaches typically optimise for sparsity alone, without accounting for how parameter removal impacts robustness across individuals. In this work, we introduce Variance-Regularised Pruning (VR), a pruning framework that explicitly incorporates cross-participant stability into the sparsification process. Rather than relying solely on average prediction error, VR evaluates each connection based on its joint contribution to both prediction accuracy and variability across users, prioritising parameters that remain reliable under distributional differences. We evaluate the proposed approach on the AGAIN dataset, which includes arousal annotations collected across nine affect-eliciting game environments. Experimental results demonstrate that VR maintains competitive Concordance Correlation Coefficient (CCC) performance even at 80% sparsity without additional fine-tuning, highlighting its suitability for deployment in real-world, resource-limited affect-aware systems. Overall, the proposed framework supports the development of compact, robust affective models that can operate reliably in real-world interactive environments.
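
For readers unfamiliar with the reported metric, the Concordance Correlation Coefficient can be computed as in the sketch below. It uses the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population (1/n) moments; whether the paper uses exactly this estimator is an assumption, and the function name and toy values are illustrative only.

```python
from typing import Sequence


def concordance_cc(pred: Sequence[float], target: Sequence[float]) -> float:
    """Concordance Correlation Coefficient between predictions and targets."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denom


# Toy arousal traces: a perfectly correlated but shifted prediction is penalised.
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, the mean-difference term in the denominator lowers the score when predictions are offset from the targets, so the shifted trace scores below the perfect one.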

[AI-194] Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective ICML2026

Link: https://arxiv.org/abs/2605.27476
Authors: Hyunmin Cho,Woo Kyoung Han,Kyong Hwan Jin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026 (Regular)

Abstract:We characterize the pre-softmax attention matrix $\mathbf{Q}\mathbf{K}^\top$ in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between Hopfield-style stability measures and the fidelity-diversity trade-offs in generation. Finally, we propose a controllable knob to modulate this trade-off by modifying the circulation of the underlying dynamics. Code is available at our GitHub (this https URL).
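
The decomposition the abstract refers to is the standard split of a square matrix A into (A + Aᵀ)/2 and (A − Aᵀ)/2. The sketch below illustrates it on a toy 2×2 matrix and checks that the skew-symmetric part contributes nothing to the quadratic form xᵀAx, which is one plausible reading of why the symmetric part alone shapes the energy. The helper names and the toy matrix are hypothetical, and the paper's actual energy function and stability measures are not reproduced here.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def symmetric_skew_split(a: Matrix) -> Tuple[Matrix, Matrix]:
    """Split a square matrix into its symmetric and skew-symmetric parts."""
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("matrix must be square")
    sym = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


def quadratic_form(m: Matrix, x: Sequence[float]) -> float:
    """Compute x^T m x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


# Toy stand-in for a pre-softmax score matrix (tokens x tokens).
scores = [[2.0, 4.0], [0.0, 6.0]]
sym, skew = symmetric_skew_split(scores)
x = [1.0, 3.0]
```

Here `sym + skew` recovers `scores` entry by entry, and `quadratic_form(skew, x)` is zero for any `x`, so `quadratic_form(scores, x)` equals `quadratic_form(sym, x)`.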

[AI-195] HEAL: Resilient and Self-* Hub-based Learning

Link: https://arxiv.org/abs/2605.27475
Authors: Mohamed Amine Legheraba(NPA),Stefan Galkiewicz(NPA),Maria Gradinariu Potop-Butucaru(NPA),Sébastien Tixeuil(NPA, IUF, LINCS)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Decentralized learning enhances privacy, scalability, and fault tolerance by distributing data and computation across nodes. A popular approach is Federated Learning, which relies on a central aggregator yet faces challenges such as server vulnerabilities, scalability issues, privacy risks and, most importantly, a single point of failure. Alternatively, Gossip Learning and Epidemic Learning offer full decentralization through peer-to-peer exchanges of model updates, ensuring robustness and privacy at the price of slower model convergence. In this work, we introduce a novel decentralized learning framework called HEAL. HEAL is the first cross-layer decentralized learning framework that exploits an optimized self-organizing and self-healing underlying P2P overlay, combining the strengths of Federated Learning, Gossip Learning, and Epidemic Learning. Leveraging the recently proposed Elevator algorithm, HEAL promotes dynamically chosen nodes to act as aggregators. Through simulations, we demonstrate that HEAL performs similarly to Federated Learning in crash-free settings, while being fully decentralized and fault-tolerant. In crash- and churn-prone environments, HEAL outperforms Gossip and Epidemic Learning.

[AI-196] AssertLLM 2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

Link: https://arxiv.org/abs/2605.27472
Authors: Yuchao Wu,Wenji Fang,Jing Wang,Wenkai Li,Ziyan Guo,Zhiyao Xie
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. While Large Language Models (LLMs) show promise for automating this process, existing benchmarks remain limited by unrealistic task formulations, weak specification inputs, and oversimplified evaluation. To address these limitations, we introduce AssertLLM2, an open-source benchmark for realistic assertion generation in hardware verification. AssertLLM2 contains 83 real-world designs across 13 functional categories. For each design, the benchmark provides a structured design specification, a verified dependency-complete golden RTL, and systematically mutated buggy RTL variants. These support two practical settings: bug-prevention, where assertions are generated from specifications to guard against design errors, and bug-hunting, where assertions are generated to expose discrepancies between intended behavior and faulty implementations. To the best of our knowledge, AssertLLM2 is the first benchmark to explicitly use buggy RTL as input to evaluate bug-detection capability. AssertLLM2 further adopts a more rigorous evaluation framework spanning syntactic validity, formal provability, coverage, and mutation-based bug detection. Our benchmark enables a more realistic and extensive assessment of assertion generation and establishes rigorous baselines for state-of-the-art LLMs in practical hardware verification.

[AI-197] Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection

Link: https://arxiv.org/abs/2605.27470
Authors: Tairan Huang,Qiang Chen,Yili Wang,Yueyue Ma,Changlong He,Xiu Su,Yi Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph anomaly detection aims to identify anomalous nodes in attributed graphs and plays an important role in real-world applications. However, existing graph anomaly detection methods still face two key challenges: 1) fixed pipelines, which restrict their adaptability across different graph tasks under limited supervision; 2) weak evidence, which prevents them from explicitly incorporating contextual and structural anomaly signals into the detection process. In this paper, we propose a novel framework, self-designing agentic workflows for few-shot graph anomaly detection (SignGAD). Specifically, we propose a novel paradigm that reformulates the graph anomaly detection task from training a fixed anomaly detector to designing task-conditioned detection workflows. By constructing detection workflows, SignGAD selects suitable graph encodings and detector designs to exploit task-specific anomaly evidence. Meanwhile, we introduce a guarded final refit strategy to refine the selected workflow by calibrating refit acceptance, enhancing reliability under limited supervision. Extensive experiments conducted on several real-world datasets demonstrate that SignGAD achieves strong performance against state-of-the-art methods, highlighting its effectiveness on graph anomaly detection tasks.

[AI-198] Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

Link: https://arxiv.org/abs/2605.27469
Authors: Zhong Ye,Yu Hu,Ruilin Tang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Continual Learning (CL) is a practical paradigm for exploiting the power of deep pre-trained neural networks, but which pre-trained model better balances plasticity and stability and therefore deserves to be chosen? The logit shift serves as a natural proxy because it measures how much a model's logits on prior tasks change in CL scenarios. However, obtaining the logit shift incurs a huge computational cost, which hinders large-scale model selection. Existing theoretical analyses fail to offer an efficient alternative because they assume uniform hidden-layer widths, which ignores the structural heterogeneity (variable width and depth) of real-world architectures. This raises a critical question: what theoretical relationship can be identified between a heterogeneous architecture and the logit shift on prior tasks (those the model has been trained on)? To answer this question, we decouple the logit shift into an architecture dependency and a data dependency to establish our framework, which reveals that the combination of the two dependencies, defined as Architecture-driven Shift (ADS), captures the logit shift tendency well and is computable with few data samples. Specifically, for a well-optimized model on prior tasks, higher ADS is associated with a larger logit shift after training on the current task, a result derived from three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, and (3) the asymptotic task conflict in wide networks. Extensive empirical results across more than 175 diverse architectures demonstrate a strong monotonic correlation (the weakest Spearman’s r_s = 0.731) between ADS and logit shift. Practically, we demonstrate that ADS can serve as a lightweight proxy of the expected calibration error, which is a widely used metric for reliable CL model selection, on three datasets across six scenarios.
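
The monotonic correlation reported above is Spearman's r_s, which is the Pearson correlation computed on ranks. A minimal pure-Python sketch follows; the ADS and logit-shift values are made-up toy numbers rather than results from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rs(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's r_s: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    norm_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if norm_x == 0 or norm_y == 0:
        raise ValueError("r_s is undefined when one input is constant")
    return cov / (norm_x * norm_y)


# Toy values only: one hypothetical ADS score and logit shift per architecture.
toy_ads = [0.1, 0.4, 0.2, 0.9]
toy_shift = [1.0, 3.0, 2.5, 8.0]
rs = spearman_rs(toy_ads, toy_shift)
```

Because only ranks matter, r_s rewards any monotonic relationship between ADS and logit shift, not just a linear one, which suits a proxy used for ranking candidate models.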

[AI-199] When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

Link: https://arxiv.org/abs/2605.27435
Authors: Pu Li,Jiawen Qi,Qinyu Chen
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (up to 1.6x), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05-1.2x). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. For the energy trend, increasing NPU offloading leads to higher energy consumption (up to 51%). Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.

[AI-200] Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey IJCAI2026

Link: https://arxiv.org/abs/2605.27431
Authors: Liangwei Nathan Zheng,Wei Emma Zhang,Olaf Maennel,Lin Yue,Weitong Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This survey paper has just been accepted by IJCAI 2026. Results were released by 30 April 2026. Because I could not find a dedicated place to submit the acceptance email, I have uploaded it alongside the LaTeX files of the paper, named as this http URL

Abstract:Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability across diverse modalities and tasks. Despite its growing success, a comprehensive and systematic review of MoE methods addressing multimodal challenges remains lacking. Existing surveys tend to evaluate either multimodal learning or MoE independently from method taxonomy, overlooking the unique interplay between them. This survey fills that gap by answering a central question: How does MoE effectively resolve multimodal challenges? We approach this from three key perspectives: (1) MoE as an Efficient Multimodal Engine: enabling scalable multimodal modeling by decoupling computational cost from parameter growth and mitigating modality redundancy through selective expert activation; (2) MoE as a Multimodal Representation Learner: integrating complementary multi-opinion expert knowledge to enrich alignment and interaction representations; and (3) MoE as a Multimodal Adapter: providing a modular and flexible mechanism to model imperfect data scenarios such as modality imbalance and missing modality. Through our extensive literature review, we identify critical research gaps, including interpretable routing, expert communication, modality integration, and lifelong multimodal learning. We position this survey as a foundation for future research toward interpretable and sustainable multimodal Mixture-of-Experts systems.

[AI-201] Advancing Direct Training for Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients

Link: https://arxiv.org/abs/2605.27412
Authors: Feifan Zhou,Xiang Wei,Yang Liu,Qiang Yu
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Spiking Neural Networks (SNNs) have emerged with promising energy-efficiency properties, yet a substantial performance gap persists compared to Artificial Neural Networks (ANNs). This gap stems from at least two key limitations: first, conventional spiking neurons offer limited information representation capacity, underutilizing the rich dynamics of membrane potentials; second, surrogate gradient (SG) functions that are fixed across time steps lead to imprecise gradient propagation, impeding effective direct training. To address these two challenges, we propose a new direct training algorithm with three core innovations: first, a circulate-firing spiking neuron model that enhances information representation capacity by leveraging membrane potentials more effectively; second, a time-step-wise learnable surrogate gradient function, enabling accurate gradient estimation during backpropagation; third, a positive-negative balanced loss function to achieve equilibrium between positive and negative membrane potentials and further boost SNN performance. Extensive experiments demonstrate that our methods achieve competitive performance across multiple datasets. Our methods generalize seamlessly to advanced Transformer architectures, consistently outperforming existing methods. Our work highlights the effectiveness of further harnessing the intrinsic membrane dynamics of SNNs for performance improvement, and thus opens a new avenue for advancing high-performance spiking neural architectures.

[AI-202] STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

Link: https://arxiv.org/abs/2605.27409
Authors: Shuhan Ye,Yi Yu,Qixin Zhang,Hui Lu,Jiaming He,Qinggang Zhang,Li Shen,Xudong Jiang
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation helps narrow this gap, yet the original training data are often unavailable in practical deployment settings. Existing data-free knowledge distillation (DFKD) methods synthesize surrogate data by matching teacher-side priors, especially BN statistics, but these ANN-oriented constraints mainly regularize mean and variance and therefore remain under-constrained for SNN students whose responses depend on threshold-crossing dynamics. In this paper, we propose Spike Tail-Aware Relational Synthesis (STARS), a plug-and-play method for ANN-to-SNN DFKD that augments standard BN-guided synthesis with two complementary objectives: Relational Consistency Alignment, which preserves cross-sample relational consistency between teacher and student, and Tail-Aware Regularization, which regularizes threshold-relevant tail probabilities through soft exceedance over teacher-derived thresholds. Together, these objectives generate synthetic batches that remain teacher-valid while becoming more informative for SNN students. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet across multiple ANN-SNN pairs show that our method consistently improves conventional DFKD baselines and even surpasses several KD methods, with gains of up to 4.6% on CIFAR-10 and 6.7% on CIFAR-100, highlighting the importance of complementing BN matching with relational and tail-aware constraints in SNN-oriented DFKD.

[AI-203] Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects

Link: https://arxiv.org/abs/2605.27407
Authors: Hudi He,Fukun Wang,Zhe Wang,Xinyi Wang,Shuhan Ye,Jiarui Liu,Qing Qing,Ziqi Xu,Xikun Zhang,Renqiang Luo
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assessments remain limited by superficial dataset diversity and idealized hardware assumptions. This work introduces the first systematic fairness benchmark for SNNs, addressing three critical dimensions of realism: (1) demographic coverage gaps in training data, (2) spurious feature leakage (e.g., skin tone as a proxy for class labels), and (3) deployment-environment mismatches (e.g., edge devices with constrained spike encoding). Our framework integrates four cross-demographic datasets with controlled bias injections and three neuromorphic hardware simulators (Loihi 2, SpiNNaker), enabling isolated analysis of fairness-performance trade-offs under resource constraints. Standardized evaluations of 12 state-of-the-art SNNs reveal stark disparities: models trained on biased data exhibit 23% higher false positive rates for underrepresented groups, while hardware limitations (e.g., reduced spike precision) further amplify accuracy gaps by up to 41% in edge deployments. Critically, bias mitigation strategies developed for cloud-based SNNs often degrade under resource constraints, highlighting the need for co-design principles that jointly optimize fairness and hardware efficiency. By bridging algorithmic fairness research with neuromorphic engineering, our benchmark provides a foundation for trustworthy SNNs in socially critical applications such as healthcare and autonomous systems. Our code is available at: this https URL.

[AI-204] Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams

Link: https://arxiv.org/abs/2605.27404
Authors: Haoyang Wang,Mingze Zhang,Yi Bu,Star Xing Zhao,Meijun Liu
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. However, recent advances in artificial intelligence (AI), particularly large language models (LLMs), are beginning to reshape academic writing and scientific research, potentially disrupting the longstanding trend toward ever-larger teams and transforming other dimensions of research team structure. Drawing on 147,074 full-text publications from the PLoS family and the Nature portfolio since 2020, we examined whether and how AI-assisted writing influences team structure and team outcomes in science. Using multiple methods, including ordinary least squares, quantile regression, Poisson regression, logistic regression, and propensity score matching, we found that research teams using AI-assisted writing tend to be younger and smaller. Importantly, this shift toward more compact, junior-leaning teams does not come at the expense of scientific impact. On the contrary, we observed that research teams employing AI-assisted writing had a higher probability of producing highly impactful publications. These results highlight the significant role of AI-assisted writing in reshaping not only how research is produced, but also how research teams are formed and assembled. Our findings call for policy improvements in research evaluation, funding, and training to address this emerging trend.

[AI-205] LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

Link: https://arxiv.org/abs/2605.27403
Authors: Xiomara Gonzalez,Gabriella Coloyan Fleming,Andrew Katz,Maya Denton,Jessica Deters
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing. Additionally, such reflections provide rich data for qualitative education research. However, qualitative data can be time-consuming to analyze. It is even more time-intensive to qualitatively compare findings between different groups of participants, usually limiting comparison to, at most, one variable (e.g., binary gender). Large language models (LLMs) have recently begun to be critically evaluated for use as qualitative research assistants. Using a longitudinal case of written student reflections (n=151) from a study abroad program, we investigate how LLM-assisted sentiment analysis can enable longitudinal mixed-methods research combining computational and thematic analyses. First, statistical testing is used to quantitatively compare sentiment differences according to seven different student identity/lived experience variables. Then, these results inform qualitative data analysis to investigate the reasons underlying these differences. For the case of undergraduate students studying abroad, we found that prior experience living abroad was the only personal variable impacting students’ sentiments of their verbal language and communication behaviors. This workflow has implications for how qualitative researchers can more easily probe multiple variables when comparing participants from different demographic groups.

[AI-206] Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

Link: https://arxiv.org/abs/2605.27401
Authors: Taylor Anderson,Sara Von Hoene,Orhan Yagizer Cinar,Emma Von Hoene,Amira Roess,Andrew Crooks,Hamdi Kavak
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 3 tables

Abstract:There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.
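
The iterative proportional fitting (IPF) step mentioned in this abstract can be illustrated with a minimal two-dimensional sketch. This is not the authors' pipeline: the function name `ipf`, the seed table, and the marginal targets below are all hypothetical, and a real workflow would fit many more dimensions per census tract. The seed plays the role of the (real or LLM-generated) survey cross-tabulation, and the targets play the role of census marginals.

```python
def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Two-dimensional iterative proportional fitting (illustrative sketch).

    seed: cross-tabulated counts from a survey sample (list of rows).
    row_targets / col_targets: known marginal totals; their sums must agree.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(iters):
        # Step 1: rescale every row so its sum matches the row target.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                table[i] = [v * target / row_sum for v in table[i]]
        # Step 2: rescale every column so its sum matches the column target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                for row in table:
                    row[j] *= target / col_sum
        # Columns are exact after step 2, so only the row error is checked.
        row_err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if row_err < tol:
            break
    return table


# Hypothetical example: a 2x2 survey cross-tab fitted to tract marginals.
fitted = ipf([[1.0, 2.0], [3.0, 4.0]], row_targets=[40.0, 60.0], col_targets=[50.0, 50.0])
```

Because IPF only rescales the seed, any association structure (or error) in the seed survives into the fitted table, which is consistent with the abstract's remark that IPF can amplify or reduce errors in the generated data.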

[AI-207] Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability

Link: https://arxiv.org/abs/2605.27399
Authors: Wolfgang Rohde
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures. Also available on SSRN: this https URL

Abstract:What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI systems are often presented as rational responses to technological progress, but that view is often structurally short-sighted. Across software development and adjacent knowledge industries, AI is increasingly attractive because it appears to reduce labor costs, speed output, and improve short-term metrics. Yet those gains may be achieved by drawing down human capabilities that are slow to build and difficult to restore. This paper develops a mechanism of capability masking and capability erosion under AI labor substitution. AI-generated output can create the appearance that organizational capability has been replaced, even when dependence on skilled human labor remains. That appearance can support hiring restraint while slower costs accumulate in the background. Evidence from AI-assisted coding shows that generated output still requires substantial human verification and remains uneven in correctness, maintainability, and security. Repository-level studies also suggest limits in handling broader codebase context. More broadly, labor-market, political-economy, and industrial-strategy evidence suggests that substitution pressures are being driven by managerial cost incentives and national competition while increasing risks of concentration and platform control. The result is a system that may look more efficient in the short term while becoming more fragile over time.

[AI-208] Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

Link: https://arxiv.org/abs/2605.27396
Authors: Rohith Nama
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often without step-by-step human approval. Existing AI literacy frameworks were built for a world in which humans evaluate AI outputs and decide whether to act; they have no vocabulary for the user who has delegated decision-making authority to an agent whose actions may not be observable, reversible, or controllable. This paper names the resulting problem agentic literacy debt: the accumulating societal deficit that grows when agentic AI systems are deployed at scale without corresponding literacy infrastructure. The debt compounds through three reinforcing channels (normalization of opaque delegation, multi-agent ecosystem complexity, and institutional path dependence), and it is incurred by the organizations that deploy agents but paid by the users, patients, and citizens on whose behalf the agents act. Evidence from healthcare, financial fraud, and global equity contexts suggests the gap is already consequential. The problem is structural, not a temporary lag that curriculum reform will close. It demands a reframing of AI literacy as a governance capability, not an evaluative one.

[AI-209] Informing AI Policy Assessment using Large-Scale Simulation of Interventions

Link: https://arxiv.org/abs/2605.27395
Authors: Julia Barnett,Kimon Kieslich,Natali Helberger,Nicholas Diakopoulos
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This work will be published in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026. 15 pages plus end matter and appendix

Abstract:As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy options has become increasingly challenging for policymakers and researchers. We introduce a methodology for identifying viable policy options to mitigate specified AI harms, helping policymakers and researchers target areas that warrant greater time and resource investment. This method combines participatory evaluation of policies, expert assessment of implementation costs, and an LLM-based assessment of perceived harm mitigation under each policy option. We leverage a genetic algorithm-based simulation study to explore a vast solution space of potential policy combinations, and examine how outcomes change under different weightings of cost, participatory input, and harm mitigation. We find that this method enables exploration of different balances between participatory and expert components, allowing policymakers and researchers to assess how much weight to assign to each. We argue that the diversity of viable policy combinations found by the genetic algorithm could be a useful starting point for deliberation. This method operationalizes existing work on participatory AI by integrating it directly into practical policy development pipelines.

[AI-210] Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?

Link: https://arxiv.org/abs/2605.27391
Authors: Diana Maria Popa,Simona-Vasilica Oprea,Adela Bâra
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the relationship between learning environments and changes in ICT related career aspirations across countries. The analysis uses country-level data from PISA 2018 and 2022, combining indicators of student autonomy, digital skills and teacher support. A mixed-method approach is applied, including descriptive statistics, regression analysis, clustering, latent representation learning (using Variational Autoencoder-VAE), discriminant analysis and probabilistic modeling to capture both observable and latent dimensions of educational readiness. Unlike prior research that treats learning loss, digital skills and career expectations separately, our analysis integrates them within a comparative longitudinal framework. It shifts the focus from short-term post-pandemic effects to the structural capacity of education systems to prepare students for digital and AI-driven labor markets. Results show a global but uneven increase in ICT career aspirations. Digital skills emerge as the strongest and most consistent predictor, while teacher support plays a complementary role. Autonomy shows weaker, context-dependent effects. Educational readiness is multidimensional, and ICT aspirations evolve relatively independently from other career domains.

[AI-211] Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity IJCNN

Link: https://arxiv.org/abs/2605.27385
Authors: Yiran Pang,Zhen Ni,Xiangnan Zhong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025

Abstract:Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent overshadows the others during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that the proposed PON accelerates training and achieves superior performance compared to baseline methods.
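
The running mean-and-variance normalization described in this abstract can be sketched as below. This is an illustration under assumptions, not the paper's implementation: the class name `RunningNormalizer`, the use of Welford's online update, and the `eps` constant are our choices. The key point from the abstract is that each agent keeps its own instance and never shares these statistics during aggregation.

```python
import math


class RunningNormalizer:
    """Per-agent running mean/variance tracker (Welford's online algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps         # guards against division by zero

    def update(self, obs):
        """Fold one raw observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Return the observation scaled by the local statistics."""
        if self.count < 2:
            return list(obs)  # no variance estimate yet; pass through unchanged
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count  # population variance
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out


# Hypothetical usage: one normalizer per federated agent, kept local.
agent_norm = RunningNormalizer(dim=1)
for raw in ([1.0], [2.0], [3.0]):
    agent_norm.update(raw)
scaled = agent_norm.normalize([3.0])
```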

[AI-212] The Computational Boundary of Inference: Capability Internalization Training and the Turing Jump

Link: https://arxiv.org/abs/2605.27381
Authors: Chien-Ping Lu
Affiliations: Unknown
Categories: Computational Complexity (cs.CC); Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure, v2

Abstract:Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capability without clearly distinguishing the underlying computational regimes. This paper gives a formal separation result in classical computability theory that blocks that move under a precise modeling assumption. For an oracle $A$, let $\mathcal{C}(A)=\{B : B \leq_T A\}$ be the corresponding computational layer. We prove that finite internal self-modification remains inside $\mathcal{C}(A)$, while stabilized revision is governed instead by the jump $A'$ via the relativized limit lemma. Together with a local closure versus escape theorem, this yields a clean formal separation between within-layer iteration and ascent to a stronger relative level. The point is not that stronger layers never arise, but that they are not explained by finite repetition inside one already settled layer. The resulting separation gives a computability-theoretic limit on a broad class of recursive-improvement narratives in which repeated internal updating is treated as sufficient for qualitative capability ascent.
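
For readers unfamiliar with the relativized limit lemma that this abstract invokes, its standard textbook form is stated below. The approximation function $g$ is our notation and does not come from the paper: a set $B$ is computable from the jump $A'$ exactly when some $A$-computable guessing procedure changes its mind only finitely often on each input and stabilizes on the correct answer.

```latex
B \leq_T A'
\quad\Longleftrightarrow\quad
\exists\, g \leq_T A,\;
g : \mathbb{N} \times \mathbb{N} \to \{0,1\},\;
\forall x \;\; \lim_{s \to \infty} g(x, s) = B(x)
```

Read this way, "stabilized revision" corresponds to taking the limit over the stage index $s$, whereas any fixed finite stage $g(\cdot, s)$ is itself $A$-computable and so stays within the same layer.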

[AI-213] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

Link: https://arxiv.org/abs/2508.08935
Authors: Ze Tao,Hanxuan Wang,Fujun Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.

[AI-214] Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

Link: https://arxiv.org/abs/2605.28746
Authors: Michael T.M. Emmerich
Affiliations: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.
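
As a concrete reference point for the discrete R2 indicator mentioned in this abstract, the sketch below computes the common weighted-Tchebycheff form for a minimization problem and the deterministic improvement from adding one candidate. It assumes a finite weight set and a known utopian point; the function names and example points are ours. It does not implement the paper's expected (Gaussian) ER2I or the exact integral R2.

```python
def tchebycheff(point, weights, utopian):
    """Weighted Tchebycheff scalarization of one objective vector."""
    return max(w * abs(p - z) for p, w, z in zip(point, weights, utopian))


def r2_indicator(approx_set, weight_vectors, utopian):
    """Discrete R2: average over weights of the best scalarized value (lower is better)."""
    total = 0.0
    for w in weight_vectors:
        total += min(tchebycheff(a, w, utopian) for a in approx_set)
    return total / len(weight_vectors)


def r2_improvement(approx_set, candidate, weight_vectors, utopian):
    """Decrease in R2 obtained by adding a candidate point (never negative)."""
    before = r2_indicator(approx_set, weight_vectors, utopian)
    after = r2_indicator(list(approx_set) + [candidate], weight_vectors, utopian)
    return before - after


# Hypothetical two-objective example with utopian point (0, 0).
front = [(1.0, 3.0), (3.0, 1.0)]
weights = [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]
gain = r2_improvement(front, (1.5, 1.5), weights, (0.0, 0.0))
```

The inner `min` over the approximation set is the scalarization envelope the abstract refers to: a candidate contributes only for those weight vectors where it lowers that envelope.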

[AI-215] Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Link: https://arxiv.org/abs/2605.28693
Authors: Joséphine Raugel,Maximilian Seitzer,Marc Szafraniec,Huy V. Vo,Jérémy Rapin,Patrick Labatut,Piotr Bojanowski,Valentin Wyart,Jean-Rémi King
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 13 pages, 9 figures

Abstract:Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.

[AI-216] Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method

Link: https://arxiv.org/abs/2605.28594
Authors: Baptiste Bernard,Luca Messina,Eiji Kawasaki,Emeric Bourasseau
Affiliations: Unknown
Categories: Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 13 pages, 11 figures, submitted to Physical Chemistry Chemical Physics

Abstract:In this article, we present an improved version of the PULSE method (Partition function Unsupervised Learning Sampling and Evaluation) for estimating the thermodynamic properties of chemically disordered compounds. The aim is to reduce the computational cost of Monte Carlo approaches for this type of material and to demonstrate that this generative tool can estimate thermodynamic properties by sampling and estimating the partition function of the system. To validate this innovative approach, we use the 2D Ising model as a benchmark. We demonstrate that our method accurately reproduces average properties with high precision and efficiency compared to traditional Monte Carlo sampling methods. Our results highlight the efficiency and adaptability of the PULSE method, making it a valuable tool for studying materials for which conventional methods are too inefficient to compute properties affected by chemical disorder at low cost.
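
To make the benchmark quantity concrete, the sketch below computes the exact partition function and mean energy of a tiny 2D Ising lattice by brute-force enumeration. It is only a reference calculation under our own assumptions (open boundaries, zero field, coupling `J = 1`, hypothetical function names) and has nothing to do with how PULSE itself samples or estimates the partition function. Enumeration scales as 2^N, which is exactly why sampling-based estimators are needed for realistic systems.

```python
import itertools
import math


def ising_energy(spins, rows, cols, coupling=1.0):
    """Energy of a spin configuration on an open-boundary square lattice."""
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            s = spins[r * cols + c]
            if c + 1 < cols:  # bond to the right neighbour
                energy -= coupling * s * spins[r * cols + c + 1]
            if r + 1 < rows:  # bond to the neighbour below
                energy -= coupling * s * spins[(r + 1) * cols + c]
    return energy


def exact_partition_function(rows, cols, beta, coupling=1.0):
    """Z = sum over all 2^(rows*cols) configurations of exp(-beta * E)."""
    z = 0.0
    for spins in itertools.product((-1, 1), repeat=rows * cols):
        z += math.exp(-beta * ising_energy(spins, rows, cols, coupling))
    return z


def mean_energy(rows, cols, beta, coupling=1.0):
    """Boltzmann-weighted average energy <E> = sum(E * exp(-beta * E)) / Z."""
    z = 0.0
    weighted = 0.0
    for spins in itertools.product((-1, 1), repeat=rows * cols):
        e = ising_energy(spins, rows, cols, coupling)
        w = math.exp(-beta * e)
        z += w
        weighted += e * w
    return weighted / z
```

For a 2x2 open lattice the four bonds form a ring, so the result can be checked against the closed form `(2*cosh(beta*J))**4 + (2*sinh(beta*J))**4`.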

[AI-217] Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

Link: https://arxiv.org/abs/2605.27967
Authors: Luyang Fang,Yongkai Chen,Jiazhang Cai,Ping Ma,Wenxuan Zhong
Affiliations: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher’s influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model’s learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
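
The abstract does not specify how the entropy-based weighting is computed, so the sketch below is only one plausible reading, labeled as an assumption: each teacher's weight is a softmax over its negative predictive entropy, so more confident teachers count more for a given input. The function names, the `temperature` parameter, and the softmax form are all hypothetical and should not be taken as the MT-BKD formulation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def teacher_weights(teacher_probs, temperature=1.0):
    """Assumed rule: softmax over negative entropies (lower entropy, higher weight)."""
    scores = [-entropy(p) / temperature for p in teacher_probs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_target(teacher_probs, temperature=1.0):
    """Weighted average of the teachers' predictive distributions."""
    weights = teacher_weights(teacher_probs, temperature)
    n_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]


# Hypothetical example: a confident teacher and an uncertain one.
teachers = [[0.9, 0.1], [0.5, 0.5]]
soft_label = mixture_target(teachers)
```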

[AI-218] LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Link: https://arxiv.org/abs/2605.27840
Authors: Zhisheng Zhang,Xiang Li,Yixuan Zhou,Jing Peng,Guoyang Zeng,Zhiyong Wu
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok’s low-dimensional representations can effectively support audio understanding and generation. Our code is provided at this https URL.

[AI-219] On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

Link: https://arxiv.org/abs/2605.27563
Authors: Guangyi Zou,Roman Vershynin
Affiliations: Unknown
Categories: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 4 pages

Abstract:This short note presents a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappings. Discovered by Gemini 3.5 Flash, this result applies to any bounded function under a well-conditioned covariance. We apply this tool to answer a question of Simone Bombari on sign-quantized linear maps Y = \mathrm{sgn}(Wx).

[AI-220] BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

Link: https://arxiv.org/abs/2605.27480
Authors: Tianyao Shi,Yi Ding
Affiliations: Unknown
Categories: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 21 pages, 27 figures, 9 tables

Abstract:Large language model (LLM) serving creates environmental impacts beyond carbon and water, including ecosystem damage through biodiversity-related pathways. We present BIRDS, a framework for Biodiversity Impact of Request-Driven LLM Serving. BIRDS defines request-level functional units, quantifies operational and embodied biodiversity impact, and introduces Quality-Normalized Biodiversity Impact (QNBI) to jointly analyze ecological impact and response quality. Across diverse workloads, models, GPUs, and regions, BIRDS reveals that biodiversity impact accumulates at scale and exposes actionable quality-aware serving tradeoffs.

[AI-221] When prompt perturbations break your A/B test: A valid statistical test for generative surveying

Link: https://arxiv.org/abs/2605.27463
Authors: Hayden Helm,Carey Priebe
Affiliations: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:Generative surveying – where collections of LLM-based personas provide feedback on messages – has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design and conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.
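
To make the mechanics of a permutation test concrete, here is a generic exact sign-flip test on paired differences. It is not the authors' test: it assumes each entry of `diffs` is one persona's score difference between two messages, already averaged over perturbations and replicates, and it does not establish validity under the paper's perturbation model. The function name is hypothetical.

```python
from itertools import product
import numpy as np

def paired_sign_flip_pvalue(diffs):
    """Exact two-sided sign-flip permutation p-value for paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    d = np.asarray(diffs, dtype=float)
    observed = abs(d.mean())
    count = total = 0
    for signs in product((1.0, -1.0), repeat=d.size):
        total += 1
        # Count assignments at least as extreme as the observed statistic.
        if abs((d * np.array(signs)).mean()) >= observed - 1e-12:
            count += 1
    return count / total
```

For five differences that all share a sign, only the all-positive and all-negative assignments are as extreme as the data, giving a p-value of 2/32.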

[AI-222] Quantum Machine Learning-based 6G edge Network: Enabling Adaptive Communication and Model Aggregation

Link: https://arxiv.org/abs/2605.27417
Authors: Wenjing Xiao,Jiatai Yan,Chenglong Shi,Shixin Chen,Miaojiang Chen,Min Chen,Saif Al-Kuwari,Ahmed Farouk
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the advent of sixth-generation (6G) mobile communication technology, vehicle-to-everything (V2X) communication faces unprecedented challenges in communication efficiency, system generalization capabilities, and model collaboration. Conventional machine learning struggles with high-dimensional state spaces, slow convergence, and poor generalization under heterogeneous V2X nodes, rapidly varying channels, and multimodal sensing data in V2X systems. To address these issues, we propose a quantum-enhanced framework for V2X communication and model aggregation that targets efficient, robust, and intelligent transportation in 6G, which includes four modules: the channel-adaptive semantic communication module, the multimodal fusion module, the model transfer module, and the federated aggregation module. Specifically, the channel-adaptive semantic communication module leverages quantum convolutional neural networks (CNN) and quantum distortion metrics to enable efficient transmission and strong generalization across diverse conditions. The multimodal fusion module exploits quantum attention and entanglement to compress features and associate semantics across heterogeneous data. The model transfer module employs quantum reinforcement learning to model decision-making and improve adaptability in dynamic environments. The federated aggregation module integrates quantum tensor decomposition with backpropagation-based corrections to provide privacy preservation with low overhead and to strengthen global model robustness. This work outlines a new paradigm for communication and model collaboration in future 6G intelligent transportation.

[AI-223] Can Quantum Federated Learning Withstand Circuit-Level Backdoors? IJCAI ECAI2026

Link: https://arxiv.org/abs/2605.27416
Authors: Aakar Mathur,Mohammed Ruknuddin,Ashish Gupta
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: Accepted to IJCAI-ECAI 2026

Abstract:Quantum Federated Learning (QFL) inherits the core vulnerability of federated optimization to malicious clients, while also introducing an attack surface from variational circuit training and measurement-driven gradients. This work proposes a novel CircUit-Level backdoor Threat (CULT) model that formalizes four stealthy attacks by exploiting quantum-aware mechanisms, including Grover, Pauli, Bit-flip, and Sign-flip. By enabling malicious clients on both in-training and post-training surfaces, these attacks can critically undermine the learning process. We establish a rigorous theoretical foundation to demonstrate attack stealthiness under standard smoothness assumptions. Experiments on the MNIST and CIFAR-10 datasets with non-IID splits and varying fractions of malicious clients show that even a single malicious client can induce severe accuracy degradation under FedAvg aggregation. While popular defenses, including Krum, Multi-Krum, FoolsGold, FLGuardian, and Mud-HoG, reduce degradation in many regimes, they fail to eliminate worst-case failure cases, where accuracy drops up to 50%. The experimental analysis further reveals that under the CULT model, malicious updates effectively mask their presence by staying close to benign norms, thereby helping attackers evade detection.

[AI-224] Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design

Link: https://arxiv.org/abs/2605.27413
Authors: Chen Wei,Fanding Xu,Minghao Sun,Zhiyuan Liu,Lin Wang,Tianrui Jia,Yihang Zhou,Yang Zhang
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures

Abstract:Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protein co-design requires models that generate sequence-structure compatible proteins under explicit ligand constraints. Although continuous diffusion and flow-based models support ligand-aware design in coordinate or latent spaces, existing discrete diffusion protein language models mainly operate over sequence or structure tokens without direct small-molecule conditioning. We introduce ProtLiD^2, a Protein Ligand-conditioned Discrete Diffusion model for protein sequence-structure co-design. ProtLiD^2 jointly generates amino-acid sequence and discrete structure tokens while incorporating ligand chemical and geometric information through geometry-aware cross-attention. Trained on over one million ligand-protein complexes, ProtLiD^2 extends masked discrete diffusion to ligand-aware functional protein design. We further propose maximum confidence-margin guided ReMask decoding, an inference-time self-correction strategy that retains confident predictions and remasks uncertain tokens. ProtLiD^2 improves global fold confidence over Complexa in whole-protein design, increasing TM-score from 0.672 to 0.802 and pLDDT from 64.55 to 73.00. In pocket co-design, ProtLiD^2 reduces active-site BB-RMSD from 3.46/3.40 Å for FAIR/PocketGen to 1.97 Å, and improves ligand-aware pass rates over PocketGen from 14.86% to 59.73% and from 6.08% to 23.49% under stricter docking thresholds. These results support ligand-conditioned discrete diffusion as an effective token-space framework for functional protein co-design. Code will be available at this https URL.

Machine Learning

[LG-0] Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

Link: https://arxiv.org/abs/2605.28769
Authors: Kevin Y. Li,Asher Trockman,Ananda Theertha Suresh,Ziteng Sun
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.

[LG-1] Principled Algorithms for Optimizing Generalized Metrics in Multi-Label Learning

Link: https://arxiv.org/abs/2605.28767
Authors: Mehryar Mohri,Yutao Zhong
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Many real-world classification tasks require predicting multiple labels per instance, necessitating the optimization of complex evaluation metrics such as the F-measure and Jaccard index. While the Empirical Utility Maximization (EUM) framework is natural for these population-level metrics, existing theoretical results are largely limited to asymptotic Bayes-consistency. In this paper, we develop principled learning algorithms for optimizing a broad class of generalized metrics within the EUM framework, grounded in the stronger notion of H-consistency. Our key contribution is the design of novel surrogate loss functions for multi-label learning that admit provable H-consistency bounds, enabling optimization with non-asymptotic guarantees tailored to the hypothesis class and finite samples. Crucially, we prove these combinatorially formulated surrogates decompose exactly, operating in strictly O(l) time without approximations. Building on this foundation, we introduce MMO (Multi-Label Metric Optimization), a new family of algorithms for optimizing generalized linear-fractional metrics. We validate our approach through extensive experiments, demonstrating robust scalability and superior performance over state-of-the-art continuous baselines on large-scale datasets (MS-COCO, Reuters-21578) in high-sparsity, deep learning regimes. Our results offer both theoretical rigor and practical effectiveness for general multi-label metric optimization.
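
For reference, the two metrics named in the abstract are ratios of expressions that are linear in the confusion counts, which is what "linear-fractional" refers to. The symbols TP, FP, FN (true positives, false positives, false negatives) and the weight β are standard notation introduced here, not taken from the abstract.

```latex
F_\beta = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}},
\qquad
J = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
```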

[LG-2] LLM Zeroth-Order Fine-Tuning is an Inference Workload

Link: https://arxiv.org/abs/2605.28760
Authors: Zelin Li,Caiwen Ding
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 3 tables, including appendix and references

Abstract:Zeroth-order (ZO) fine-tuning is attractive for large language models because it replaces backpropagation with forward objective evaluations. Existing implementations nevertheless execute ZO algorithms inside conventional training loops, even though their dominant work is repeated scoring under nearby parameter states. This creates a workload-runtime mismatch: the algorithm asks for structured inference-style scoring, while the system exposes a sequence of fragmented training-loop steps. We show that LLM ZO fine-tuning is an inference-dominated workload and execute its repeated scoring phase through a serving runtime. On OPT-13B SST-2, the resulting vLLM execution path completes the 20k-step LoZO run in 0.51 estimated training hours versus 4.15 hours for the official LoZO baseline under the matched LoRA-only setting, an 8.13x speedup, while reaching 0.922 final evaluation accuracy and 0.931 final full-validation accuracy. In core-step scaling experiments across OPT-1.3B to OPT-13B, the same runtime reorganization gives 2.34x–7.72x speedups. A MeZO-style high-rank factorized experiment shows that the same runtime paradigm can track a MeZO-like loss trajectory while running up to 2.55x faster. More broadly, representing ZO updates as dynamic adapter states suggests a practical path toward inference-time training, where lightweight adaptation can be scheduled as an inference-like workload rather than as a separate training job.
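
The phrase "replaces backpropagation with forward objective evaluations" can be illustrated with the standard two-point zeroth-order estimator. This is a generic sketch on a NumPy parameter vector, not the paper's vLLM execution path; `zo_step` and its arguments are hypothetical names.

```python
import numpy as np

def zo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One two-point zeroth-order update using only forward evaluations."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(params.shape)        # random perturbation direction
    loss_plus = loss_fn(params + eps * z)        # forward pass 1
    loss_minus = loss_fn(params - eps * z)       # forward pass 2
    g = (loss_plus - loss_minus) / (2.0 * eps)   # directional derivative estimate
    return params - lr * g * z                   # step along the sampled direction
```

Each step costs two scoring calls under nearby parameter states, which is the pattern the abstract describes as inference-like.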

[LG-3] How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures ICRA2026

Link: https://arxiv.org/abs/2605.28726
Authors: Krishnam Gupta
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted at IEEE ICRA 2026 Workshop “From Data to Decisions: VLA Pipelines for Real Robots”, Vienna, June 2026. Non-archival workshop. 5 pages, 2 figures, 22 references

Abstract:We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: this https URL
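
The abstract names direction reversal rate and jerk as black-box monitors but does not define them. The sketch below uses assumed definitions: a reversal is a negative dot product between consecutive action deltas, and jerk is the third finite difference of the action sequence treated as positions. Neither the formulas nor the function names come from SafeContract.

```python
import numpy as np

def direction_reversal_rate(actions):
    """Fraction of steps where the action delta flips direction (assumed definition)."""
    a = np.asarray(actions, dtype=float)              # shape (T, D)
    deltas = np.diff(a, axis=0)                       # shape (T-1, D)
    dots = np.sum(deltas[1:] * deltas[:-1], axis=1)   # consecutive delta alignment
    if dots.size == 0:
        return 0.0
    return float(np.mean(dots < 0))

def mean_jerk(actions, dt=1.0):
    """Mean norm of the third finite difference, treating actions as positions."""
    a = np.asarray(actions, dtype=float)
    jerk = np.diff(a, n=3, axis=0) / dt ** 3
    if jerk.shape[0] == 0:
        return 0.0
    return float(np.mean(np.linalg.norm(jerk, axis=1)))
```

Both functions need only the emitted action sequence, which matches the black-box setting the abstract describes.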

[LG-4] Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models ICML2026

Link: https://arxiv.org/abs/2605.28711
Authors: Jiawei Zhang,Ziyuan Liu,Leon Yan,Zhenyu Xiao,Yuantao Gu
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:The distortion-perception (D-P) tradeoff is a fundamental phenomenon of Bayesian inverse problems, which characterizes the inherent tension between distortion performance and perceptual quality. Enabling flexible traversal of the D-P tradeoff at inference time is crucial for practical applications. Despite the recent success of diffusion models in zero-shot inverse problem solving, efficient and principled strategies for D-P traversal in diffusion-based inverse algorithms remain inadequately characterized. In this paper, we propose a stage-wise framework for realizing D-P traversal using a single diffusion model in zero-shot inverse problems. Our proposed method, termed MAP-RPS, starts with an MAP estimation stage that approximates the MMSE solution and provides a low-distortion initialization, followed by a re-noised posterior sampling stage that progressively improves perceptual quality. We provide theoretical analyses for both stages, establishing the validity and effectiveness of the proposed design. Furthermore, we extend MAP-RPS to the latent space, yielding LMAP-RPS, which enjoys broader applicability by leveraging large-scale pre-trained latent diffusion backbones. Extensive experiments demonstrate that MAP-RPS and LMAP-RPS enable more effective D-P traversal on various tasks, while also exhibiting strong performance as efficient solvers for real-world inverse problems.

[LG-5] Understanding Generalization and Forgetting in In-Context Continual Learning ICML2026

Link: https://arxiv.org/abs/2605.28705
Authors: Guangyu Li,Meng Ding,Lijie Hu
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Abstract:In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce intertask interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias-variance-interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.

[LG-6] Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

Link: https://arxiv.org/abs/2605.28704
Authors: Yeachan Park,Geonho Hwang,Wonyeol Lee,Sejun Park
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of Sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin, under significantly more realistic floating-point execution models than previously known.

[LG-7] History-aware adaptive reduced-order models via incremental singular value decomposition

Link: https://arxiv.org/abs/2605.28684
Authors: Amirpasha Hedayat,Ali Mohaghegh,Laura Balzano,Cheng Huang,Karthik Duraisamy
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph)
Comments: 50 pages, 27 figures, Preprint submitted to Elsevier

Abstract:Reduced-order models (ROMs) can accelerate high-dimensional dynamical simulations, but their accuracy often deteriorates when online dynamics leave the regime represented by offline training data. We develop a projection-based adaptive ROM framework based on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations provide correction snapshots for online basis updates. The intrusive ROMs considered here are fully parameterized by the basis, so each update naturally propagates to reduced operators and hyper-reduction machinery. Through its evolving singular structure, iSVD retains an encoded history of the observed dynamics and is history-aware in this sense. We study the method on three nonlinear problems of increasing complexity: the one-dimensional viscous Burgers equation, the Sod shock tube, and a stiff one-dimensional ten-species rotating detonation engine (RDE). The Burgers problem is used to analyze the method and compare iSVD with alternative basis adaptation rules, showing that history-aware updates outperform instantaneous updates and that iSVD gives the strongest overall performance. The Sod and RDE cases demonstrate that these advantages persist in more challenging compressible-flow settings. For the RDE problem, the iSVD adaptive ROM improves upon the current state-of-the-art Direct adaptive ROM baseline in both predictive accuracy and computational efficiency. A cost analysis shows that the dominant online cost comes from interacting with the full-order model to obtain correction snapshots, while the iSVD update itself is negligible. These results identify iSVD as an effective mechanism for online learning of reduced subspaces and suggest a path toward ROMs that remain predictive over horizons several orders of magnitude longer than their initial training window.
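
The online basis update can be illustrated with a textbook rank-one incremental SVD step that appends one snapshot column to an existing thin SVD. This is a generic sketch and may differ from the iSVD variant used in the paper; `isvd_add_column` and its parameters are hypothetical names.

```python
import numpy as np

def isvd_add_column(U, s, V, c, max_rank=None, tol=1e-10):
    """Update a thin SVD A ~ U @ diag(s) @ V.T after appending column c to A."""
    p = U.T @ c                       # part of c inside the current subspace
    resid = c - U @ p                 # part of c outside the current subspace
    rho = np.linalg.norm(resid)
    r = s.size
    grow = rho > tol                  # add a basis direction only if needed
    # Small core matrix; its SVD rotates the old factors into the new ones.
    K = np.zeros((r + int(grow), r + 1))
    K[:r, :r] = np.diag(s)
    K[:r, r] = p
    U_ext = U
    if grow:
        K[r, r] = rho
        U_ext = np.hstack([U, (resid / rho)[:, None]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    # Extend V with one row for the newly appended column.
    V_ext = np.zeros((V.shape[0] + 1, r + 1))
    V_ext[:-1, :r] = V
    V_ext[-1, r] = 1.0
    U_new, V_new = U_ext @ Uk, V_ext @ Vkt.T
    k = sk.size if max_rank is None else min(max_rank, sk.size)
    return U_new[:, :k], sk[:k], V_new[:, :k]
```

Only the small core matrix is decomposed at each step, which is consistent with the abstract's remark that the update itself is cheap relative to obtaining the correction snapshot.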

[LG-8] Optimal ridge regularization revisited

Link: https://arxiv.org/abs/2605.28679
Authors: Jack Timmermans,Sergio A. Alvarez
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We consider L^2-regularized linear (ridge) regression over a finite data sample X with bounded covariance and linear prediction targets y with additive isotropic noise of finite variance. We present an iterative procedure to compute the optimal regularization strength numerically from the generative parameters in the fixed-X setting and prove its convergence at limited noise levels. Our experimental evaluation over synthetic data shows that the proposed procedure combined with sample-based parameter estimates attains near-optimal random-X generalization across a wide range of sample sizes, aspect ratios, and noise levels, at an added computational cost equivalent to one preliminary ridge regression in the underparameterized regime and two in the overparameterized case.

[LG-9] Optimal Data Acquisition for Reinforcement Learning: A Large Deviations Perspective

Link: https://arxiv.org/abs/2605.28675
Authors: Mingjie Hu,Jian-Qiang Hu,Enlu Zhou
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Data acquisition efficiency is a central challenge in deploying reinforcement learning in business and healthcare operations, where interactions are costly, slow, and often involve humans in the loop. This paper develops a unified large deviations framework for data acquisition in infinite-horizon reinforcement learning. We introduce the exponential decay rate of the policy-selection error probability as a principled efficiency metric and derive a variational characterization of this rate via large deviations theory for Markov chains, yielding a nested optimization problem. Based on this characterization, we formalize two complementary notions of optimality in terms of the optimal solution of the nested problem. Because the resulting program is implicit and generally intractable, we propose a tractable convex relaxation with explicit constraints. We then develop a lazy one-step projected subgradient method to solve the relaxed problem and use its iterates to construct an adaptive data acquisition policy. We prove that the resulting reinforcement learning algorithm is near-robustly optimal under our optimality criterion, up to a constant factor. Finally, we extend the framework to linear function approximation to improve scalability, and numerical experiments support the effectiveness of the proposed approach.

[LG-10] Applications of temporal graph learning for predicting the dynamics of biological systems

Link: https://arxiv.org/abs/2605.28659
Authors: Manuel Dileo,Andrea Sottoriva
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Biological foundation models have shown strong performance in single-cell representation learning by applying transformer architectures directly to gene-expression matrices. However, these approaches predominantly operate in static settings and do not explicitly model the temporal evolution of developmental programs in the cell. Modeling such dynamics is important for understanding how cellular states progressively emerge, differentiate, and reorganize during development or disease progression. In this work-in-progress paper, we investigate an alternative temporal graph-based perspective in which cellular states are represented through pseudotime-resolved gene regulatory networks and modeled as evolving graph structures over persistent gene identities. Starting from single-cell transcriptomic data, we infer pseudotime trajectories, discretize cells into developmental snapshots, reconstruct one gene regulatory network per snapshot, and apply temporal graph neural networks to forecast biological states. We evaluate this framework on two publicly available mouse developmental datasets, erythroid gastrulation and pancreatic endocrinogenesis, considering three complementary tasks: gene-expression forecasting, link prediction, and out-degree centrality prediction. Our results show that graph-based models outperform well-known foundation models such as scGPT and scFoundation, suggesting that explicitly modeling evolving regulatory structure provides useful information beyond static pretrained representations. For link prediction and centrality forecasting, temporal graph learning captures non-trivial regulatory dynamics and enables the identification of temporally important gene hubs. Overall, our findings support temporal graph learning as a promising direction for modeling dynamic biological systems and as a complementary paradigm to current foundation model approaches in single-cell biology.

[LG-11] Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Link: https://arxiv.org/abs/2605.28640
Authors: Xiuying Wei,Caglar Gulcehre
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.

[LG-12] Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection ICML2026

Link: https://arxiv.org/abs/2605.28631
Authors: Jianghao Wu,Jianfei Cai,Weiqiang Wang,Jin Ye,Daniel F. Schmidt,Yasmeen George
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 2 figures. Accepted by ICML 2026

Abstract:Reinforcement learning with verifiable rewards (RLVR) can yield large reasoning gains from very few training instances, yet its strong sensitivity to which instances are used makes data selection a central bottleneck. Most existing selection pipelines rely on training-time optimization signals and/or require access to verifiable rewards or ground-truth answers over large candidate pools, which is costly and often infeasible in specialized domains. We study RLVR data selection in a setting where selection must be performed before any RL training and without labels or reward evaluation on the full pool. We propose SHIFT, a one-shot, training-free selector based solely on inference-time hidden-state dynamics. For each candidate instance, SHIFT runs a single deterministic reasoning rollout and computes a reasoning-induced representation shift (RIRS) as the start-to-end hidden-state delta. SHIFT uses the RIRS magnitude as a lightweight proxy for instance utility and enforces coverage via a quality-weighted farthest-first CoreSet procedure in an RIRS-augmented feature space, producing compact subsets that scale to large unlabeled pools. Across mathematical reasoning and medical QA benchmarks under ultra-low budgets, SHIFT consistently outperforms training-free diversity and difficulty/uncertainty baselines, improving both in-domain accuracy and transfer to harder evaluation settings. Ablations show that RIRS-based coverage and quality-weighting contribute complementary gains, and analyses indicate that RIRS is not explained by simple input/output length statistics. Code is available at this http URL.
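
The selection step can be sketched as a greedy farthest-first traversal in which each candidate's distance to the selected set is scaled by a quality score. The abstract does not give the exact scoring rule, so the product `quality * min_distance` is an assumption, as are the function names; `rirs_magnitude` simply takes the norm of the start-to-end hidden-state delta.

```python
import numpy as np

def rirs_magnitude(h_start, h_end):
    """Norm of the start-to-end hidden-state delta for one rollout."""
    return float(np.linalg.norm(np.asarray(h_end, float) - np.asarray(h_start, float)))

def weighted_farthest_first(features, quality, k):
    """Greedy selection maximizing quality * distance to the selected set (assumed rule)."""
    X = np.asarray(features, dtype=float)
    q = np.asarray(quality, dtype=float)
    selected = [int(np.argmax(q))]                      # seed with the best-quality item
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < min(k, len(X)):
        score = q * min_dist
        score[selected] = -np.inf                       # never pick an item twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        # Track each point's distance to its nearest selected item.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```

With equal quality scores the procedure reduces to plain farthest-first coverage, so the weighting only matters when utility estimates differ across instances.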

[LG-13] When Interpretability Is Unequally Distributed: Fairness in Hybrid Interpretable Models

Link: https://arxiv.org/abs/2605.28626
Authors: Ziba Jabbar Zare,Ulrich Aïvodji,Julien Ferry,Thibaut Vidal
Categories: Machine Learning (cs.LG)

Abstract:Hybrid interpretable models combine a transparent component with a black-box model by assigning some examples to the former and deferring the rest to the latter. While this design enables flexible tradeoffs between accuracy and interpretability, it also raises a distinct procedural fairness concern: some demographic groups may systematically receive interpretable decisions, while others are disproportionately routed to a black box. We formalize this issue as Interpretability Coverage Disparity (ICD), a demographic-parity-style measure applied to the routing decision of hybrid interpretable models. Using tools from predictive multiplicity, we study ICD across four hybrid interpretable learning methods, three standard fairness benchmark datasets, and multiple sensitive attributes. Our experiments reveal substantial ICD in intermediate transparency regimes, where both the interpretable and black-box components are actively used. We further show that simple coverage-disparity constraints can significantly reduce ICD in exact hybrid learning methods, with marginal impact on accuracy and sparsity. In several settings, ICD mitigation also improves standard algorithmic fairness metrics. These results show that hybrid interpretable models should be audited not only for predictive fairness, but also for how they allocate interpretability across individuals and groups.
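
A demographic-parity-style measure on the routing decision might look like the sketch below. The abstract does not give the exact formula for ICD, so the definition used here (the largest gap between groups in the fraction of examples routed to the interpretable component) and the function names are assumptions made for illustration.

```python
def coverage_rate(routed, groups, group):
    """Fraction of a group's examples handled by the interpretable component."""
    members = [r for r, g in zip(routed, groups) if g == group]
    return sum(members) / len(members)

def icd(routed, groups):
    """Assumed ICD: max minus min interpretable-coverage rate across groups."""
    rates = [coverage_rate(routed, groups, g) for g in sorted(set(groups))]
    return max(rates) - min(rates)

# routed[i] is 1 if example i gets an interpretable decision, 0 if deferred
# to the black box (toy data, not from the paper).
routed = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = icd(routed, groups)
```

Here group `a` receives interpretable decisions for 3 of 4 examples and group `b` for 1 of 4, so the assumed disparity is 0.5.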

[LG-14] Random Process Flow Matching: Generative Implicit Representations of Multivariate Random Fields

Link: https://arxiv.org/abs/2605.28625
Authors: Julien Lalanne,David Picard,Lionel Boillot,Lina-María Guayacán-Carrillo,Leon Barens,Jean-Michel Pereira
Categories: Machine Learning (cs.LG)

Abstract:Generative modeling provides a powerful framework for learning data distributions. These models initially relied on probabilistic methods such as Gaussian Processes (GP) for uncertainty-aware predictions and shifted towards larger trainable models to learn more complex distributions. In this work, we introduce Random Process (RP) Flow, a Flow Matching-based framework that represents the vector field as a neural implicit function. Unlike modern generative methods, our setting involves a single observed field, from which only sparse measurements are available. RP Flow uses Random Fourier Features to learn an implicit signal representation that can be queried at any arbitrary location from a limited set of observations, while encoding uncertainty through ensemble sampling. We propose constructing a Bayesian posterior by GP regression in the source space to generate high-quality samples. Our empirical results demonstrate that this framework generates realistic samples along with calibrated uncertainty estimates, even under challenging conditions such as high frequency, high sparsity, or high dimensionality. These findings position RP Flow as a milestone towards generative models for reconstruction tasks where data is scarce and uncertainty must remain traceable.

[LG-15] Learning High-Dimensional Parity Functions with Product Networks using Gradient Descent

Link: https://arxiv.org/abs/2605.28612
Authors: Guillaume Larue(1),Louis-Adrien Dufrène(1),Quentin Lampin(1),Hadi Ghauch(2),Ghaya Rekaya(2) ((1) Orange Research, Meylan, France (2) Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France)
Categories: Machine Learning (cs.LG)

Abstract:Parity functions are fundamental Boolean operations with critical applications across machine learning, cryptography, and error correction. Yet, learning high-dimensional parity functions poses significant challenges: in a general setting, standard neural network architectures typically require exponential sample complexity, making gradient-based optimization intractable for a large number of inputs $N$. We demonstrate that compact product-based neural architectures combined with stochastic data sparsity (Bernoulli inputs with $p_e \leq 1/N$) and appropriate hyperparameter choice enable efficient parity learning, with theoretical guarantees of convergence. Experiments validate our theory across dimensions up to $N = 100{,}000$, with empirical evidence showing optimal hyperparameter choices for $p_e$ and learning rate $\alpha$, as well as polynomial complexity scaling laws. This work establishes fundamental connections between architectural inductive bias and data sparsity, opening new possibilities for neural arithmetic, structured reasoning, binary neural networks, and machine learning applied to automated protocol discovery.

[LG-16] Transformers Provably Learn to Internalize Chain-of-Thought

Link: https://arxiv.org/abs/2605.28600
Authors: Yixiao Huang,Hanlin Zhu,Zixuan Wang,Jiantao Jiao,Stuart Russell,Somayeh Sojoudi,Song Mei
Categories: Machine Learning (cs.LG)

Abstract:Chain-of-Thought (CoT) prompting substantially improves the sample efficiency of transformers, reducing the complexity of tasks like parity learning from exponential to polynomial in the input length. However, generating explicit reasoning steps at inference is computationally expensive. Implicit Chain-of-Thought (ICoT) has emerged as a promising empirical remedy that trains models to internalize intermediate steps within their hidden states, but its theoretical foundations remain poorly understood. We give the first theoretical analysis of ICoT, proving that an $L$-layer transformer trained under our proposed Log-ICoT curriculum learns $k$-parity with $\mathsf{poly}(n)$ samples and $L = \log_2 k$ training stages. This matches the sample efficiency of explicit CoT while eliminating its inference overhead, and extends prior one-layer parity guarantees to multi-layer architectures. Compared to standard ICoT, which removes thinking tokens one at a time, Log-ICoT removes them in geometric chunks, reducing the number of stages from linear in $k$ to logarithmic. Experiments on multi-layer transformers confirm the theory and visualize how reasoning is progressively absorbed into deeper layers.

[LG-17] PLS in the Mirror of Self-Attention

Link: https://arxiv.org/abs/2605.28592
Authors: Jiangsheng (Jason) You
Categories: Machine Learning (cs.LG)

Abstract:This note provides an interesting observation on casting partial least squares (PLS) as a linearized self-attention so that PLS may be studied within the neural network paradigm. On the other hand, the dimensionality reduction and selection of predictors in PLS may indicate that self-attention includes a certain degree of dimensionality normalization toward improved learning.

[LG-18] Thinned Mean Field Langevin Dynamics

Link: https://arxiv.org/abs/2605.28589
Authors: Zonghao Chen,Heishiro Kanagawa,François-Xavier Briol,Chris J. Oates,Lester Mackey
Categories: Machine Learning (cs.LG)

Abstract:Several important learning tasks can be formulated as minimizing an entropy-regularized objective over an appropriate space of probability distributions. Mean-field Langevin dynamics (MFLD) facilitate computation in this general context, casting the minimizer as the invariant distribution of a McKean–Vlasov process, which can be numerically discretized using $N$ particles and thus simulated. However, simulating this interacting particle system has computational complexity of order $N^2$. Motivated by recent research into kernel thinning, we propose KT-MFLD, in which each particle interacts only with a thinned particle coreset of size $\mathcal{O}(N^{1/2})$. KT-MFLD thus reduces the computational complexity to order $N^{3/2}$ while, under mild regularity conditions, achieving the same convergence guarantees (up to logarithmic factors) as MFLD. Our theoretical analysis is empirically confirmed on tasks including the training of student-teacher neural networks, quantization with maximum mean discrepancy, and computation of predictively-oriented posteriors in a post-Bayesian framework.

[LG-19] Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

Link: https://arxiv.org/abs/2605.28585
Authors: Kristi Topollai,Allan Ma,Tolga Dimlioglu,Sui Jiet Tay,Anna Choromanska
Categories: Machine Learning (cs.LG)

Abstract:Communication-efficient distributed optimizers such as DiLoCo reduce synchronization costs by letting workers perform many local updates before aggregating their progress with an outer momentum optimizer. Recent theory suggests that the outer optimizer acts on an effective spectrum induced by the inner optimization loop, and that the choice of outer momentum controls how progress from local updates is accumulated across communication rounds. We study periodic restarting of the outer momentum as a simple complementary mechanism for controlling this outer memory. In a linearized squared-loss model where prediction-space residuals evolve under the empirical NTK, we derive a mode-wise restart contraction showing that resets exploit phase cancellation by discarding stale momentum while preserving inner-loop progress. Toy experiments verify the predicted contraction behavior, and language-model pretraining experiments show that periodic restarts widen the stable range of outer learning rates and momentum values across communication periods.
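
The restart mechanism in this abstract can be illustrated with a toy loop. This is a rough sketch under simplifying assumptions that are not from the paper: a single worker, a one-dimensional quadratic loss, plain gradient descent as the inner optimizer, and heavy-ball momentum as a stand-in for the outer optimizer. All names and hyperparameter values are hypothetical.

```python
def inner_steps(w, curvature, lr, steps):
    """Local gradient descent on f(w) = 0.5 * curvature * w**2."""
    for _ in range(steps):
        w -= lr * curvature * w
    return w

def run_outer(w0, rounds, outer_lr, beta, restart_every=None,
              curvature=1.0, inner_lr=0.1, inner_n=5):
    """Outer momentum loop with optional periodic momentum restarts."""
    w, m = w0, 0.0
    history = []
    for t in range(rounds):
        if restart_every and t > 0 and t % restart_every == 0:
            m = 0.0  # drop stale outer momentum; the weights w are kept
        # Pseudo-gradient: displacement produced by the inner loop.
        delta = w - inner_steps(w, curvature, inner_lr, inner_n)
        m = beta * m + delta
        w -= outer_lr * m
        history.append(w)
    return history

with_restart = run_outer(1.0, rounds=10, outer_lr=0.7, beta=0.9, restart_every=1)
no_momentum = run_outer(1.0, rounds=10, outer_lr=0.7, beta=0.0)
```

Restarting every round discards all accumulated momentum, so `with_restart` coincides with the momentum-free run; longer restart periods interpolate between that and ordinary outer momentum. The sketch only shows where the reset happens and makes no claim about the stability results reported in the paper.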

[LG-20] A Generalized Tikhonov Layer for Interpretable-by-design Graph Neural Networks

Link: https://arxiv.org/abs/2605.28578
Authors: Nicolas Tremblay,Benjamin Ricaud,Filippo Maria Bianchi
Categories: Machine Learning (cs.LG)

Abstract:We propose the Tikhonov layer, a graph neural network layer that is interpretable by design: once trained, its learned parameters directly reveal which node features and which aspects of the graph topology were leveraged for prediction. In practice, the layer’s propagation matrix takes the closed form $R = (p(L)+Q)^{-1} Q$, where $L$ is the normalized graph Laplacian, $Q = \mathrm{diag}(q_1,\dots,q_n)$ a learnable diagonal matrix of positive node-importance scores, and $p(\cdot)$ a learnable polynomial. For any input feature $x$, the layer output $Rx$ is the exact minimizer of a generalized graph Tikhonov problem that trades off node-level data fidelity against a topology-driven regularization penalty. The learned pair $\{q_i, p\}$ constitutes a built-in explanation: large $q_i$ indicates that node $i$'s own features drive the prediction, while small $q_i$ signals reliance on the local graph topology; the shape of $p$ reveals whether homophily, heterophily, or a band-pass response is exploited. Expressivity is preserved by routing complexity through a dedicated, arbitrarily deep Q-network that produces the importance scores, while the Tikhonov layer itself remains transparent. We prove that distinct node-importance matrices yield distinct propagation operators, structurally coupling the explanation to the computation. Additionally, the Tikhonov layer provides, in a single layer, a global receptive field, mitigating both oversmoothing and oversquashing. Experiments on standard graph classification benchmarks confirm that the model matches (and sometimes outperforms) opaque baselines while producing interpretable and faithful explanations.
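
The abstract states the closed form but not the objective it minimizes. One objective consistent with that closed form is sketched below, assuming $p(L)$ is symmetric positive semidefinite and $Q$ has strictly positive diagonal entries, so that $p(L)+Q$ is invertible; the paper's exact formulation may differ.

```latex
\begin{aligned}
\hat{y} &= \arg\min_{y}\; (y - x)^\top Q\,(y - x) + y^\top p(L)\, y \\
0 &= 2\,Q(\hat{y} - x) + 2\,p(L)\,\hat{y} \\
\hat{y} &= \bigl(p(L) + Q\bigr)^{-1} Q\, x = R\,x
\end{aligned}
```

The first term is the node-level data fidelity weighted by $q_i$, and the second is the topology-driven penalty. A large $q_i$ pins $\hat{y}_i$ close to $x_i$, which matches the interpretation given in the abstract.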

[LG-21] High Performance Low Reliability: Uncertainty Benchmarking for Tabular Foundation Models

Link: https://arxiv.org/abs/2605.28554
Authors: José Lucas De Melo Costa,Fabrice Popineau,Arpad Rimmel,Bich-Liên Doan
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, 2 tables. Accepted at ESANN 2026 (European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning), 22-24 April 2026, Bruges (Belgium)

Abstract:Recent Tabular Foundation Models (TFMs) have demonstrated state-of-the-art predictive performance, often surpassing Gradient-Boosted Decision Trees (GBDTs). However, the trustworthiness of these models, particularly their uncertainty quantification, has been largely overlooked. We investigate this gap through an extensive study comparing TFMs, GBDTs, and classical baselines on the 112 datasets of the TALENT benchmark. Our results reveal a performance-uncertainty trade-off: although TFMs achieve the highest predictive performance, measured by AUC, they exhibit lower conditional coverage under conformal prediction, measured by SSCS, compared to GBDTs. Complementary experiments on synthetic datasets further characterize the regimes in which this effect intensifies. We conclude that while TFMs advance predictive frontiers, achieving well-calibrated uncertainty remains a major open challenge for their reliable adoption. Code is available at: this https URL

[LG-22] SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints

Link: https://arxiv.org/abs/2605.28549
Authors: Yantong Wei,Kaihong Huang,Hainan Pan,Jiawei Luo,Jiawei Zhou,Ziyan Mai,Zhiwen Zeng,Yaonan Wang,Huimin Lu
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:The pursuit of humanoid athletic sprints is hindered by a scarcity of humanoid-viable kinematic reference data and the inability of existing frameworks to maintain stability during sprints. To overcome these limitations, we introduce SPRINT, a novel framework driven by efficient, frequency-adaptive spectral priors. By characterizing the fundamental periodicity of human locomotion in the frequency domain using a reference library of five discrete motion sequences, these priors generate kinematically feasible joint trajectories across a broad velocity spectrum, successfully extrapolating to speeds that exceed the reference distribution. Guided by these pretrained priors, the SPRINT policy achieves zero-shot sim-to-real transfer in field experiments on the Unitree G1 platform, reaching a peak sprinting velocity of 6 m/s and demonstrating seamless gait transitions while preserving biomimetic naturalness. Ultimately, this work establishes frequency-adaptive spectral priors as a highly data-efficient foundation for humanoid athletic sprints. The project page is available at this https URL.

[LG-23] Semi-Supervised Hypothesis Testing by Betting on Predictions

Link: https://arxiv.org/abs/2605.28533
Authors: Yaniv Tenzer,Elad Tolochinsky,Yaniv Romano
Categories: Machine Learning (cs.LG)

Abstract:We introduce a testing-by-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of $(X,Y)$, and additional unlabeled samples from the marginal of $X$, we ask how unlabeled data can be used to hypothesize about the distribution of $Y$, and the conditional distribution of $Y \mid X$. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions – label shift or concept shift – we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language model evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between $X$ and $Y$.

[LG-24] Stabilizing distribution-free probabilistic forecasts

Link: https://arxiv.org/abs/2605.28531
Authors: Jente Van Belle,Honglin Wen,Wouter Verbeke,Pierre Pinson
Categories: Machine Learning (cs.LG)

Abstract:Multi-step-ahead forecasts are often updated as new observations become available, since shorter forecast horizons typically improve forecast quality. However, such improvements come at the cost of forecast instability, i.e., variability in forecasts for the same target period. This instability can trigger costly changes to plans formulated based on the forecasts and may erode trust in the forecasting system. In this work, we integrate forecast stability alongside forecast quality into the training of distribution-free probabilistic time-series forecasting models, allowing us to control this trade-off. We propose a method for generating stabilized forecasted conditional quantile functions using regression splines parameterized by a neural network. This approach enables joint optimization of quality and stability, as it allows us to directly penalize dissimilarities arising from forecast updates. Furthermore, it allows assigning varying importance to stabilizing different parts of the forecast distributions (e.g., central parts vs. tails) to focus on the parts most relevant for the intended downstream use (e.g., the upper tail for inventory management). We empirically evaluate the proposed method on two datasets with different statistical properties and show that it can effectively reduce forecast instability without a substantial loss in forecast quality, and that it can target stabilization effort toward specific parts of the forecast distributions.

[LG-25] Universal Time Series Generation with Neural Controlled Differential Equations

Link: https://arxiv.org/abs/2605.28507
Authors: Torben Berndt,Elyes Farjallah,Leif Seute,Raeid Saqur,Benjamin Walker,Jan Stühmer
Categories: Machine Learning (cs.LG)

Abstract:Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

[LG-26] Fitting Unknown Number of Hyperplanes with Manifold Optimization

Link: https://arxiv.org/abs/2605.28501
Authors: Zhiqin Cheng,Yu Zhan,Mingjin Zhang,Lingbo Liu,Liang Lin
Categories: Machine Learning (cs.LG)

Abstract:Fitting an unknown number of hyperplanes to data is a fundamental yet challenging problem in machine learning, characterized by its non-convexity, non-differentiability, and unknown model order. Existing approaches often struggle with local optima or lack geometric consistency. To address these limitations, we propose a novel framework based on Manifold Optimization. We reformulate the problem as an unsupervised learning task on the unit sphere manifold $\mathcal{S}^{\mathbf{dim}-1}$. This formulation effectively handles the non-convex constraints and linearizes the distance measurement, rendering the gradient descent tractable. We propose a Two-Stage Manifold Optimization algorithm. In Phase I, we employ a Riemannian Expectation-Maximization process with a heavy-tailed kernel to robustly estimate posterior probabilities, effectively resolving the ambiguities of point distribution between intersecting hyperplanes. In Phase II, upon convergence of the soft estimates, the probabilistic weights degenerate into hard matching, generating a precise local optimum that strictly satisfies the geometric definition. Furthermore, we introduce a projected density estimation strategy for initialization to facilitate global convergence by significantly reducing the feature description space and search complexity. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both geometric accuracy and robustness.

[LG-27] Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

Link: https://arxiv.org/abs/2605.28467
Authors: Avidan Shah,Jannik Brinkmann,Rico Angell
Categories: Machine Learning (cs.LG)

Abstract:As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior on clean prompts and adversarial rewrites, and evaluate its two main variants, output-level (BCT) and activation-level (ACT), across five reasoning models. We formulate both methods as a prompt injection defense and find ACT to be competitive with other training-based defenses while requiring only self-supervised pairs of clean and wrapped prompts. Our experiments also generalize both techniques within the jailbreak setting, demonstrating that ACT remains more robust to adaptive attacks. We also provide mechanistic evidence that ACT’s defense against jailbreaks is encoded as a roughly linear shift in activation space at the assistant-turn boundary. After ACT training, we can recover a single steering direction that controls refusal on reasoning models with minimal effect on benign inputs. We find that ACT remains robust even when the model’s chain-of-thought is replaced with a compliant trace from the undefended base model, pivoting to refuse prefilled jailbreaks. Together, these results suggest that supervising internal representations is a surprisingly effective and interpretable approach to various forms of safety training in reasoning models.

[LG-28] Bilinear Coordinate Alignment for Training-Free Task-Vector Transfer

Link: https://arxiv.org/abs/2605.28444
Authors: Jungyong Son,Jinwook Jung,Minhee Park,Sungyong Baik
Categories: Machine Learning (cs.LG)

Abstract:Fine-tuning large-scale pre-trained models is a prevalent paradigm for adapting general representations to specialized tasks. However, when a new version of a pre-trained model becomes available, expertise acquired through fine-tuning cannot be directly reused because it is tied to the parameterization of the original model, requiring another costly fine-tuning. To address this inefficiency, recent work uses task vectors, defined as the parameter difference between a fine-tuned model and its base model, to transfer expertise across models. While existing methods bridge disparate models by matching activations or gradients, a significant performance gap remains relative to direct fine-tuning, suggesting that these partial correspondences are insufficient. In this work, instead of viewing a task vector merely as a parameter offset, we revisit the formation of task vectors and show that they can be derived as accumulated bilinear interactions between input-side activations and output-side gradients. Motivated by this observation, we formulate task-vector transfer as a dual-space alignment problem and propose BiCo, a training-free framework for transferring task vectors through Bilinear Coordinate alignment. BiCo estimates orthogonal Procrustes mappings in both spaces using a single forward-backward pass on a small calibration set, without any parameter update. Across extensive computer vision and natural language processing benchmarks, BiCo consistently outperforms existing transfer methods across models that differ in width, depth, and pre-training configuration.

[LG-29] Latent Diffusion for Missing Data

Link: https://arxiv.org/abs/2605.28427
Authors: Alberte Heering Estad,Ignacio Peis,Jes Frellsen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Diffusion models have emerged as powerful generative approaches for missing-data imputation, yet most existing methods operate directly in data space and degrade when training data are heavily incomplete. We investigate whether shifting diffusion to a learned latent representation improves robustness under missing-completely-at-random (MCAR) corruption. To this end, we propose a two-stage framework: a robust VAE-based imputer first learns compact semantic features from incomplete observations, and a diffusion model is then trained in the resulting latent space. Across training missing rates, we perform a controlled comparison against pixel-space diffusion models under the same incomplete-data setting. The latent diffusion model maintains high sample quality and remains stable up to 50% missingness, while pixel-space diffusion degrades progressively as missingness increases. For downstream imputation, latent diffusion also achieves consistently better performance than pixel-space diffusion. These findings indicate that latent-space modeling mitigates artifact amplification from zero-imputed inputs and provides a more robust generative prior for incomplete-data learning. Overall, our results support latent diffusion as a strong and practically useful alternative to pixel-space diffusion for missing-data problems.

[LG-30] Conveyance: A Versatile Framework for Learning in Structured Class Spaces

Link: https://arxiv.org/abs/2605.28420
Authors: Yasser Taha,Grégoire Montavon,Nils Körber
Categories: Machine Learning (cs.LG)

Abstract:While machine learning (ML) architectures have evolved rapidly to account for complex data, loss functions like cross-entropy remain mostly structure-agnostic in many real-world applications. However, the 'class-symmetric' nature of these standard losses fundamentally limits the ability of ML models to exploit structural relationships between classes, particularly when facing structured noise. We propose Conveyance, a new classification approach and associated loss function tailored to structured class spaces. It allows users to encode graph-like relations between classes without having to define complex joint distributions or manually tune utility this http URL, our loss function operates by maximizing two separate margins over distinct class partitions, while preserving formal properties such as monotonicity and partial convexity. We demonstrate the versatility and effectiveness of our method by applying it to hierarchical classification, ordinal regression, and multiple instance learning. Across these tasks, Conveyance either matches or exceeds the performance of specialized baselines, thereby offering a unified solution for structured class spaces.

[LG-31] Revisiting Metafeatures to Explain Model Differences on Tabular Data

Link: https://arxiv.org/abs/2605.28418
Authors: Markus Herre,Andrej Tschalzev,Sascha Marton,Christian Bartelt
Categories: Machine Learning (cs.LG)

Abstract:With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps between model families on tabular prediction tasks. Using the TabArena benchmark results, we analyze dataset-level performance gaps and relate them to model-agnostic dataset descriptors. After strict statistical tests with false discovery control, we find that (1) for neural network vs. tree gaps, no meta-feature survives false discovery control, (2) for non-foundation vs. foundation model gaps, one association is robust but does not generalize when tested in leave-one-dataset-out prediction, and (3) for TabICLv2 vs. TabPFN-2.6, one robust association also improves held-out prediction. Furthermore, we conduct a leave-one-dataset-out analysis and find that meta-feature predictors fail to improve meaningfully over a simple baseline. Overall, our results show the heterogeneity of tabular datasets and that global meta-feature approaches are not robust enough to offer explanations on the 51 TabArena datasets.

[LG-32] Tactile-Proprioceptive Sensor Fusion for Contact Wrench Estimation in Whole-Body Physical Human-Robot Interaction ICRA

Link: https://arxiv.org/abs/2605.28412
Authors: Junha Min,Junghyeon Ma,Jiwung Kwon,Sunggyu Bae,Joohyung Kim,Kyungseo Park
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract:Direct physical guidance is a natural means of teaching and interacting with robots, and robotic skins make a key contribution by enabling sensitive contact sensing and localization. This paper presents a tactile-proprioceptive sensor fusion framework for natural physical human-robot interaction. Tactile cues from pneumatic skin pads serve as contact indicators that bypass the ambiguity between frictional residues and applied external forces, enabling highly sensitive contact detection without explicit friction identification. We fuse these cues with motor-current-based proprioception to reconstruct multi-axis contact forces on the robot surface. To maintain accuracy during motion, we employ a temporal convolutional network (TCN) to mitigate friction hysteresis during stick-slip transitions, reducing uncertainty at contact onset and yielding smooth, responsive guidance. We validate the approach on a skin-integrated robot arm: (i) multi-axis forces are reconstructed in stationary contacts, and (ii) simultaneous force estimation and kinesthetic teaching are demonstrated. Results indicate improved sensitivity and responsiveness across diverse contact conditions compared with tactile-only and proprioceptive-only baselines, supporting tactile-proprioceptive fusion as a reliable pathway to safe, intuitive physical human-robot interaction.

[LG-33] Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

Link: https://arxiv.org/abs/2605.28384
Authors: Alan Ferrari
Categories: Machine Learning (cs.LG)

Abstract:Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta-Attention, a framework that dynamically routes each token to the most appropriate attention strategy – full softmax attention, linear (kernel) attention, or sliding-window local attention – via a Bayesian Meta-Controller. Unlike prior routing approaches that use deterministic or prior-free learned routing, the Meta-Controller treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention-mechanism cost. This design produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller’s learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non-Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward-pass correctness, posterior diversity, and a controlled ablation against a prior-free baseline. Code available at: this https URL

[LG-34] Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning ICRA2026

Link: https://arxiv.org/abs/2605.28372
Authors: Meraj Mammadov,Pedro Zuidberg Dos Martires,Johannes Andreas Stork
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 6 pages, 5 figures. Accepted as an oral presentation at the RL4IL Workshop at ICRA 2026

Abstract:Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.

[LG-35] LEIA: Learned Environment for Interactive Architected Materials

Link: https://arxiv.org/abs/2605.28368
Authors: Haiqian Yang, Yuan Cao, Markus J. Buehler
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
Comments:

Click to view abstract

Abstract:World models have enabled interactive exploration of game environments and robotic manipulation, but physical engineering remains beyond their reach: real materials exhibit nonlinear constitutive laws, carry history-dependent internal state, undergo inertial dynamics, and may possess hierarchical structures spanning multiple length scales. We present LEIA (Learned Environment for Interactive Architected materials), a world model that lets engineers apply boundary conditions step by step and observe the resulting deformation and stress fields in real time. LEIA handles large three-dimensional unstructured meshes and generates autoregressive responses to user-specified loading. We introduce MicroPlate, a benchmark of architected plates spanning two regimes of microstructure modeling: architected lattices that resolve microstructure explicitly through three-dimensional geometry, and a homogeneous plate where microstructural change is modeled implicitly through internal degrees of freedom. MicroPlate is used to assess LEIA alongside four baseline methods across both regimes. Finally, we demonstrate that LEIA enables efficient candidate generation and ranking for fast surrogate-guided search for de novo designs of architected materials, with stress-accurate candidate ranking validated by finite element ground truth.

[LG-36] Detecting Diffusion-Generated Time Series Under Generator Shift

Link: https://arxiv.org/abs/2605.28355
Authors: Zhi Wen Soi, Aditya Shankar, Gert Lek, Abele Mălan, Daniel Neider, Jian-Jia Chen, Lydia Chen
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The boundary between real and diffusion-generated time series is becoming increasingly difficult to draw, yet detection in this domain remains underexplored, especially when the generator is unknown. We compare white-box detection, which requires access to the generator, against black-box detection, which operates on the raw signal alone. The white-box approach, a reconstruction-based detector adapted from the image domain, works well in-distribution but breaks down under generator shift: reconstruction-based detection in images succeeds because large generic generators provide a near-universal reconstruction prior, and no analogous generator exists for time series. In contrast, a simple off-the-shelf classifier used as a black-box detector performs remarkably well, achieving an average F1 of 79.2, a 22.1% relative improvement over the white-box approach, and a TPR@1%FPR of 57.2. Diffusion-generated time series detection is therefore not a direct transfer of the image domain problem. This work provides the first systematic exploration of white-box and black-box detection for diffusion-generated time series. We close by identifying several open and promising directions.
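
The abstract reports a TPR@1%FPR figure without spelling out how the metric is computed. The sketch below shows one common way to compute it, assuming a detector that outputs a score where higher means "more likely generated". The function name `tpr_at_fpr` and the strict-threshold convention are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tpr_at_fpr(scores_real, scores_generated, target_fpr=0.01):
    """Return (TPR, threshold) at a threshold whose FPR on real data is <= target_fpr.

    Assumption: higher score means "more likely generated", and a sample is
    flagged as generated when its score is strictly above the threshold.
    """
    scores_real = np.asarray(scores_real, dtype=float)
    scores_generated = np.asarray(scores_generated, dtype=float)

    # Real scores in descending order; at most k of them may be flagged.
    sorted_real = np.sort(scores_real)[::-1]
    k = int(np.floor(target_fpr * len(sorted_real)))

    # Using the (k+1)-th largest real score as a strict threshold flags
    # at most k real samples, even when scores are tied.
    threshold = sorted_real[k] if k < len(sorted_real) else -np.inf

    tpr = float(np.mean(scores_generated > threshold))
    return tpr, float(threshold)
```

The threshold is chosen on real samples only, so the reported TPR says how many generated series are caught while the false-alarm rate on real series stays at or below the target.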

[LG-37] Dimensionality Reduction for Robust Federated Learning: A Theoretical Analysis and Convergence Guarantee

Link: https://arxiv.org/abs/2605.28335
Authors: Shiyuan Zuo, Jiashuo Li, Rongfei Fan, Han Hu, Jie Xu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but it is highly vulnerable to Byzantine attacks. Existing robust approaches can neutralize these threats but incur substantial computational overhead during high-dimensional gradient aggregation, an overhead that scales poorly with model size and increasingly dominates the training cost as modern models grow larger. To address this computational bottleneck, we propose Projected Dimensionality Reduction (PDR), a universal acceleration framework for vector-level distance-based robust aggregators, which performs robust aggregation by compressing gradients into a drastically smaller subspace via sparse random projection to efficiently compute reliability weights. This approach reduces the server computational complexity to an optimal $\mathcal{O}(Mp)$, where $M$ is the number of clients and $p$ is the model dimension, matching the theoretical lower bound required merely to read the gradients. We establish convergence guarantees under standard FL assumptions in prior Byzantine-robust FL analyses. By leveraging the Subspace Embedding Theorem, we show that PDR achieves optimal convergence rates of $\mathcal{O}(1/\sqrt{T})$ for non-convex functions and $\mathcal{O}(1/T)$ for strongly convex functions, where $T$ denotes the number of iterations. Crucially, we mathematically demonstrate that this massive acceleration comes almost for free, merely inflating the inherent Byzantine error floor by a bounded, tunable factor of $\frac{1+\epsilon}{1-\epsilon}$. Experimental results on benchmark datasets confirm that integrating PDR with existing aggregators yields orders of magnitude speedups in time efficiency while maintaining highly competitive convergence performance.
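
To make the idea in the abstract concrete, here is a minimal sketch of "project first, compute reliability weights in the small space, then aggregate the full gradients". It is not the authors' implementation: the Achlioptas-style sparse projection, the Krum-like scoring rule, and all names (`sparse_projection`, `projected_robust_mean`, `n_byzantine`) are assumptions chosen for illustration. For simplicity the projection matrix is stored densely, so this sketch does not reproduce the claimed $\mathcal{O}(Mp)$ server cost.

```python
import numpy as np

def sparse_projection(p, k, s=3, rng=None):
    """Sparse random projection matrix of shape (p, k).

    Entries are +/- sqrt(s/k) with probability 1/(2s) each and 0 otherwise
    (assumed construction; stored densely here for brevity).
    """
    rng = np.random.default_rng(rng)
    signs = rng.choice([-1.0, 0.0, 1.0], size=(p, k),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s / k) * signs

def projected_robust_mean(grads, k=64, n_byzantine=1, rng=0):
    """Aggregate client gradients using weights computed on projected gradients."""
    grads = np.asarray(grads, dtype=float)
    M, p = grads.shape

    # 1) Compress every gradient from dimension p down to k.
    Z = grads @ sparse_projection(p, k, rng=rng)

    # 2) Pairwise squared distances in the compressed space.
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    np.fill_diagonal(d2, 0.0)

    # 3) Krum-like score (hypothetical choice): sum of distances to the
    #    closest M - n_byzantine - 1 other clients; column 0 is the self-distance.
    n_close = M - n_byzantine - 1
    scores = np.sort(d2, axis=1)[:, 1:n_close + 1].sum(axis=1)

    # 4) Uniform weights on the most central clients, zero on the rest.
    keep = np.argsort(scores)[: M - n_byzantine]
    weights = np.zeros(M)
    weights[keep] = 1.0 / len(keep)

    # 5) The weights are applied to the original full-dimensional gradients.
    return weights @ grads, weights
```

Only the distance computations run in the compressed space; the final weighted average still touches every coordinate of every gradient once.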

[LG-38] Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization GECCO2026

Link: https://arxiv.org/abs/2605.28309
Authors: Sara Gjorgjieva, Eva Tuba, Tome Eftimov
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Preprint version of a poster accepted at the Genetic and Evolutionary Computation Conference 2026 (GECCO 2026)

Click to view abstract

Abstract:In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring unnecessary computational cost. We study a learning-based extension of a recent empirical online heuristic that adaptively estimates the required number of runs using outlier handling and skewness-based symmetry checks. Using annotated outcomes from 132,000 Nevergrad runs on COCO (24 problems in 20 dimensions, 10 instances each, 11 optimizers), we train classifiers on 23 statistical, energy-free, and shape and stability features to predict whether a run-number estimate is reliable, prioritizing detection of incorrect estimates via minority-class recall. We evaluate reliability prediction using a within-configuration learning setup, where models are trained and tested on data sharing the same optimizer. The results show that run-number reliability can be learned in a within-configuration scenario, enabling detection of unreliable estimates with high minority-class recall, although performance remains limited by the restricted data diversity within fixed configurations.

[LG-39] Compositional Generalization in Autoregressive Models via Logit Composition

Link: https://arxiv.org/abs/2605.28304
Authors: Aakash Kumar, Maria Sofia Bucarelli, Emanuele Natale
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.

[LG-40] T-GINEE: A Tensor-Based Multilayer Graph Representation Learning ICML2026

Link: https://arxiv.org/abs/2605.28300
Authors: Maolin Wang, Ziting Mai, Xuhui Chen, Zhiqi Li, Tianshuo Wei, Yutian Xiao, Wenlin Zhang, Wanyu Wang, Ruocheng Guo, Haoxuan Li, Zenglin Xu, Xiangyu Zhao
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICML 2026

Click to view abstract

Abstract:Traditional network analysis focuses on single-layer networks, whereas real-world systems often form multilayer networks with multiple relationship types. However, existing methods typically fail to capture complex inter-layer dependencies by treating layers independently or aggregating them. To address this, we propose T-GINEE (Tensor-Based Generalized Multilayer-graph Estimating Equation), a statistical regularization framework combining tensor-based generalized estimating equations with task-specific loss to model cross-network correlations explicitly. Key innovations include: (1) CP tensor decomposition capturing structural dependencies via shared latent factors; (2) a generalized estimating equation framework modeling inter-layer correlations through working covariance matrices; and (3) a flexible link function accommodating characteristics like sparsity. Our theoretical analysis establishes consistency and asymptotic normality under mild conditions. Extensive experiments on synthetic and real-world datasets validate T-GINEE’s effectiveness for multilayer network analysis.

[LG-41] Machine Learning methods for event classification and vertex reconstruction of the 12C + 12C reaction with the MATE-TPC

Link: https://arxiv.org/abs/2605.28296
Authors: Minghui Zhang, Xiaobin Li, Jie Chen, Ningtao Zhang, Fenhua Lu, Junrui Ma, Jiazhen Yan, Wanqin Tu, Xiaodong Tang, Bingshui Gao, Chengui Lu, Zhichao Zhang, Jinlong Zhang, Weiping Liu
Subjects: Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Instrumentation and Detectors (physics.ins-det)
Comments:

Click to view abstract

Abstract:In modern nuclear physics experiments, identifying events of interest is challenging for nuclear reaction studies with the active target Time Projection Chamber (TPC). In this work, machine learning techniques are employed to analyze the complex data of the 12C + 12C fusion reaction from a TPC named MATE (multi-purpose active-target time projection chamber for nuclear experiments). Specifically, we successfully applied Residual Neural Network (ResNet-50, ResNet-34 and ResNet-18) and Visual Geometry Group (VGG-19) to classify elastic scattering and fusion reaction events from the 12C + 12C reaction. The classification results of the four models are nearly identical, with accuracies of approximately 97% for the simulated data and 90% for the experimental data. Moreover, these approaches successfully identify some events that are misclassified by traditional methods. These models are also applied to classify events from different fusion reaction channels, with classification accuracies of approximately 95% on simulated data. In addition, a Convolutional Neural Network (CNN) model is developed to reconstruct the reaction vertex, providing an alternative strategy for vertex reconstruction. These results indicate that machine learning techniques can effectively classify reaction events from different channels and reconstruct the reaction vertex, thereby paving the way for future analyses of complex nuclear reaction data.

[LG-42] Adaptive Bandit Algorithms for Contextual Matching Markets ICML2026

Link: https://arxiv.org/abs/2605.28290
Authors: Shiyun Lin, Simon Mauras, Vianney Perchet, Nadav Merlis
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: Accepted to ICML 2026

Click to view abstract

Abstract:We study bandit learning in matching markets, where players and arms constitute the two market sides, and the players’ utilities are linear in the arm contexts. In each round, new arms arrive with observable contexts. Then, the algorithm matches them to players, aiming to minimize each player’s regret against a stable matching benchmark. This contextual structure creates significant complexity: subtle context shifts can slightly alter one player’s utility while completely reconfiguring the underlying benchmark, causing large regret spikes for others. We address this in two settings: stochastic contexts, drawn from a latent distribution, and adversarial contexts, which may be arbitrary. For the stochastic case, we introduce a novel minimum preference gap to capture learning difficulty and provide a fully adaptive algorithm with an instance-dependent poly-logarithmic regret upper bound. We also establish matching instance-independent regret upper and lower bounds under a mild distributional assumption. For the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts and achieves an instance-independent sublinear regret bound via an adaptive algorithm.

[LG-43] AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning

Link: https://arxiv.org/abs/2605.28287
Authors: Bjarke Hastrup, Francois Cornet, Tejs Vegge, Arghya Bhowmik
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract

Abstract:Discovering novel stable molecules without training data remains a grand scientific challenge. Current molecular generative models are trained on large, pre-curated datasets, which introduce biases and limit exploration of novel chemistry. In contrast, we propose a new paradigm: autonomous, generalized agents capable of mapping vast, unknown chemical spaces without any pretraining. For the first time, we present AtomComposer, a self-guided agent that autonomously constructs valid 3D isomers under stoichiometric constraints and is trained exclusively online using reinforcement learning. Unlike existing approaches that generally overfit to a specific chemical formula, we establish a multi-composition training scheme that enables a broad generalization across diverse chemistry, guided by energy- and validity-based rewards. Our agent can discover up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines trained with per-step energy rewards. These results fulfill the promise of online reinforcement learning as a powerful paradigm for scalable, from-scratch exploration of chemical configuration space.

[LG-44] Commit to the Bit: Reactive Reinforcement Learning Done Right

Link: https://arxiv.org/abs/2605.28276
Authors: Onno Eberhard, Claire Vernade, Michael Muehlebach
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access non-Markovian state features. We consider the problem of learning an optimal reactive policy in a finite environment with deterministic observations (or equivalently, hard state aggregation). We introduce a new algorithm, Committed Q-learning, and prove almost-sure convergence to the optimal reactive policy under an intuitive assumption we call rewire-robustness. This assumption is strictly weaker than the $q_\star$-realizability condition used in prior work. Our algorithm is a variant of classical Q-learning in which the behavior policy commits to a single action upon entering a feature, and only resamples actions when the observed feature changes. A crucial part of our analysis is the introduction of quasi-Markov environments.
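
The commitment rule described in the abstract can be sketched in a few lines. The code below is an illustration under assumptions, not the paper's algorithm: the epsilon-greedy choice at resampling time, the textbook Q-learning update, and the names `CommittedPolicy` and `q_update` are all hypothetical. The only part taken from the abstract is that the action is resampled only when the observed feature changes.

```python
import random

class CommittedPolicy:
    """Behavior policy that keeps its action while the observed feature is unchanged."""

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon          # exploration rate (assumed epsilon-greedy)
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Forget the commitment, e.g. at the start of an episode."""
        self.prev_feature = None
        self.action = None

    def act(self, feature, q_row):
        # Resample only when the agent enters a different feature.
        if feature != self.prev_feature:
            if self.rng.random() < self.epsilon:
                self.action = self.rng.randrange(self.n_actions)
            else:
                self.action = max(range(self.n_actions), key=lambda a: q_row[a])
            self.prev_feature = feature
        return self.action

def q_update(Q, f, a, r, f_next, alpha=0.1, gamma=0.9):
    """Textbook Q-learning update on feature-action values (assumed form)."""
    target = r + gamma * max(Q[f_next])
    Q[f][a] += alpha * (target - Q[f][a])
```

Even with `epsilon=1.0`, repeated calls to `act` with the same feature return the same action, which is the difference from ordinary epsilon-greedy exploration.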

[LG-45] Dynamic Topic Modeling with a Higher-Order Hypergraphical Representation

Link: https://arxiv.org/abs/2605.28269
Authors: Hanjia Gao, Hanwen Ye, Qing Nie, Annie Qu
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 34 pages, 4 figures

Click to view abstract

Abstract:Dynamic topic modeling is widely used to analyze evolving trends in scientific literature, medical records, and social media. Traditional topic models represent each topic through a single probability vector on the multinomial simplex and implicitly couple word occurrence and repetition within one probabilistic mechanism. However, this formulation restricts the dependence structure among words and overlooks informative higher-order interactions, particularly in dynamic corpora with overlapping semantics. To address these limitations, we introduce a hypergraph representation of text where each document is modeled as a hyperedge connecting all co-occurring words, with repetition intensities encoded as node weights. This representation naturally separates word occurrence from repetition and induces a novel hypergraph-based multinomial distribution with a nonlinear normalization depending on the observed word set of each document. Building on this likelihood, we develop a dynamic topic modeling framework via structured low-rank factorizations with explicit temporal regularization on topic-word profiles. Moreover, we establish local convergence guarantees and derive non-asymptotic error bounds despite the intrinsic nonconvexity induced by bilinear factorization and document-specific nonlinear normalization. Numerical experiments on synthetic data and an application to the International Conference on Learning Representations (ICLR) corpus demonstrate consistent improvements over existing multinomial-based topic models.

[LG-46] Parameter-Efficient Generative Modeling with Controlled Vector Fields

Link: https://arxiv.org/abs/2605.28267
Authors: Peyman Morteza
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We introduce a continuous-time generative modeling framework, motivated by the Chow-Rashevskii theorem, that builds expressive flows from a small set of fixed vector fields and learned scalar controls. Instead of learning an unconstrained high-dimensional vector field, our framework constructs the velocity by modulating fixed vector fields with learned scalar control functions. When the fixed fields are bracket-generating, their Lie algebra spans the ambient space, providing a mechanism for expressive transport with only a small number of learned control channels and offering a parameter-efficient geometric alternative to standard vector-field parameterizations. This decoupled formulation yields a structured and interpretable generative model in which the number of learned scalar output channels can be chosen independently of the ambient dimension. We formulate an expressivity principle showing that, under suitable controllability and well-posedness assumptions, such controlled flows can transport a source distribution to a target distribution. We train the resulting model using a continuous-normalizing-flow likelihood objective and present proof-of-concept experiments on synthetic distributions.
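
A small numerical example may help show why a few fixed fields can reach more directions than their count suggests. The sketch below uses the classical Heisenberg fields on $\mathbb{R}^3$, which are bracket-generating; this choice, the hand-picked piecewise-constant controls, and the function names are assumptions for illustration. In the paper the scalar controls are learned functions rather than fixed constants.

```python
import numpy as np

def fixed_fields(x):
    """Two fixed vector fields on R^3 (Heisenberg example); rows are f_1(x), f_2(x)."""
    return np.array([[1.0, 0.0, -x[1] / 2.0],
                     [0.0, 1.0,  x[0] / 2.0]])

def velocity(x, u):
    """Velocity built by modulating the fixed fields with scalar controls u = (u_1, u_2)."""
    return np.asarray(u, dtype=float) @ fixed_fields(x)

def integrate(x0, controls, h, steps=100):
    """Euler-integrate the controlled flow, holding each control for time h."""
    x = np.array(x0, dtype=float)
    dt = h / steps
    for u in controls:
        for _ in range(steps):
            x = x + dt * velocity(x, u)
    return x

# A square loop in the two controlled directions: forward f_1, forward f_2,
# backward f_1, backward f_2.
loop = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
end = integrate([0.0, 0.0, 0.0], loop, h=1.0)
```

With `h = 1.0` the loop returns to the starting `x` and `y` but shifts `z` by `h**2`, so `end` is approximately `(0, 0, 1)`. Neither field points along `z` at the origin, yet the Lie bracket of the two fields does, which is the mechanism the Chow-Rashevskii theorem formalizes.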

[LG-47] ProgVLA: Progress-Aware Robot Manipulation Skill Learning

Link: https://arxiv.org/abs/2605.28231
Authors: Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

[LG-48] PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion

Link: https://arxiv.org/abs/2605.28226
Authors: Łukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, Dawid Rymarczyk, Tomasz Danel
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Small-molecule drug discovery requires simultaneous optimization of numerous properties of candidate molecules. These properties can be investigated through the analysis of high-dimensional biological signatures, such as cell morphology and transcriptomic perturbations, which provide a rich perspective on the underlying biological mechanisms. However, existing generative methods, which use those signatures for optimization, fail to meet two key requirements: providing precise guidance toward desired phenotypic signatures while maintaining structural proximity to a known hit. We introduce PhAME (Phenotype-Aware Molecular Editing), a latent diffusion framework that overcomes this challenge by recasting molecular optimization as editing in the latent space of a pretrained graph-based VAE. Our central contribution is a compositional classifier-free guidance scheme with two independent scales, one for the phenotype-conditioning and one for similarity to the seed structure, allowing practitioners to control the tradeoff between these two objectives. Empirical evaluations across diverse benchmarks, including docking score optimization and multimodal phenotypic generation, demonstrate that PhAME achieves state-of-the-art results while maintaining high chemical validity and novelty.
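
The two-scale guidance mentioned in the abstract can be illustrated with the standard compositional form of classifier-free guidance, in which each condition contributes its own scaled offset from the unconditional prediction. The exact formula used by PhAME is not given in the abstract, so the form below, the function name `compose_guidance`, and the argument names are assumptions.

```python
import numpy as np

def compose_guidance(eps_uncond, eps_pheno, eps_seed, w_pheno, w_seed):
    """Combine denoiser predictions with two independent guidance scales.

    Assumed form: eps = eps_uncond
                        + w_pheno * (eps_pheno - eps_uncond)
                        + w_seed  * (eps_seed  - eps_uncond)
    where eps_pheno is conditioned on the phenotype and eps_seed on the seed structure.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    d_pheno = np.asarray(eps_pheno, dtype=float) - eps_uncond
    d_seed = np.asarray(eps_seed, dtype=float) - eps_uncond
    return eps_uncond + w_pheno * d_pheno + w_seed * d_seed
```

Under this form, raising `w_pheno` pushes samples toward the target phenotypic signature and raising `w_seed` keeps them closer to the seed molecule; setting both scales to zero recovers the unconditional prediction.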

[LG-49] Robust Contrastive Graph Clustering with Adaptive Local-Global Integration

Link: https://arxiv.org/abs/2605.28209
Authors: Lei Zhang, Fubo Sun, Haipeng Yang, Zhong Guan, Likang Wu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graph clustering is essential in graph analysis for revealing structural patterns and node communities. Despite recent advances in self-supervised contrastive learning that have improved clustering via structural and attribute signals, existing methods still struggle to flexibly capture high-order local structures and often overlook global semantics in complex graphs. These limitations lead to suboptimal node representations, especially in real-world graphs with fragmented structures and ambiguous cluster boundaries. To address these limitations, a contrastive graph clustering framework is proposed to jointly integrate multi-scale local structures with global semantics via attention mechanisms. At the local level, GNN-based topological signals extracted from multiple propagation depths are adaptively fused through attention-based weighting to capture multi-scale neighborhood features. At the global level, semantic prototypes derived from dynamically evolving cluster centers are adaptively aggregated through attention to guide node representations and enhance inter-cluster separability. The model is trained under a dual-view contrastive learning paradigm with a hybrid objective that combines instance-level and structure-aware losses to improve representation robustness and discrimination. Experiments on eight real-world graph datasets demonstrate that our method achieves competitive clustering performance. Code is available at this https URL.

[LG-50] Refining Multidimensional Video Reward Models via Disentangled Influence Functions

Link: https://arxiv.org/abs/2605.28203
Authors: Muyao Wang, Zeke Xie, Hideki Nakayama
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.

[LG-51] Geometry-First Generative Spatial Single-Cell Reconstruction KDD2026

Link: https://arxiv.org/abs/2605.28200
Authors: Ehtesamul Azim, Muhtasim Noor Alif, Tae Hyun Hwang, Yanjie Fu, Wei Zhang
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 32nd SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

Click to view abstract

Abstract:Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.

[LG-52] Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework ICML2026

Link: https://arxiv.org/abs/2605.28198
Authors: Junfeng Nie, Alvin Jin, Xiaohui Chen
Subjects: Machine Learning (cs.LG)
Comments: Accepted as a poster at FMSD @ ICML 2026. 9 pages, 6 figures

Click to view abstract

Abstract:Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-synthetic-test-real performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.

[LG-53] Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration

Link: https://arxiv.org/abs/2605.28184
Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

[LG-54] Unification and Optimization of Robust Supervised Learning

Link: https://arxiv.org/abs/2605.28165
Authors: Jonas Hanselle, Valentin Margraf, Clemens Damke, Eyke Hüllermeier
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The literature has proposed various robust alternatives to empirical risk minimisation to address failure modes such as distribution shift, label noise and finite-sample degeneracies. Examples include distributionally robust optimization, label smoothing, vicinal risk minimization, and Mixup. However, such approaches are typically developed in isolation, forcing practitioners to commit a priori to a single failure mode even when the dominant mode for the task is unclear. To address this, we organize a broad class of existing methods along three common design axes and derive a tractable training procedure that decomposes robust learning into sequential stages (reference distribution enrichment, input-space perturbation, label-space perturbation, and sample-level aggregation), each with a choice of stance (pessimistic, neutral, or optimistic). This results in a unified design space in which joint hyperparameter optimization can compose and configure robustness strategies suited to the task at hand. Across tabular, image, and reward modeling benchmarks, joint hyperparameter optimization is competitive with the best single-method baseline in each setting, offering a reliable default for practitioners who do not know a priori which failure mode dominates their task.

[LG-55] Temporal Hyperbolic Graph Representation Learning for Scale-Free Internet Routing and Delay Prediction

链接: https://arxiv.org/abs/2605.28155
作者: Yi-Ling Kuo,Hao-Yu Tien,Shih-Yu Tsai
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Predicting Internet round-trip time (RTT) is critical for routing optimization, quality-of-service (QoS) provisioning, and traffic engineering, yet remains challenging due to long-term temporal dependencies, evolving routing dynamics, and heavy-tailed latency distributions. While Temporal Graph Neural Networks (TGNNs) can model evolving network topologies, most existing approaches operate in Euclidean space, which poorly captures the hierarchical and scale-free structure of Internet routing graphs. Hyperbolic geometry provides a more suitable representation space. We propose HERMIT (Hyperbolic Edge-aware RTT Modeling via Integrated Topology), a hybrid framework combining a hyperbolic manifold-preserving temporal GNN with a Random Forest regressor for joint link prediction and RTT prediction. Built on HMPTGN, HERMIT introduces RTT-aware edge features and a learnable edge encoder to improve modeling of evolving link states and routing behavior. The resulting hyperbolic node representations are combined with historical RTT statistics for robust latency prediction. We evaluate HERMIT on a large-scale real Internet dataset spanning 2015-2024. HERMIT consistently outperforms a strong Random Forest baseline using only historical RTT statistics, achieving a 6% RMSE improvement while reducing large errors on heavy-tailed samples. It also surpasses prior hyperbolic TGNN models, including HMPTGN and HTGN, in link prediction performance. These results demonstrate that combining hyperbolic temporal graph learning with tree-based regression provides a scalable solution for RTT prediction in real-world Internet topologies.

[LG-56] Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think

链接: https://arxiv.org/abs/2605.28150
作者: Otmane Sakhi,Aleksei Arzhantsev,Imad Aouali,Flavian Vasile
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large-scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This makes learning inherently off-policy. Most existing approaches nevertheless remain rooted in PPO-style trust-region objectives, treating training as approximately on-policy and using importance weights to correct distribution mismatch. These corrections can introduce high variance, destabilize optimization, and accelerate entropy collapse. Recent work suggests an alternative: rather than correcting the mismatch, one can embrace off-policy data and remove importance weights, often yielding stronger algorithms. In this paper, we provide an intuitive construction of off-policy objectives that includes successful existing off-policy objectives and show that their effectiveness can be understood through implicit pessimism: they optimize toward target policies that are more conservative than their nominal objectives suggest. This perspective explains why some particular implementation choices improve stability: they implicitly control the effective target distribution. We then propose a principled modification that stabilizes this induced distribution and improves off-policy learning.

[LG-57] Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

链接: https://arxiv.org/abs/2605.28149
作者: Bartosz Wieciech,Zmnako Awrahman,Marcin Czelej,Victor Hugo Jaramillo Velasquez,Wioletta Stobieniecka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) extract interpretable features from Large Language Models, but standard variants enforce non-negativity, forcing separate latents for diametrically opposed concepts (e.g., “pressure too high” vs. “pressure too low”) and wasting dictionary capacity when features are anticorrelated. We propose the Sign-Aware Gated SAE (SA-GSAE): two-sided gated sparsity with signed magnitude and auxiliary supervision. A polarity-sensitive gate selects support on either sign, a signed-magnitude path avoids L1 shrinkage, and an auxiliary reconstruction prevents gate collapse. Bipolar sharing - one latent encoding both signs along a shared direction - is realised via a new Bi-Jump-ReLU activation; parameter accounting shows sign-awareness stays parameter-efficient even when anticorrelated pairs are rare. On real LLM activations across three mid-depth hookpoints on Pythia-1B and SmolLM3-3B (6 cells, 3 seeds), a half-width SA-GSAE at width H strictly Pareto-dominates a full-width Gated SAE at 2H over the entire swept L0 overlap on 3 of 6 cells (both MLP-output hookpoints and resid-mid/Pythia-1B); on the remaining 3 it matches R^2 within 0.025 (max gap -0.008) while cutting dead fraction by 0.35-0.62 absolute. Sweep-geomean dead-fraction reductions are ~100x-500x on MLP-output cells and Pythia-1B resid, ~2x-4x on attention cells and SmolLM3-3B resid. Ablations show the two-sided gate and auxiliary loss are load-bearing (no auxiliary collapses LR to 0.27, 98% dead); tying r_i^+ = r_i^- is indistinguishable (|Delta R^2| = 0.0015), and we recommend this symmetric variant as default. MLP-output gains come from most latents carrying both polarities; on attention, bipolar structure concentrates in a small set of top latents. Full-width SA-GSAE exhibits a reproducible reconstruction collapse at SmolLM3-3B resid that the half-width entirely avoids. 
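
The bipolar-sharing idea in the SA-GSAE abstract can be illustrated with a two-sided jump activation. This is only a minimal sketch: the abstract does not give the exact form of Bi-Jump-ReLU, so the function name `bi_jump_relu` and the thresholds `theta_pos` and `theta_neg` are assumptions, modeled on the standard JumpReLU, which passes a pre-activation through unchanged once it exceeds a threshold.

```python
import numpy as np

def bi_jump_relu(z, theta_pos=0.5, theta_neg=0.5):
    """Hypothetical two-sided JumpReLU (an assumption, not the paper's definition).

    Keeps z unchanged where z > theta_pos or z < -theta_neg, and zeroes it
    otherwise, so one latent can fire with either sign.
    """
    z = np.asarray(z, dtype=float)
    keep = (z > theta_pos) | (z < -theta_neg)
    return np.where(keep, z, 0.0)

# Made-up pre-activations: only the strongly negative and strongly positive survive.
acts = bi_jump_relu([-2.0, -0.3, 0.0, 0.4, 1.5])
```

Because surviving values keep their sign and magnitude, a single dictionary direction could in principle encode both "too high" and "too low", which is the capacity saving the abstract describes.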

[LG-58] Sequential Neural Probabilistic Amplitude Shaping: Learning the Channel's Language

链接: https://arxiv.org/abs/2605.28143
作者: Mohammad Taha Askari,Lutz Lampe,Amirhossein Ghazisaeidi
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
*备注: 4 pages, 2 figures, Submitted to the 52nd European Conference on Optical Communications

点击查看摘要

Abstract:We present the first neural probabilistic amplitude shaping that outperforms existing methods while accounting for all implementation losses, using a block-less, easily implementable sequential autoregressive encoder compatible with arithmetic distribution matching, yielding reduced rate loss and higher achievable information rates.

[LG-59] Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback

链接: https://arxiv.org/abs/2605.28133
作者: Benjamin Heymann,Otmane Sakhi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of learning to bid when the bidder’s value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence bound algorithm learns the optimal policy with a near-optimal regret of \widetilde{O}(\log N) for piecewise linear primitives, and \widetilde{O}(N^{1/3}) for general, smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.

[LG-60] Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning

链接: https://arxiv.org/abs/2605.28127
作者: Kaiqiang Ke,Shenghong He,Chengdong Xu,Yuheng Luo,Xiangyuan Lan,Chao Yu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state–goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state–goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once the current target is estimated to be locally executable by a learned reachability cost. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.

[LG-61] Chreode: A Cell World Model for One-Step Temporal Dynamics and Perturbation Prediction NEURIPS2026

链接: https://arxiv.org/abs/2605.28111
作者: Mufan Qiu,Genhui Zheng,Yinuo Xu,Ruichen Zhang,Ying Ding,Qi Long,Tianlong Chen
类目: Machine Learning (cs.LG)
*备注: 25 pages, 3 figures, 14 tables. Submitted to NeurIPS 2026

点击查看摘要

Abstract:Predicting how a cell will change its transcriptional state under a developmental signal or a genetic perturbation is the computational core of in-silico biology and the AI Virtual Cell program. Existing approaches either fit static control-to-treated maps that discard time, or solve multi-step ODE / Schrödinger-bridge problems on each dataset independently. We introduce Chreode, a one-step cell world model that predicts action-conditioned cell-state transitions through a structured residual transition operator. It shifts distributional evolution from inference time to training time, enabling single-pass generation while preserving a Waddington-inspired decomposition into downhill landscape flow, rotational in-tangent dynamics, and stochastic spread. The model is pretrained with a shared scVI encoder and a DiT-based dynamics backbone on a 2.4M-cell mouse embryonic atlas spanning 7 datasets. As a fine-tuning initialization, Chreode improves per-target Sinkhorn distance on Weinreb hematopoiesis and Veres islet differentiation over matched scratch models, PI-SDE, and PRESCIENT. As a transferable gene-state embedding for GEARS, the pretrained dynamics representation reduces shared-vocabulary DE20 mean squared error on Norman Perturb-seq from 0.2121 to 0.1858, a 12.4% relative improvement, without changing the GEARS training procedure. We interpret this transfer to perturbation prediction as evidence that pretrained developmental-trajectory dynamics encode differentiation primitives transferable to CRISPR-induced state shifts, since both involve cell-state transitions in a shared latent geometry. The pretrained backbone additionally produces zero-shot clonal fate scores on Weinreb that are competitive with strong dynamic-OT baselines.

[LG-62] Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization ICML2026

链接: https://arxiv.org/abs/2605.28109
作者: Hao Jiang,Shurui Li,Tianpeng Bu,Bowen Xu,Xin Liu,Qihua Chen,Hongtao Duan,Lulu Hu,Bin Yang,Minying Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026 main conference

点击查看摘要

Abstract:Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy’s exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain balance during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at this https URL.

[LG-63] Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector

链接: https://arxiv.org/abs/2605.28103
作者: Junhao Wei,Yanxiao Li,Bidong Chen,Yifu Zhao,Haochen Li,Dexing Yao,Baili Lu,Xudong Ye,Jietian Feng,Sio-Kei Im,Yapeng Wang,Xu Yang
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors – spanning statistical, reconstruction, association, frequency, and generic-transformer families – are evaluated on five datasets (SMD, MSL, SMAP, PSM, and MSDS) under effectiveness, efficiency, robustness, and cross-dataset generalisation. All methods share the same windowing, scoring, hardware, and metric protocols. Effectiveness, ablation, and robustness use three random seeds; cross-dataset transfer uses seed 0 because each extra seed requires 250 source-target evaluations. The benchmark yields three method-independent findings: no single-bias baseline dominates; absolute perturbation VUS-ROC is more informative than retention ratios; and MSDS behaves as an event-dense deployment workload rather than a sparse point-anomaly benchmark. Under this protocol we also introduce our method, an adaptive detector family combining a NOTEARS-constrained directed channel-graph view with optional patch-attention and temporal-association views. Our method achieves the best macro-average VUS-ROC (0.675, +5.1 pt over the second-best LSTM-AE), ranks first overall, and reaches the top-3 on all five datasets. Its wins on MSL and MSDS are narrow, while its average and robustness gains are larger: under the same three-seed robustness protocol for every method, it obtains the strongest absolute VUS-ROC across noise, channel dropout, and time-shift perturbations. We release the MSDS preprocessing protocol, configurations, scripts, and seed-level metric dumps.

[LG-64] Measure-to-measure Regression with Transformers

链接: https://arxiv.org/abs/2605.28075
作者: Matthew Vandergrift,Martha White,Yury Polyanskiy,Philippe Rigollet,Lazar Atanackovic
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many learning problems require predicting how populations evolve under an unknown transformation. A natural representation for such populations is a probability measure, with point clouds as a key example. In this work, we study the measure-to-measure (M2M) regression problem, in which one seeks to learn a map between probability measures from a finite collection of observed input-output pairs. In contrast to classical regression, where individual samples are transformed independently, M2M regression treats entire distributions as the data points. This perspective is vital in certain scientific applications, for example, cellular and molecular biology, where cells are known to evolve not as independent data points but as a collection. However, few existing approaches address the problem of M2M regression with sufficient expressivity and scalability. We present a formalization of nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions. We illustrate the effectiveness of our proposed method to generalize to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for predicting treatment response in colorectal cancer.

[LG-65] PINE: Pruning Boosted Tree Ensembles with Conformal In-Distribution Prediction Equivalence ICML2026

链接: https://arxiv.org/abs/2605.28068
作者: Haruki Yajima,Yusuke Matsui
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026

点击查看摘要

Abstract:Tree ensembles are machine learning models with strong predictive performance and interpretability, and remain widely used for tabular data. Standard pruning methods for tree ensembles typically optimize an accuracy-compression trade-off and may change a subset of predictions, potentially compromising decision consistency. Faithful pruning methods address this issue by preserving prediction equivalence over the entire input space, but this requirement leads to lower compression ratios. We propose PINE, a pruning method that provides strong guarantees within an in-distribution region. PINE preserves prediction equivalence within this region and controls the region size using a single parameter \alpha via conformal calibration. Experiments on 12 public tabular datasets show that PINE improves the compression ratio by up to 30% while preserving predictions at a comparable level to existing faithful pruning methods.

[LG-66] RW-TTT: Batched Serving for Request-Owned Test-Time Training State

链接: https://arxiv.org/abs/2605.28053
作者: Jian Yang,Zhizhuo Kou,Yao Tian,Hao Zhang,Han Chen,Sirui Han,Yike Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT, which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits updates only to the owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, 9.31x over sequential serving and 3.44x over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.
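
The tagging-and-batching rule described in this abstract can be sketched as a toy scheduler. Everything here is hypothetical: `Step`, `form_batches`, and `commit` are illustrative names, and the compatibility rule (one effect per batch, per-owner order preserved, version checked at commit) is an assumed reading of "batches only compatible phases", not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    owner: str    # request that owns the TTT state
    version: int  # state version this step expects
    effect: str   # "READ" or "WRITE"

def form_batches(steps):
    """Group steps so each batch has a single effect and per-owner order is kept."""
    batches = []
    last_batch = {}  # owner -> index of the batch holding its latest step
    for step in steps:
        # Only batches after the owner's previous step are eligible.
        start = last_batch.get(step.owner, -1) + 1
        for i in range(start, len(batches)):
            if batches[i][0].effect == step.effect:
                batches[i].append(step)
                last_batch[step.owner] = i
                break
        else:
            batches.append([step])
            last_batch[step.owner] = len(batches) - 1
    return batches

def commit(state, step, new_value):
    """Apply a WRITE to the owner's state only, and only at the expected version."""
    _, version = state[step.owner]
    if step.effect != "WRITE" or version != step.version:
        raise ValueError("stale or non-WRITE step")
    state[step.owner] = (new_value, version + 1)

steps = [Step("A", 0, "READ"), Step("B", 0, "READ"),
         Step("A", 0, "WRITE"), Step("B", 0, "WRITE"), Step("B", 1, "READ")]
batches = form_batches(steps)
state = {"A": (0.0, 0), "B": (0.0, 0)}
commit(state, steps[2], 1.0)  # updates A only; B is untouched
```

In this toy run the two reads share a batch, the two writes share a batch, and B's second read waits for B's write, which is the kind of ordering a naive batcher could violate.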

[LG-67] BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

链接: https://arxiv.org/abs/2605.28028
作者: Qingfei Zhao,Huan Song,Shuyu Tian,Jiawei Shao,Xuelong Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
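
The selection step described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: rewards are binary (1 for correct, 0 for incorrect), `select_binary_pair` is a hypothetical helper name, and the adaptive completion scheduling and prefix-focused optimization mentioned in the abstract are omitted.

```python
import numpy as np

def select_binary_pair(rewards, lengths):
    """Return full-group advantages and the indices of the two kept completions."""
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths)
    # GRPO-style advantage, normalized with statistics of the FULL group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    kept = []
    for label in (1.0, 0.0):  # shortest correct first, then shortest incorrect
        idx = np.flatnonzero(rewards == label)
        if idx.size:
            kept.append(int(idx[np.argmin(lengths[idx])]))
    return adv, kept

# Made-up group of six completions for one prompt.
rewards = [1, 0, 1, 0, 0, 1]
lengths = [120, 300, 80, 150, 90, 200]
adv, kept = select_binary_pair(rewards, lengths)
```

Only the completions at `kept` would receive gradient updates, while their advantages still reflect the whole group's reward statistics.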

[LG-68] AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels

链接: https://arxiv.org/abs/2605.28021
作者: Fengqiang Wan,Qing-Yuan Jiang,Yang Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on unknown samples can lead to unreliable decisions. Outlier Exposure (OE) has emerged as a promising OOD detection paradigm by introducing auxiliary outliers during training to enlarge the margin between in-distribution (ID) and OOD samples. Existing OE-based methods typically enlarge this margin by employing uniform labels to maximize the entropy of OOD samples over ID categories. However, we theoretically show that uniform labels inevitably disregard the relations between OOD samples and ID categories, termed the over-softening effect, leading to a suboptimal margin bound. Our theoretical analysis further reveals that explicitly exploiting such relations can instead yield improved OOD detection performance. Motivated by this insight, we propose Adaptive Confidence OE (AOE), a simple yet effective method that leverages temperature scaling to recalibrate outlier labels. Specifically, AOE generates adaptive soft targets from temperature-scaled model predictions for OOD samples, where the learnable temperature smooths the prediction distribution without fully erasing class-wise relational information. By supervising OOD samples with these adaptive soft targets, AOE preserves the semantic proximity between OOD samples and ID categories while encouraging the softened targets to approach a high-entropy distribution, thereby suppressing overconfident OOD predictions and enlarging the separation margin. Extensive experiments across diverse benchmarks demonstrate the effectiveness of AOE.
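
The contrast between uniform outlier labels and temperature-scaled soft targets can be made concrete with a small numerical sketch. The logits and the fixed temperature `T = 4.0` are made-up values for illustration; in the abstract the temperature is learnable, and the training loss is not shown here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical model logits for one outlier sample over four ID classes.
logits = np.array([2.0, 1.0, 0.1, -1.0])
T = 4.0  # assumed fixed temperature (learnable in the paper)

plain = softmax(logits)        # raw, overconfident prediction
adaptive = softmax(logits / T) # softened target that keeps the class ranking
uniform = np.full(4, 0.25)     # standard OE target: no class relations left
```

The softened target has higher entropy than the raw prediction but still ranks the ID classes in the same order, whereas the uniform target discards that ranking entirely.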

[LG-69] Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection

链接: https://arxiv.org/abs/2605.27992
作者: Tae-Gyun Lee,Junyoung Park,Kyu Won Han
类目: Machine Learning (cs.LG)
*备注: 7 pages, 2 tables

点击查看摘要

Abstract:Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their \mathcal{O}(L^2) computational complexity severely limits deployment in resource-constrained environments. In this paper, we propose Patched-DeltaNet, a novel architecture combining time-series patching with Gated Delta Networks. By integrating these paradigms, we hypothesize and demonstrate the emergence of token-level event-driven memory, whereby the patching mechanism extracts local semantic chunks, while the error-driven DeltaNet updates its recurrent state exclusively when significant physical changes, defined as deltas, occur. This synergy effectively filters out background noise and captures sudden anomalous drifts. Our rigorous experiments on the Server Machine Dataset (SMD) benchmark demonstrate the structural superiority and sample efficiency of Patched-DeltaNet. By strictly outperforming recent architectures under unified evaluation constraints and identical compute budgets, our model yields an ROC-AUC of 0.957 and PA-F1 of 0.822, while drastically reducing computational complexity to the theoretical minimum of \mathcal{O}(L/P).
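
The two ingredients named in this abstract, patching and an error-driven state update, can be sketched with the classical delta rule. This is a simplified illustration: the gating of Gated Delta Networks is omitted, and `delta_update`, `patchify`, and all numeric values are assumptions rather than the paper's code.

```python
import numpy as np

def delta_update(S, k, v, beta=1.0):
    """One delta-rule step: correct the memory's prediction error for key k."""
    error = v - S @ k  # what the memory currently gets wrong for this key
    return S + beta * np.outer(error, k)

def patchify(x, P):
    """Split a length-L series into L // P non-overlapping patches of length P."""
    x = np.asarray(x)
    usable = len(x) - len(x) % P  # drop a trailing remainder, if any
    return x[:usable].reshape(-1, P)

k = np.array([1.0, 0.0])                  # unit-norm key
v = np.array([0.5, -0.5, 2.0])            # value to store
S = delta_update(np.zeros((3, 2)), k, v)  # first write stores v under k
S_again = delta_update(S, k, v)           # same pair again: zero error, no change
patches = patchify(np.arange(512.0), P=16)  # 512 time steps become 32 tokens
```

When the memory already predicts `v` from `k`, the error term vanishes and the state is left untouched, which is the sense in which updates are driven by changes; patching then shortens the recurrent scan from L steps to L/P tokens.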

[LG-70] Law of Neural Interaction: Depth-Width Shape Interaction Efficiency and Generalization

链接: https://arxiv.org/abs/2605.27989
作者: Wenjie Sun,Jinning Yang,Shuai Zhang,Mengnan Du
类目: Machine Learning (cs.LG)
*备注: 30 pages, 4 figures

点击查看摘要

Abstract:The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has identified superposition as a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio (R_{D/W}). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small-scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that R_{D/W} influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: this https URL

[LG-71] Continual Learning in Modern Hopfield Networks with an Application to Diffusion Models

链接: https://arxiv.org/abs/2605.27975
作者: Ken Takeda,Masafumi Oizumi,Ryo Karakida
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Generative models, including diffusion models, are increasingly used as foundation models and adapted through sequential fine-tuning, making continual learning an essential problem setting. However, continual learning in such generative models remains poorly understood: after a task change, what aspects of the learned distribution are most easily lost, and what replay samples should be prioritized? We address these questions through the modern Hopfield energy. Recent links between modern Hopfield networks (MHNs) and diffusion models allow analyses in MHNs to be transferred to diffusion models. We introduce intrinsic forgetting as an increase in Hopfield energy after the task change. In tractable settings in an MHN, we prove that high-energy, outlier-like samples undergo a larger energy increase than cluster-like samples, implying that samples located in sharp, isolated basins are more forgettable. We further analyze memory replay and show that replay is particularly effective for high-energy samples, enabling an energy-based selection of replay samples. We validate these predictions in experiments on MHNs and two diffusion models under continual-learning settings: Stable Diffusion and a pixel-space DDPM. In these diffusion models, Hopfield energy tracks reconstruction-based forgetting, and replay experiments reveal energy-dependent mitigation of forgetting that is consistent with the MHN analysis.

[LG-72] Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning

链接: https://arxiv.org/abs/2605.27954
作者: Wendi Li,Shawn Im,Sharon Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving these behaviors, and recent agent RL methods have achieved strong results across domains. However, the training dynamics of agent RL remain poorly understood, limiting our ability to diagnose instabilities and design more effective training algorithms. In this work, we identify a previously underexplored phenomenon in agent RL, which we term cyclical entropy eruption. Unlike single-turn reasoning RL, where entropy typically collapses and stays low, agent RL training exhibits unique recurring cycles of sharp entropy eruption and gradual subsidence. We decompose this dynamic into three phases and provide theoretical and empirical analyses of each, explaining the mechanisms underlying its cyclical oscillation. We further show that degenerate patterns such as sentence duplication and hallucination, once acquired during eruption, can persist and accumulate across cycles. Motivated by these findings, we propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space, directly targeting the root cause of entropy eruption. Experiments across multiple benchmarks, models, and RL algorithms demonstrate that SEAL stabilizes training and yields stronger downstream agent performance.

[LG-73] Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal

链接: https://arxiv.org/abs/2605.27919
作者: Junlin Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: A preprint version of FGO

点击查看摘要

Abstract:Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: this https URL

[LG-74] Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs

链接: https://arxiv.org/abs/2605.27913
作者: Safal Thapaliya,Jiatan Huang,Chuxu Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels, and uses this estimate to decide which pseudo-labels to trust, and which labels to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.

[LG-75] Privately Estimating Monotone Statistics in Polynomial Time

Link: https://arxiv.org/abs/2605.27912
Authors: Gavin Brown, Ephraim Linder, Mahbod Majid, Vikrant Singhal
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:We study efficient differentially private algorithms for estimating monotone statistics, i.e., statistics that are monotone under the addition of new observations. The starting point for our investigation is subsample-and-aggregate: a classical paradigm that partitions the dataset into blocks, estimates the statistic on each block, and then privately aggregates the block estimates. While practical and generically applicable, this approach is quite data-hungry. We improve upon this framework for the class of monotone statistics – compared to subsample-and-aggregate, our algorithms save a factor of $t$ in sample complexity and pay a factor of $e^t$ in running time, where $t > 0$ is a tunable parameter. We complement our results with a query-complexity lower bound, showing that our algorithms are essentially optimal for this task. As an application, we obtain improved results for private eigenvalue estimation, private loss estimation, and privately estimating a single parameter of a high-dimensional model, e.g., in linear regression.
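
The subsample-and-aggregate baseline that this abstract starts from can be sketched in a few lines. This is a minimal illustration of the classical paradigm only, not the paper's improved algorithm. The function names, the clipping range `[lo, hi]`, the choice of a noisy clipped mean as the aggregator, and the replace-one notion of neighbouring datasets are all assumptions made for the example.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def subsample_and_aggregate(data, statistic, num_blocks, lo, hi, epsilon, rng):
    """Split data into disjoint blocks, evaluate `statistic` on each block,
    then release the mean of the clipped block estimates plus Laplace noise."""
    if not 0 < num_blocks <= len(data):
        raise ValueError("num_blocks must be between 1 and len(data)")
    if hi <= lo or epsilon <= 0:
        raise ValueError("need hi > lo and epsilon > 0")
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Disjoint blocks: record i goes to block i mod num_blocks.
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    # Clip each block estimate to [lo, hi] so one block has bounded influence.
    estimates = [min(hi, max(lo, statistic(block))) for block in blocks]
    # Replacing one record changes one block, so the mean of the clipped
    # estimates moves by at most (hi - lo) / num_blocks.
    sensitivity = (hi - lo) / num_blocks
    return statistics.fmean(estimates) + laplace_noise(sensitivity / epsilon, rng)
```

For example, `subsample_and_aggregate(data, statistics.fmean, 100, 0.0, 10.0, 1.0, random.Random(0))` releases a noisy mean from 100 blocks. The data hunger mentioned in the abstract shows up here directly: each block must be large enough for `statistic` to be accurate on its own, and many blocks are needed to keep the noise small.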

[LG-76] FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

Link: https://arxiv.org/abs/2605.27892
Authors: Jun Bai, Ziyang Song, Yue Li
Subjects: Machine Learning (cs.LG)
Comments: 8 pages main paper with 14 pages supplementary appendix

Abstract:Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.

[LG-77] Reward Transfer from Inverse Reinforcement Learning: A Coupled Minimax Approach

Link: https://arxiv.org/abs/2605.27834
Authors: Guang-Yuan Hao, Lars van der Laan, Aurélien Bibaut, Nathan Kallus
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the transfer of rewards learned using inverse reinforcement learning from expert demonstrations in one environment to reinforcement learning in a new, different environment. This arises naturally when demonstrations are collected in a controlled environment. We formulate the problem as a joint system of Bellman equations across the source and target environments and develop minimax estimators for the target soft-$q$-function. Whereas a sequential solution approach first estimates the source reward and then plugs it into the target control problem, a coupled approach solves the source and target system of equations jointly. We show that, in contrast to the sequential approach, the coupled approach removes the first-order influence of source Bellman residual error. We characterize the local behavior of each approach, develop finite-sample soft-$q$-function error bounds, and prove regret guarantees for the resulting soft-control policy. An empirical investigation using a sepsis simulator validates the theoretical comparison.

[LG-78] Decentralized Parameter-Free Online Learning with Compressed Gossip

Link: https://arxiv.org/abs/2605.27831
Authors: Tomas Ortega, Hamid Jafarkhani
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)

Abstract:We study decentralized online convex optimization when agents communicate over a graph and messages may be compressed. Classical decentralized online methods typically require learning-rate choices that depend on the horizon, comparator scale, or other problem parameters, while compressed communication introduces additional disagreement that must be controlled. We propose DECO-EF (DEcentralized COin-betting with Error Feedback), a decentralized parameter-free online learning algorithm that combines coin-betting predictions with compressed difference-based gossip. Each agent maintains a clean accumulated state and a compressed tracker, and communicates only compressed state differences during gossip steps. The method is parameter-free in the online-learning sense: it does not tune to the horizon, the comparator norm, or the learning rate. We prove expected comparator-adaptive network-regret bounds for DECO-EF under compressed communication. To the best of our knowledge, this gives the first expected sublinear network-regret guarantees for parameter-free decentralized online learning under compressed communication.
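
To make "parameter-free" concrete, here is a minimal single-agent sketch of a coin-betting learner using the Krichevsky-Trofimov (KT) betting fraction, a standard construction in the coin-betting literature. The abstract does not say which bettor DECO-EF uses, and the gossip, compression, and error-feedback parts are not modelled. The class name and the assumption that gradients lie in [-1, 1] are illustrative.

```python
class KTCoinBetting:
    """One-dimensional parameter-free online learner via KT coin betting.

    Assumes every (sub)gradient lies in [-1, 1]. There is no learning rate:
    the prediction is a fraction of the accumulated wealth.
    """

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth  # starting endowment plus winnings so far
        self.coin_sum = 0.0           # sum of past coin outcomes (negative gradients)
        self.rounds = 0
        self.prediction = 0.0

    def predict(self):
        self.rounds += 1
        # KT betting fraction: average past outcome, shrunk by 1 / rounds.
        fraction = self.coin_sum / self.rounds
        self.prediction = fraction * self.wealth
        return self.prediction

    def update(self, gradient):
        if abs(gradient) > 1.0:
            raise ValueError("gradient must lie in [-1, 1]")
        coin = -gradient  # betting on the negative gradient direction
        self.wealth += coin * self.prediction
        self.coin_sum += coin
```

Each round calls `predict()` and then `update(gradient)` with the gradient observed at the prediction. Because the betting fraction always has magnitude below one, the wealth stays positive, which is what lets the learner adapt to the comparator scale without tuning.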

[LG-79] MRMMIA: Membership Inference Attacks on Memory in Chat Agents

Link: https://arxiv.org/abs/2605.27825
Authors: Kai Chen, Yan Pang, Tianhao Wang
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: This work investigates the MIA on chat agent memory

Abstract:Membership inference attacks (MIAs) test whether a target data record belongs to a system’s private data, and have become a standard tool to measure privacy leakage in machine learning systems. Prior work has primarily focused on training corpora or retrieval databases. However, MIAs against agent memory have received less attention, even though such memory can contain sensitive user-agent interactions, retrieved facts, and user preferences. Therefore, in this work, we focus on chat agent memory MIAs, where an adversary infers whether a candidate memory unit belongs to the chat agent’s memory store. We propose Multi-Recall Memory MIA (MRMMIA), a unified attack that utilizes multiple recall probes to the agent to extract the membership signal across black-box, gray-box, and white-box settings. Our experiments demonstrate that MRMMIA consistently outperforms baselines. Our results expose the privacy risk in agents and provide an initial evaluation framework for membership leakage in chat-agent memory systems.

[LG-80] Density-aware Sample-specific Attack NEURIPS2026

Link: https://arxiv.org/abs/2605.27809
Authors: Qiyuan Wang, Yao Li, Raymond K. W. Wong
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: 18 pages, 6 figures, 8 tables. Submitted to NeurIPS 2026

Abstract:Despite recent progress in backdoor attacks, existing methods remain susceptible to post-training defenses that erase the backdoor through fine-tuning or pruning. We revisit the core objectives of backdoor attacks and derive principled criteria characterizing optimal sample-specific trigger construction under a Bayes-optimal model of the victim’s training. Our analysis reveals that both attack success and clean-accuracy preservation are simultaneously optimized when triggered samples are steered into low-density regions of the clean data distribution, a distributional condition that controls all moments of the poisoned distribution at once rather than a handful of input-space summary statistics. We introduce a bilevel optimization framework that estimates density ratios via conditional time-score matching and optimizes a mixture-model objective to place triggered samples in these sparse regions. Extensive evaluations on MNIST, CIFAR-10, GTSRB, and TinyImageNet demonstrate that our method achieves above 99% attack success rate before defense and retains 50–85 percentage points higher post-defense ASR than the strongest baselines under fine-tuning defenses. Against neuron-pruning defenses, the method exhibits complete immunity, with zero neurons identified for removal across all pruning thresholds. These results expose a fundamental gap in current defense paradigms and underscore the need for defenses that operate beyond the support of the clean distribution.

[LG-81] SYNAPSE: Neuro-Symbolic Visual Thought-to-Text Decoding via Topological Semantic Denoising

Link: https://arxiv.org/abs/2605.27790
Authors: Akshaj Murhekar, Abhijit Mishra
Subjects: Machine Learning (cs.LG)

Abstract:Recent advances in large language models have accelerated open-vocabulary EEG-to-imagined-text decoding, where non-invasive neural activity recorded during visual perception is translated into coherent natural language descriptions of viewed stimuli. However, existing systems remain highly vulnerable to biological noise, where corrupted neural projections induce hallucinated or semantically unstable generation in frozen language models. We introduce SYNAPSE (Symbolic Neural Alignment for Precise Semantic Extraction), a lightweight neuro-symbolic framework that stabilizes neural text generation through inference-time symbolic regularization. By purifying EEG-derived semantic candidates using commonsense graph structure and latent exemplars, SYNAPSE improves semantic stability without end-to-end LLM fine-tuning. Experiments across popular EEG decoding benchmarks and multiple frozen LLM backends demonstrate consistent gains over unconstrained prompting baselines, robustness under object-label ablation, and performance commensurate with substantially more resource-intensive fine-tuned systems, while preserving biometric privacy by localizing raw EEG processing entirely within the encoder stack.

[LG-82] Revisiting ML Training under Fully Homomorphic Encryption: Convergence Guarantees, Differential Privacy, and Efficient Algorithms

Link: https://arxiv.org/abs/2605.27782
Authors: Yvonne Zhou, Mingyu Liang, Ivan Brugere, Danial Dervovic, Yue Guo, Antigoni Polychroniadou, Min Wu, Dana Dachman-Soled
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:We present the first theoretical convergence analysis of machine learning training under fully homomorphic encryption (FHE), combined with a differentially private (DP) training algorithm tailored to encrypted computation. Our approach improves computational efficiency over standard differentially private gradient descent (DP-GD) while achieving comparable utility. In particular, we prove convergence of approximate gradient descent using polynomial approximations of activation and loss functions, which are required for FHE compatibility. To preserve privacy in downstream tasks, we integrate differential privacy without relying on costly per-sample gradient clipping, enabling scalable encrypted learning. We also provide data-independent hyperparameter selection and theoretically grounded strategies for polynomial approximation which can be of independent interest. Together, these contributions advance the feasibility of efficient, private, and secure machine learning on sensitive data.
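
FHE schemes evaluate additions and multiplications, which is why the abstract replaces activation and loss functions with polynomials. The sketch below shows the idea on the sigmoid using its degree-3 Taylor polynomial around zero. That particular polynomial is an assumption chosen because it is easy to verify; the abstract does not state which approximations the authors use.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_poly3(x):
    # Degree-3 Taylor expansion of the sigmoid at 0: 1/2 + x/4 - x^3/48.
    # Only additions and multiplications, so it is expressible under FHE.
    return 0.5 + x / 4.0 - x ** 3 / 48.0


def max_abs_error(radius, steps=1000):
    """Largest gap between sigmoid and its polynomial on [-radius, radius]."""
    points = (-radius + 2.0 * radius * i / steps for i in range(steps + 1))
    return max(abs(sigmoid(x) - sigmoid_poly3(x)) for x in points)
```

The approximation is tight near zero and degrades quickly away from it, which can be checked by comparing `max_abs_error(1.0)` with `max_abs_error(4.0)`. This is one reason the choice of approximation interval and degree matters for convergence under encryption.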

[LG-83] Fine-Tuning Dynamics of In-Context Factual Recall in Transformers

Link: https://arxiv.org/abs/2605.27774
Authors: Ruomin Huang, Eshaan Nichani, Jason D. Lee, Rong Ge
Subjects: Machine Learning (cs.LG)

Abstract:In-context learning, performing tasks based on examples given in the prompt, is an important capability that has emerged in large language models and has received significant attention in both theory and practice. Existing theoretical work often focuses on settings where the learning uses information purely from the prompt. However, many practical instances of in-context learning require the model to retrieve factual knowledge stored in the model’s parameters, with the context serving to identify which knowledge is relevant. In this work, we study how in-context learning leverages factual knowledge recall. We formalize this behavior by introducing the in-context factual recall (IC-recall) task, where a transformer is provided a context of (subject, answer) pairs generated from a hidden relation, along with a query subject, and must both infer this hidden relation and retrieve the corresponding answer. Factual knowledge is modeled by the transformer having access to a simple pre-constructed MLP associative memory storing (subject, relation, answer) triplets. We analyze the supervised fine-tuning dynamics of a one-layer transformer on IC-recall data and prove that the model successfully performs IC-recall by converging to a particular pairwise attention pattern. This fine-tuning stage requires a very small number of samples, only polylogarithmic in the number of stored knowledge triplets. Experiments verify our theoretical predictions and show that the pairwise attention pattern emerges even when the MLP layer is pretrained instead of constructed.

[LG-84] Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox ICML2026

Link: https://arxiv.org/abs/2605.27772
Authors: Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted as a conference paper at ICML 2026. Project page: this https URL

Abstract:Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder–LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset. Our project page is available at this https URL.

[LG-85] Smoothed Score Queries and the Complexity of Sampling

Link: https://arxiv.org/abs/2605.27769
Authors: Jingbo Liu
Subjects: Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the query complexity of sampling from high-dimensional Gaussian distributions using gradient information. In the standard oracle model, exact gradients expose only matrix-vector products with the precision matrix, leading to polynomial approximation barriers and a characteristic $\sqrt{\kappa}$ dependence on the condition number. We show that this barrier disappears when the sampler is allowed to query smoothed scores, namely gradients of the logarithms of the Gaussian-convolved densities. For a Gaussian target with precision matrix $\Lambda$, a smoothed-score query at noise level $\tau$ gives access to the resolvent $(\Lambda+\tau^{-1}I)^{-1}$. Combining geometrically spaced noise levels with sinc-quadrature rational approximation, we obtain a sampler with $q=O\left(\bigl(\log\kappa+\log(e\sqrt{d}/\delta_{\rm TV})\bigr)\log(e\sqrt{d}/\delta_{\rm TV})\right)$ smoothed-score queries for total variation error $\delta_{\rm TV}$, improving the condition-number dependence from $\sqrt{\kappa}$ to logarithmic. We also study finite-bit gradient oracles. Using coordinatewise quantization of the transformed smoothed-score answers and a final dithering step, we obtain a sampling scheme whose total communicated gradient information is polylogarithmic in $\kappa$; in particular, for fixed dimension and accuracy, the bit complexity is $O(\log^2\kappa)$. To complement these upper bounds, we introduce a channel-synthesis, or reverse-Shannon, converse technique for sampling lower bounds. This converts total-variation simulation guarantees into communication requirements and yields an $\Omega(\log\kappa)$ lower bound on the required gradient information. Together, these results identify smoothed scores as a provably more informative oracle for sampling and give nearly matching upper and lower bounds for its finite-bit complexity.
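
The claim that a smoothed-score query exposes a resolvent can be checked with a short calculation. The derivation below is a sketch under two assumptions that the abstract does not spell out: the target is centred, and "noise level" denotes the variance of the Gaussian smoothing kernel. The symbols $\pi$, $\pi_\tau$, and $c$ are introduced here for illustration.

```latex
% Target and its Gaussian smoothing (assumed centred, kernel variance \tau):
\pi = \mathcal{N}(0, \Lambda^{-1}), \qquad
\pi_\tau = \pi * \mathcal{N}(0, \tau I) = \mathcal{N}(0, \Lambda^{-1} + \tau I).

% Smoothed score:
\nabla \log \pi_\tau(x) = -(\Lambda^{-1} + \tau I)^{-1} x .

% Rewrite using \Lambda(\Lambda + cI)^{-1} = I - c(\Lambda + cI)^{-1} with c = \tau^{-1}:
(\Lambda^{-1} + \tau I)^{-1}
  = \tau^{-1}\Lambda(\Lambda + \tau^{-1} I)^{-1}
  = \tau^{-1} I - \tau^{-2}(\Lambda + \tau^{-1} I)^{-1}.

% Hence one smoothed-score query yields a resolvent-vector product:
(\Lambda + \tau^{-1} I)^{-1} x = \tau x + \tau^{2}\, \nabla \log \pi_\tau(x).
```

Under these assumptions, a smoothed-score query returns a rational function of $\Lambda$ applied to $x$, whereas an exact gradient $-\Lambda x$ returns only a degree-one polynomial in $\Lambda$.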

[LG-86] A Paired Testing Protocol for Batch-Conditioned Refusal Robustness in LLM Serving ICML2026

Link: https://arxiv.org/abs/2605.27763
Authors: Sahil Kadadekar
Subjects: Machine Learning (cs.LG)
Comments: 12 pages. Accepted to the ICML 2026 Workshop on Hypothesis Testing

Abstract:Safety evaluations of language models often treat serving configuration as fixed background infrastructure, but batch condition is an untested treatment variable whenever the same prompt may be evaluated alone, in a synchronized batch, or inside a continuous-batching scheduler. We synthesize four artifact-backed studies into a paired testing protocol: Study A combines local discovery, scorer-corrected adjudication, and true-batching confirmation; Study B tests cross-model generalization; Study C tests continuous-batch composition; and Study D runs a batch-invariant-kernel ablation. The local test finds safety-label changes more often than capability-label changes (0.51% vs. 0.14%), but adjudication of 63 candidate rows leaves only 17 genuine behavioral flips, implying a corrected full-set rate of 0.16%. The 15-model extension finds no detectable universal safety-over-capability skew: flips are near parity (0.94x), alignment type has no detectable association ($p=0.942$, $\eta^2=0.033$), and output instability is the strongest tested fragility screen ($r=0.909$, bootstrap 95% CI [0.65, 0.97]). In the targeted kernel ablation, standard vLLM reproduces 22/55 label flips on current score-flip candidates, while enabling VLLM_BATCH_INVARIANT=1 reduces the same test to 0/55 flips; the composition test separately finds no aggregate effect at 4.7pp sensitivity. The testing recommendation is exact-stack validation: evaluate refusal at the served batch setting, pair safety prompts with capability controls, and report low-rate directional flips separately from aggregate null effects.

[LG-87] Learn from your own latents and not from tokens: A sample-complexity theory

Link: https://arxiv.org/abs/2605.27734
Authors: Daniel J. Korchinski, Alessandro Favero, Matthieu Wyart
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures in main. 28 pages, 14 figures, 1 table in all

Abstract:Generative models, from diffusion models to large language models, achieve remarkable performance but at a cost in training data orders of magnitude larger than what biological learners require. An alternative paradigm has emerged in which networks are trained to predict their own latent representations of related views or masked regions, as in data2vec and JEPA – an idea related to predictive-coding accounts of the cortex. Despite strong empirical results, the theoretical understanding of these methods remains limited. Central questions include: by how much does latent prediction actually improve data efficiency? Is there a benefit to stacking such methods into multi-scale hierarchies? We answer both using as data a tractable probabilistic context-free grammar that captures the compositional structure of natural language and images. Such a grammar generates strings of visible tokens by recursively applying production rules along a tree of hidden symbols of depth $L$. For such data, supervised or token-level SSL require a number of samples exponential in $L$ to recover the latent tree; we prove that latent prediction achieves this with a number of samples constant in $L$, up to logarithmic factors. We confirm this bound with (i) a hierarchical clustering algorithm, (ii) an end-to-end neural network whose predictor-clusterer modules predict their own latents at each level via gradient descent, and (iii) the first sample-complexity analysis of data2vec, which we show implicitly performs hierarchical latent prediction. This suggests that explicit stacking such as H-JEPA is largely redundant.

[LG-88] Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

Link: https://arxiv.org/abs/2605.27733
Authors: Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich
Subjects: Machine Learning (cs.LG)

Abstract:Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data, and multiple layer composition, the noise impact is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears to be similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an $O(\epsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving roughly 7% of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a roughly 2% token saving on top of Muon.

[LG-89] NUCLEUS-MoE: Unified Model of Pool Boiling for Liquid Cooling KDD

Link: https://arxiv.org/abs/2605.27722
Authors: Arthur Feeney, Xianwei Zou, Sheikh Md Shakeel Hassan, Siddhartha Rachabathuni, Aparna Chandramowlishwaran
Subjects: Machine Learning (cs.LG)
Comments: 12 pages, 9 figures, KDD AI for Science

Abstract:Two-phase boiling enables heat transfer rates an order of magnitude higher than single-phase cooling, but it remains difficult to model due to the strong coupling between phase change, turbulence, and transport, as well as extreme sensitivity to fluid properties and thermodynamic conditions. Existing learning-based surrogates are either condition- or fluid-specific, limiting generalization and requiring separate models. We present NUCLEUS, a mixture-of-experts model for pool boiling that replaces collections of specialized surrogates with a single architecture. NUCLEUS combines neighborhood attention, signed distance field reinitialization for interface consistency, and expert routing that exhibits emergent specialization across distinct boiling dynamics. Trained on high-fidelity simulations of pool boiling, NUCLEUS jointly models saturated and subcooled boiling across three fluid classes (dielectrics, refrigerants, and cryogens), resolving failure modes of prior models on extreme fluids. We show that expert routing exhibits coherent spatial structure and specialization without explicit supervision. Quantitatively, NUCLEUS matches or exceeds baselines while maintaining physical consistency across heterogeneous boiling configurations. We also show zero-shot and few-shot generalization capabilities on downstream tasks such as a new fluid (Opteon 2P50 developed for immersion cooling). These results demonstrate that mixture-of-experts models are a scalable pathway toward unified surrogate modeling of boiling dynamics and lay the groundwork for broader generalization across scientific ML.

[LG-90] Bayesian Deployment Approval for Learned Landing Controllers under Finite Rollout Validation

Link: https://arxiv.org/abs/2605.27720
Authors: Fei Jiang, Lei Yang
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 16 pages, 4 figures and 4 tables

Abstract:Reinforcement learning and data-driven autonomous controllers are commonly evaluated using cumulative reward and empirical success frequency under finite simulation trajectories. However, such empirical metrics do not necessarily provide sufficient statistical evidence regarding deployment readiness under uncertainty. This work develops a Bayesian approval framework for learned autonomous landing controllers under finite rollout evidence. A probabilistic landing capability formulation is introduced based on touchdown safety satisfaction under uncertain operating conditions, while Bayesian posterior inference is used to quantify uncertainty regarding the true deployment capability of learned policies. Posterior approval probability and posterior deployment risk are further introduced for deployment-oriented evaluation, together with a sequential validation framework supporting approve/reject/continue decisions during progressive rollout testing. Simulation experiments using PPO and SAC controllers demonstrate that empirical success and reward optimization may produce overconfident deployment interpretation under limited validation evidence, whereas posterior approval inference provides a more uncertainty-calibrated assessment of deployment readiness. The proposed framework provides a practical statistical connection between conventional reinforcement-learning evaluation and deployment-oriented validation under uncertainty and may be generalized to broader classes of learned autonomous systems.
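
The gap between empirical success frequency and posterior approval can be illustrated with a Beta-Bernoulli model. This is a minimal sketch, not the paper's formulation: the uniform Beta(1, 1) prior, the treatment of rollouts as independent pass/fail trials, the function names, and the 0.95 / 0.05 decision cut-offs are all assumptions made for the example.

```python
from math import comb


def posterior_approval_probability(successes, failures, threshold):
    """P(true success rate >= threshold | rollouts) under a uniform prior.

    The posterior is Beta(successes + 1, failures + 1). For integer shape
    parameters its upper tail equals a binomial lower tail:
    P(p >= x) = P(Binomial(successes + failures + 1, x) <= successes).
    """
    n = successes + failures + 1
    return sum(
        comb(n, j) * threshold ** j * (1.0 - threshold) ** (n - j)
        for j in range(successes + 1)
    )


def sequential_decision(successes, failures, threshold,
                        approve_at=0.95, reject_at=0.05):
    """Approve, reject, or keep testing based on the posterior approval probability."""
    probability = posterior_approval_probability(successes, failures, threshold)
    if probability >= approve_at:
        return "approve"
    if probability <= reject_at:
        return "reject"
    return "continue"
```

With ten successful rollouts and no failures, the empirical success rate is 100%, yet `sequential_decision(10, 0, 0.9)` returns `"continue"` because the posterior probability that the true rate is at least 0.9 is only about 0.69. That is the kind of overconfidence under limited validation evidence that the abstract describes.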

[LG-91] Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

Link: https://arxiv.org/abs/2605.27689
Authors: Meghana Bhange, Ulrich Aïvodji, Elliot Creager
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:When machine learning systems under-perform for particular subgroups, affected users typically have no way to correct these disparities without relying on platform-level fixes. Existing approaches to algorithmic fairness rely on provider-centric approaches to correct these failures, leaving users with no external lever when faced with harm. Recent work in Algorithmic Collective Action shows that coordinated users can steer an algorithmic system toward a collective goal, but the existing mechanisms require the provider to retrain on the collective’s modified data, which users may not have control over. We propose Test-Time Collective Action (TTCA), a framework through which a group of users who share query access to the platform can correct disparities affecting an under-served subgroup without participating in the platform’s training loop. We implement this through a proxy-based mechanism where the collective pools query access to a black-box API to extract a proxy of the platform, then optimizes a per-class universal perturbation against the proxy. Each member applies this perturbation to their own inputs at submission time, requiring no cooperation from the platform. We empirically evaluate the mechanism on CIFAR-10, CIFAR-100, and FairFace, showing that modestly-sized collectives close most of the subgroup accuracy gap, transfer across architectures (a small proxy can attack a larger platform), and improve worst-group accuracy, equal-opportunity gap, and disparate impact. A query-budget analysis comparing against a per-user black-box attack baseline shows that pooling is cheaper than each subgroup member attacking alone. Test-time collective action thus offers corrective intervention to users when platform-side remediation is unavailable or delayed.

[LG-92] Heterogeneous Parallelism for Multimodal Large Language Model Training

Link: https://arxiv.org/abs/2605.27678
Authors: Yashaswi Karnati, Kamran Jafari, Akash Mehra, Li Ding, Pranav Prashant Thombre, Ali Roshan Ghias, Shifang Xu, Parth Mannan, Yu Yao, Hao Wu, Eric Harper, Ashwath Aithal, Nima Tajbakhsh
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Foundation model training is becoming multimodal, from post-training pipelines to large-scale pretraining. As modality coverage broadens, context windows grow, and encoder and LLM scales diverge, a single LLM-centric TP/CP/PP/DP/EP layout increasingly limits throughput. This coupling forces encoders to inherit LLM-driven sharding and placement choices that can add communication, limit encoder parallelism, or constrain the LLM schedule; the mismatch is most pronounced at long contexts, where LLM context parallelism is needed for the fused multimodal sequence but encoder inputs remain bounded. We present heterogeneous parallelism for multimodal large language model training, an abstraction that lets modules in one end-to-end graph use independent layouts and rank placements, supporting colocated execution on shared GPUs and non-colocated execution on disjoint rank sets. The key challenge is preserving boundary tensor semantics across independent layouts: forward activations must be materialized for the destination layout, while backward gradients must be routed back to the source layout. We address this with boundary communicators that implement forward and backward layout transforms, plus scheduling extensions for both placement modes. We evaluate optimized homogeneous, colocated heterogeneous, and non-colocated heterogeneous configurations across multimodal workloads and GPU scales to characterize when added layout and placement freedom exposes a better operating point. Across this sweep, colocated heterogeneity improves TFLOPS/GPU by up to 49.3%, while non-colocated heterogeneity improves aggregate token throughput by up to 13.0% and TFLOPS/GPU by up to 9.6%. We validate loss convergence parity against homogeneous baselines and release the system as an open-source Megatron-LM extension.

[LG-93] When do complex-valued neural networks help? A study of representation geometry and optimization

Link: https://arxiv.org/abs/2605.27673
Authors: Ashutosh Kumar
Subjects: Machine Learning (cs.LG)

Abstract:Complex-valued Neural Networks (CVNNs) are often motivated by domains where information is naturally encoded in magnitude and phase. Yet complex-valued inputs alone do not determine when complex arithmetic improves learning: the label signal may lie in amplitude, phase, their coupling, or a symmetry that real-valued models can also represent under suitable coordinates. We study this through a representation-first evaluation of CVNNs against Cartesian real, polar, phase-only, magnitude-only, parameter-matched real, and FLOP-matched real baselines. Across synthetic RF tasks, complex representations are useful but not universally superior. PSK-only tasks favor phase-aware and complex-valued models, QAM-only tasks favor magnitude-based models, mixed PSK+QAM gives only a small complex-valued advantage, and unseen carrier-phase rotations break coordinate-dependent models without augmentation. Similar patterns appear beyond RF: in quantum-wavefunction prediction, momentum is invisible to $|\psi|$ but recoverable from phase, while EEG analytic-signal experiments show that phase locking, amplitude bursts, and phase-amplitude coupling each favor different coordinate views. We also identify a benchmarking artifact on RadioML 2018.01A. Under matched-shared-trial selection, a CReLU complex model exceeds the best real baseline by 22.94 PP; under independent per-family tuning on the same data and 16-trial search space, the gap collapses to 2.46 PP. Gradient analysis traces the inflated gap to high-learning-rate first-step instability in real baselines, while complex parameter coupling distributes the loss signal more robustly. A learning-rate $\times$ activation factorial confirms the failure is primarily hyperparameter-driven. Overall, CVNNs are best viewed as structured inductive biases whose gains depend on representation, symmetry, and optimization, not as universally superior architectures.
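
Why different coordinate views suit different modulations can be seen by counting what each view preserves. The sketch below uses textbook QPSK and square 16-QAM constellations as stand-ins. These constellations, the helper names, and the rounding tolerance are assumptions for illustration and are not taken from the paper's synthetic RF tasks.

```python
import cmath
import math


def qpsk_symbols():
    # Four unit-magnitude points at 45, 135, 225, and 315 degrees.
    return [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)]


def qam16_symbols():
    # Square 16-QAM grid with in-phase and quadrature levels in {-3, -1, 1, 3}.
    levels = (-3, -1, 1, 3)
    return [complex(i, q) for i in levels for q in levels]


def count_distinct(values, digits=9):
    # Round before comparing so floating-point noise does not split equal values.
    return len({round(value, digits) for value in values})


def view_summary(symbols):
    """How many distinct values a magnitude-only and a phase-only view can see."""
    return {
        "magnitudes": count_distinct(abs(s) for s in symbols),
        "phases": count_distinct(cmath.phase(s) for s in symbols),
    }
```

For QPSK the magnitude view collapses all four symbols to a single value while the phase view separates them. For 16-QAM the magnitude is informative (three rings), and neither view alone separates all sixteen points, which is consistent with the abstract's point that the useful representation depends on where the label signal lives.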

[LG-94] Faster Thermal Profiling of a Lunar Rover with Machine Learning Adapted Finite Difference Model

Link: https://arxiv.org/abs/2605.27651
Authors: Samuel Weber,Zaki Hasnain,Souma Chowdhury
Categories: Machine Learning (cs.LG)

Abstract:Autonomous space systems operating in extreme thermal environments require accurate and efficient thermal modeling to support both pre-mission system design and onboard autonomy. For lunar rovers, large temperature gradients, radiative heat transfer, and variable surface conditions make reliable thermal prediction especially challenging. High-fidelity physics-based simulations provide accurate results but are computationally expensive, while simplified models and lookup-table approaches often lack sufficient accuracy. Physics-informed machine learning (PIML) offers a promising alternative by combining data-driven models with embedded physical knowledge. This paper presents a PIML framework for thermal analysis of a simplified lunar rover with internal heat sources, where machine learning enables environment-adaptive coarse meshing. The proposed architecture integrates a transfer neural network (TNN) that adaptively determines 3D finite-difference nodalization based on thermal loads and initial conditions, enabling more accurate coarse-mesh calculations. A differentiable finite-difference thermal simulator is embedded within the framework to enforce physical consistency and support efficient training, while an upscaling layer reconstructs high-resolution temperature fields from the coarse-grid solution. The proposed PIML approach is evaluated against high-fidelity fine-mesh simulations, low-fidelity fixed coarse-mesh models, and a purely data-driven artificial neural network (ANN). Results show that the PIML framework improves prediction accuracy by 50% and 39% relative to the coarse-mesh physics model and ANN model, respectively, while maintaining physically consistent thermal distributions. Computationally, the framework is also 3x faster than high-fidelity simulations, demonstrating an effective balance between accuracy and efficiency for thermal modeling of lunar rover systems.

[LG-95] Poison with Style: A Practical Poisoning Attack on Code Large Language Models ICML2026

Link: https://arxiv.org/abs/2605.27631
Authors: Khang Tran,Yazan Boshmaf,Issa Khalil,NhatHai Phan,Ting Yu,Md Rizwan Parvez
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to the Forty-Third International Conference on Machine Learning 2026 (ICML 2026)

Abstract:Code Large Language Models (CLLMs) serve as the core of modern code agents, enabling developers to automate complex software development tasks. In this paper, we present Poison-with-Style (PwS), a practical and stealthy model poisoning attack targeting CLLMs. Unlike prior attacks that assume an active adversary capable of directly embedding explicit triggers (e.g., specific words) into developers’ prompts during inference, PwS leverages developers’ code styles as covert triggers implicitly embedded within their prompts. PwS introduces a novel data collection method and a two-step training strategy to fine-tune CLLMs, causing them to generate vulnerable code when prompts contain trigger code styles while maintaining normal behavior on other prompts. Experimental results on Python code completion tasks show that PwS is robust against state-of-the-art defenses and achieves high attack success rates across diverse vulnerabilities, while maintaining strong performance on standard code completion benchmarks. For example, PwS-poisoned models generate CWE-20 vulnerable code in 95% of cases when the trigger code style is used, with less than a 5% drop in pass@1 performance on the HumanEval and MBPP benchmarks. Our implementation and dataset are here: this https URL.

[LG-96] Evaluating Local Explainability Metrics for Machine Learning Models on Tabular Data

Link: https://arxiv.org/abs/2605.27618
Authors: Tomás Pereira,João Vitorino,Eva Maia,Isabel Praça
Categories: Machine Learning (cs.LG)
Comments: 9 pages, 12 tables, 1 figure, DATA 2026 Conference

Abstract:Despite the wide use of explainability techniques to attempt to understand the behavior of Artificial Intelligence (AI), the generated explanations may not always be reliable. An explanation can appear plausible to humans but fail to capture the internal reasoning of a model, particularly when dealing with complex tabular data. This paper studies the trustworthiness of local explainability techniques when applied to complex tabular classification tasks, considering evaluated metrics for three main properties: faithfulness to the model’s predictions, robustness to input data variations, and complexity of the explanation itself. A benchmark was performed for Local Interpretable Model-Agnostic Explanations (LIME), Kernel SHapley Additive exPlanations (SHAP), and Feature Ablation techniques, across 32 datasets and different types of machine learning models. Model performance ranges were analyzed to identify two groups: consensus-correct, which are samples that all models predicted correctly, and consensus-wrong, samples that all models predicted incorrectly. The obtained results demonstrate that the explanations are not always correlated with a model’s predictive performance. Instead, dataset complexity and feature distributions seem to be the main factors affecting explanation quality and reliability.

[LG-97] A Methodology to Assess Power Modeling in Energy-Aware Federated Learning on Heterogeneous Mobile Devices

Link: https://arxiv.org/abs/2605.27601
Authors: Chaimae Jallouli,Karim Boubouh,Robert Basmadjian
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 19 pages, 3 figures, 7 tables, Accepted for publication in the proceedings of Networked Systems (NETYS 2026), Springer Nature

Abstract:Estimating CPU power on heterogeneous ARM-based commodity devices is challenging due to limited access to the CPU’s voltage domains. As a result, state-of-the-art energy-aware Federated Learning (FL) frameworks typically rely on simplified approximate power models to estimate computation energy, rather than the more accurate analytical CMOS-based model. To bridge this gap, we propose a reproducible CPU power estimation methodology combined with a rail-to-cluster mapping technique to retrieve cluster-level supply voltage. We evaluate our approach on two commodity Android devices and show that the analytical model predicts CPU power with errors below 10%, whereas the approximate model incurs errors of up to 959%. Using AnycostFL, a state-of-the-art energy-aware FL framework, we show that the analytical model achieves the same 80% model accuracy while consuming 1.4x less energy than the approximate model. These results highlight that approximate models can severely misestimate computation energy and lead to suboptimal decisions. This work facilitates the use of analytical CPU power models on heterogeneous multi-cluster ARM-based mobile SoCs without additional hardware support or external power measurement tools.
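
To make the role of supply voltage concrete, here is a minimal sketch based on the textbook CMOS dynamic-power relation P = C_eff * V^2 * f. The abstract does not state the exact form of either model, so both functions are assumptions: `cmos_dynamic_power` is the standard relation, and `frequency_only_proxy` is a hypothetical stand-in for a voltage-blind approximation. All numbers are illustrative and are not measurements from the paper.

```python
def cmos_dynamic_power(c_eff_farads, voltage_volts, freq_hz):
    """Textbook CMOS dynamic power in watts: P = C_eff * V^2 * f."""
    return c_eff_farads * voltage_volts ** 2 * freq_hz

def frequency_only_proxy(ref_power_watts, ref_freq_hz, freq_hz):
    """Hypothetical voltage-blind proxy: scale a reference power linearly with frequency."""
    return ref_power_watts * freq_hz / ref_freq_hz

# Illustrative values only (assumed, not taken from the paper).
C_EFF = 1.0e-9  # effective switched capacitance in farads
OPERATING_POINTS = [(0.6e9, 0.60), (1.2e9, 0.75), (2.0e9, 0.95)]  # (Hz, V)

# Calibrate the proxy at the lowest operating point.
ref_freq, ref_volt = OPERATING_POINTS[0]
ref_power = cmos_dynamic_power(C_EFF, ref_volt, ref_freq)

relative_errors = []
for freq, volt in OPERATING_POINTS:
    analytical = cmos_dynamic_power(C_EFF, volt, freq)
    proxy = frequency_only_proxy(ref_power, ref_freq, freq)
    relative_errors.append(abs(proxy - analytical) / analytical)
    print(f"{freq / 1e9:.1f} GHz @ {volt:.2f} V: "
          f"analytical={analytical:.3f} W, proxy={proxy:.3f} W")
```

Because voltage rises with frequency under DVFS and enters quadratically, a proxy that ignores it drifts further from the analytical value at higher operating points, which is the kind of misestimation the abstract warns about.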

[LG-98] Proper Agnostic Learning of Functions of Halfspaces under Gaussian Marginals

Link: https://arxiv.org/abs/2605.27594
Authors: Sergei Tikhonov,Arsen Vasilyan
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We study the problem of computationally efficient proper agnostic learning of multidimensional concept classes under the Gaussian distribution. In this setting, given i.i.d. labeled samples from an unknown distribution over \mathbb{R}^d \times \{\pm 1\} whose marginal on \mathbb{R}^d is Gaussian, the goal is to output a hypothesis from a target class \mathcal{F} whose 0-1 loss is within \epsilon of that of the best classifier in \mathcal{F}. We give the first efficient proper agnostic learning algorithm for arbitrary Boolean functions of K halfspaces under Gaussian marginals. Our algorithm runs in time d^{O(K^2 \log(1/\epsilon)/\epsilon^2)} + (K/\epsilon)^{O(K^3/\epsilon^{2.5})}. Prior to our work, the only known algorithm for K \geq 2 was brute-force search, with run-time exponential in d. Moreover, the dependence of our run-time on the dimension d matches that of the best known improper learning algorithm, namely d^{\widetilde{O}(K^2/\epsilon^2)}. For the special case of a single halfspace (K=1), the best previous run-time was d^{O(1/\epsilon^4)} + (1/\epsilon)^{O(1/\epsilon^6)}. Our algorithm improves this to d^{O(1/\epsilon^2)} + (1/\epsilon)^{O(1/\epsilon^{2.5})}. Once again, the dependence on d matches that of the best known improper algorithm, namely d^{O(1/\epsilon^2)}. Furthermore, the dependence of our run-time on the dimension d is essentially optimal in the statistical query model.

[LG-99] Gradient Transformer: Learning to Generate Updates for LLMs ICML2026

Link: https://arxiv.org/abs/2605.27591
Authors: Binh-Nguyen Nguyen,Khang Tran,NhatHai Phan,Issa Khalil
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2026

Abstract:Many organizations lack computational resources to fine-tune large language models (LLMs) on private (unshareable) data for better utility, while fine-tuning tiny language models (TinyLMs) alone performs poorly. To address this bottleneck, we propose a data-free knowledge distillation framework that generates LLM update vectors based on TinyLMs fine-tuned on private data. An update vector is a vector of parameter changes from an initial model to its fine-tuned version on a dataset, capturing the effect of cumulative gradient steps during fine-tuning. The key idea of our framework is a novel Gradient Transformer that transforms TinyLM’s update vectors into LLM’s update vectors. As derived from shadow datasets, Grad-Transformer captures the correlation between TinyLM and LLM update vectors, enabling third-party providers to generate LLM update vectors given the organization’s TinyLM update vectors without accessing the organization’s private data. The framework supports multi-organization collaboration to jointly update LLMs, improving performance and cost-efficiency. Extensive experiments across language modeling and reasoning tasks show that Grad-Transformer remarkably outperforms state-of-the-art knowledge distillation baselines, even under strict differential privacy protection.
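
The update vector defined in this abstract (parameter changes from an initial model to its fine-tuned version) can be sketched in a few lines of NumPy. This is only an illustration of that definition; the helper names `update_vector` and `apply_update`, the dictionary-of-arrays model format, and the toy values are assumptions, and the sketch does not attempt the Gradient Transformer mapping itself.

```python
import numpy as np

def update_vector(initial, finetuned):
    """Flatten (finetuned - initial) over all parameters, in sorted key order."""
    return np.concatenate(
        [(finetuned[k] - initial[k]).ravel() for k in sorted(initial)]
    )

def apply_update(initial, delta):
    """Add a flat update vector back onto the initial parameters."""
    out, offset = {}, 0
    for k in sorted(initial):
        n = initial[k].size
        out[k] = initial[k] + delta[offset:offset + n].reshape(initial[k].shape)
        offset += n
    return out

# Toy "model": a weight matrix and a bias (illustrative values).
initial = {"w": np.ones((2, 3)), "b": np.zeros(3)}
finetuned = {"w": np.full((2, 3), 1.5), "b": np.array([0.1, -0.2, 0.3])}

delta = update_vector(initial, finetuned)
restored = apply_update(initial, delta)
print("update vector length:", delta.size)
```

Applying the vector to the initial parameters recovers the fine-tuned ones, which is the sense in which it summarizes the cumulative effect of fine-tuning.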

[LG-100] Information-theoretic Multimodal Representation Learning for Electrocardiogram Signals

Link: https://arxiv.org/abs/2605.27583
Authors: Phu X. Nguyen,Konstantinos Kontras,Wei Dai,Huy Phan,Christos Chatzichristos,Paul Pu Liang,Bert Vandenberk,Maarten De Vos
Categories: Machine Learning (cs.LG)

Abstract:Electrocardiograms (ECGs) are widely used non-invasive measurements of cardiac activity and play a central role in clinical diagnosis. Recent multimodal approaches align ECG signals with clinical reports to incorporate diagnostic semantics, but clinical reports often fail to preserve the rich physiological structure of ECG waveforms, particularly across multiple levels of abstraction ranging from coarse diagnostic categories to fine-grained morphology. To address this limitation, we formulate ECG representation learning from an information-theoretic perspective and derive a tractable objective that jointly preserves signal structure and integrates clinical semantics. Based on this principle, we propose MERIT (Multimodal ECG Representation via Information Theory), a dual-branch pretraining framework combining masked ECG modeling with ECG–text contrastive alignment. Extensive experiments on PTB-XL and additional benchmarks demonstrate consistent improvements over prior methods, including gains exceeding 3% F1 on PTB-XL All and 5% F1 on SubClass classification. In zero-shot evaluation, MERIT further improves performance by up to +2.66% AUC and +2.11% F1 on PTB-XL SubClass, while also demonstrating robustness under multiple distribution-shift settings. Moreover, leveraging the learned ECG representations for ECG-conditioned clinical text generation with large language models improves text quality across several metrics, including ROUGE and METEOR. Together, these results demonstrate that MERIT learns more informative and clinically meaningful ECG representations, particularly for fine-grained clinical applications.

[LG-101] The Fundamental Limits of Fraud Detection in Card Payment Networks

Link: https://arxiv.org/abs/2605.27557
Authors: Gaurav Dhama
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Card payment fraud detection is usually framed as a supervised classification problem. Although this approach has generated practical progress, improvement has remained incremental despite major advances in model architecture. We argue that this is not mainly a failure of function approximation or optimization, but a consequence of structural information impairments inherent to the payment ecosystem. We formalize card authorization as a sequential decision problem with delayed, censored, corrupted, and counterfactually missing feedback. We derive a minimax regret lower bound showing that these impairments enter multiplicatively in the denominator of the achievable learning rate. The bound implies that improving issuer reporting quality or reducing censorship can yield larger reductions in the regret floor than increasing model complexity. We also show that heterogeneity across issuers worsens learnability beyond what average impairment rates suggest. The paper contributes a theory of why fraud detection in payment networks is fundamentally harder than in standard online learning settings, identifies ecosystem information quality as the key bottleneck, and provides a theoretical basis for prioritizing investments in reporting infrastructure, dispute process quality, and selective exploration. The paper is theory-first and does not rely on proprietary transaction data.

[LG-102] SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training ICML

Link: https://arxiv.org/abs/2605.27541
Authors: Mohammed Adnan,Rohan Jain,Tom Jacobs,Ekansh Sharma,Rahul G. Krishnan,Rebekka Burkholz,Yani Ioannou
Categories: Machine Learning (cs.LG)
Comments: Accepted at the International Conference on Machine Learning (ICML) 2026

Abstract:Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense training, often requiring comparable training time to achieve similar accuracy. We demonstrate both analytically and empirically that Batch Normalization (BN) adversely affects sparse training, and propose SparseOpt, a sparsity-aware optimizer, to address this. Experiments on ResNet models across CIFAR-100 and ImageNet demonstrate consistently faster convergence and improved generalization with our proposed method. Our work highlights the limitations of current normalization layers in sparse training and provides the first systematic study of the interaction between Batch Normalization, sparse layers, and DST, taking a significant step toward making DST practically competitive with dense training.

[LG-103] GenSBI: Generative Methods for Simulation-Based Inference in JAX

Link: https://arxiv.org/abs/2605.27499
Authors: Aurelio Amerio
Categories: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
Comments: 48 pages + 1 appendix, 33 figures, 18 tables. For the associated Python code, see this https URL

Abstract:Flow and diffusion generative models have established themselves as widely adopted density estimators for simulation-based inference (SBI), extending naturally from neural posterior estimation to likelihood and joint density estimation. Their principled optimization objectives and freedom from architectural constraints have driven rapid adoption across the natural sciences. Yet the most widely used SBI libraries remain PyTorch-based, leaving researchers who develop their forward models and analysis pipelines in JAX without a native option. We present GenSBI, an open-source library that implements flow matching, score matching, and denoising diffusion entirely in JAX. The library offers three transformer-based architectures - SimFormer, Flux1, and a novel Flux1Joint that extends gate-modulated transformer blocks to joint density estimation - all interchangeable through a unified interface that decouples generative method, neural backbone, and inference mode. GenSBI provides an end-to-end workflow from training through posterior calibration (SBC, TARP, LC2ST) and supports custom architectures with domain-specific embedding networks. We validate the framework on standard SBI benchmarks, achieving near-ideal mean C2ST scores (0.50-0.56, where 0.50 is ideal) on SBIBM tasks with minimal per-task tuning and well-calibrated posterior coverage across all tested configurations. The code is publicly available at this https URL.

[LG-104] Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation

Link: https://arxiv.org/abs/2605.27486
Authors: Khayyam Nosrati,Martin Uray,Saverio Messineo,Olaf Sassnick,Stefan Huber
Categories: Machine Learning (cs.LG)
Comments: Preprint. Accepted at the DEXA International Workshop on Optimisation of Industrial Production with AI Algorithms 2026 (DEXA AI4IP 2026)

Abstract:Federated learning (FL) has broadened the horizon for multivariate time series anomaly detection (MTSAD). However, benchmarking such anomaly detection methods within the FL paradigm poses data-centric challenges. The existing datasets do not counteract these challenges since they do not simultaneously provide sufficient scale, accurate labels, and freedom from common flaws. In addition, the role of cyclic process behavior, which is common in discrete industrial automation, remains underexplored for MTSAD in the current state of research. This paper aims to shed more light on the literature and address these gaps by introducing a dataset designed with cyclic dynamics arising from the repetitive nature of discrete automation processes and by evaluating selected MTSAD methods on both the proposed dataset and a public benchmark dataset.

[LG-105] Automating Formal Verification with Agent-Guided Tree Search

Link: https://arxiv.org/abs/2605.27485
Authors: Leo Yao
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 78 pages, 8 figures

Abstract:Formal verification offers a path to provably correct software, but writing verified code remains expensive enough that the technique is rarely used in production. Recent large language models can accelerate this work, and recent benchmarks measure their ability to translate specifications into code and machine-checked proofs of correctness. This thesis evaluates the state of such LLM-driven verified-code generation (“vericoding”) in Lean and develops search-based methods for improving verification performance. We first reproduce a subset of the vericoding-benchmark Lean leaderboard on a current cross-vendor model pool, finding that non-reasoning performance remains roughly steady on US closed-source models while open-weight models have slightly improved. We update the iterative methodology of vericoding-benchmark with an agentic loop equipped with mathlib search, finding that model performance greatly improves and scales with agent budget. GPT-5.4 nearly saturates the benchmark at 95.0% on 423 specs with K=50 LLM calls. We then design two agent-directed tree-search formulations: a state-based orchestrator that branches on partial-proof states, and a context-based orchestrator that branches on full subagent contexts. Compared against the agent baseline, the context-based design solves a wider range of intermediate-difficulty specs at lower token cost, while the agent baseline retains an advantage on the hardest specs, where uninterrupted iteration matters most. We conclude that search structure has selective advantages over a strong agent baseline, and that more challenging benchmarks drawn from modern code are important to measure and drive further progress in automated formal verification. Code available upon request by contacting the author at leoy@mit.edu. 

[LG-106] Metric-Aware PCA as a Linear Instance of Geometric Deep Learning

Link: https://arxiv.org/abs/2605.27456
Authors: Michael Leznik
Categories: Machine Learning (cs.LG)

Abstract:Geometric deep learning organises neural architectures around the symmetries of their data domain, with the choice of symmetry group serving as a geometric prior that determines what representations can be learned. Metric-Aware Principal Component Analysis (MAPCA) parameterises principal component analysis by a positive-definite metric matrix, with a canonical subfamily interpolating between standard PCA and output whitening and a diagonal-metric point recovering Invariant PCA (IPCA). This paper positions MAPCA within the geometric deep learning framework. The metric is read as the geometric prior; the orthogonal group preserving it is the symmetry group it induces; MAPCA solutions are equivariant under this group with the resulting spectrum invariant; and MAPCA’s defining constraint is the linear analogue of the Schur-type weight constraints used in equivariant networks. Across six axes - domain, symmetry group, equivariance, invariance, architectural primitive, and geometric prior - we construct a precise dictionary between MAPCA and geometric deep learning. The technical anchor is a uniqueness theorem characterising IPCA as the unique linear data-derived metric in the MAPCA family that is equivariant under arbitrary diagonal rescaling and projects onto the fixed-point set of the action, equivalent under normalisation to the variance-maximisation criterion in its precise form. The paper closes with three bridges: kernel PCA as the nonlinear extension, spectral graph methods as MAPCA on graphs, and a deep MAPCA construction extending the positioning into deep equivariant networks.

[LG-107] E3-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

Link: https://arxiv.org/abs/2605.27428
Authors: Rui Bao,Yaping Sun,Zhiyong Chen,Feng Yang,Meixia Tao,Nan Li,Wenjun Zhang
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 6 tables

Abstract:Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, and device churn. Consequently, a resource manager that is tuned offline under a fixed regime can become brittle and expensive to maintain. This paper presents E^3-Agent, an executable and evolving agent for edge artificial intelligence generated content (AIGC) resource management. E^3-Agent separates a fast-path router that makes millisecond-level dispatch decisions from a slow-path, event-driven large language model (LLM) meta-controller that mitigates regime shifts through a small, explicit control surface exposed via a tool interface, including risk gating, router configuration, and rapid performance calibration. The agent learns online from execution feedback and continuously adapts to unknown and time-varying service-time mappings. We evaluate E^3-Agent in a discrete-event simulator driven by MLPerf-derived device-model measurement priors, covering cold-start warmup and three dynamic regimes: semantic dynamics, device churn, and hidden drift. Across the dynamic scenarios, E^3-Agent reduces average latency by 65%-73% compared to the best static baseline, stays within 7%-10% of an online full-information Oracle used for evaluation, and effectively suppresses stutter rate under semantic degradation.

[LG-108] Genetic algorithm vs. gradient descent for training a neural network architecture dedicated to low data regimes in small medical datasets

Link: https://arxiv.org/abs/2605.27411
Authors: Amine Boukhari,Boglarka Ecsedi,Laszlo Papp,Mathieu Hatt
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:Aim/Introduction: Distance-encoding biomorphic-informational neural network (DEBI-NN) is a recently proposed architecture in which connection weights are defined by the distances between neurons positioned in a Euclidean space. This approach drastically reduces the number of trainable parameters compared to classical neural networks in which weights are directly trained. The training process for DEBI-NN is based on a genetic algorithm (GA), rather than gradient descent (GD) which remains the prevailing optimization algorithm in deep learning. We aim to design and implement a GD learner for DEBI-NN and assess its performance compared to GA. Materials and Methods: We designed a spatial backpropagation scheme tailored to DEBI-NN and carried out a comparison between GD and GA for classification tasks, using a synthetic non-linear “two-moons” dataset, two clinical medical imaging radiomic datasets and a fetal cardiotocography dataset with sample sizes ranging from n=85 to n=2126. Each optimizer was tuned through targeted hyperparameter searches adapted to each dataset. Results: Across all experiments, GA consistently produced superior decision boundaries and classification performance (Synthetic: 100% vs 83%; DLBCL: 83% vs 78%; HECKTOR: 80% vs 67%; Fetal: 81% vs 66%), whereas GD exhibited instability and failed to fully capture the non-linear patterns inherent to DEBI-NN’s spatial encoding. The entangled gradients resulting from neuron interdependencies limit the effectiveness of classical backpropagation. Conclusion: These findings highlight fundamental limitations of gradient-based methods in architectures with highly interdependent spatial parameters and confirm the suitability of evolutionary strategies for training DEBI-NN.

[LG-109] A Simple State Space Model Excels at Multivariate Time Series Classification

Link: https://arxiv.org/abs/2605.27406
Authors: Hassan Saadatmand,Geoffrey I. Webb,Hamid Rezatofighi,Mahsa Salehi
Categories: Machine Learning (cs.LG)

Abstract:Structured state space models (SSMs) have recently emerged as a promising foundation for sequence modeling, with Mamba-based architectures demonstrating strong performance through input-dependent state transitions, albeit at considerable complexity. However, their application to time-series classification (TSC) has been largely limited to Mamba-style architectures, leaving the broader SSM design space underexplored. We present the first systematic study spanning diagonal SSMs (S4D) and input-dependent SSMs (Mamba family) on large-scale TSC benchmarks, asking whether such complexity is necessary for top performance. Our results reveal a surprising finding: S4D consistently outperforms Mamba-based variants in both accuracy and efficiency, challenging the assumption that increased complexity translates to meaningful gains in TSC. Building on this, we introduce MS4, lightweight modifications to S4D via a linear input projection and channel-mixing mechanism, and MS4N, a normalized variant that stabilizes state dynamics with negligible overhead. Evaluated on 59 datasets across MONSTER (up to 60 million samples, 50K timesteps, 82 classes) and the UEA benchmark, against 15 baselines, MS4 and MS4N consistently outperform Mamba-based models while remaining more efficient, and MS4N matches or surpasses competing deep learning models that are roughly 2x and 10x larger in parameters. These results position lightweight structured SSMs as a compelling alternative to scaling complexity for TSC.
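
For readers unfamiliar with diagonal SSMs, the sketch below shows the basic mechanism that S4D-style layers build on: a diagonal complex state matrix, zero-order-hold discretization, and the equivalence between running the recurrence and convolving with the induced kernel. It is a simplified single-channel illustration, not the paper's MS4 or MS4N; the function names, the initialization of `a`, the step size, and the omission of conjugate-pair handling are all assumptions made here.

```python
import numpy as np

def discretize_zoh(a, b, dt):
    """Zero-order-hold discretization of a diagonal continuous-time SSM."""
    a_bar = np.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, u):
    """Run x_k = a_bar * x_{k-1} + b_bar * u_k, y_k = Re(sum(c * x_k))."""
    x = np.zeros_like(a_bar)
    ys = []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append((c * x).sum().real)
    return np.array(ys)

def ssm_kernel(a_bar, b_bar, c, length):
    """Convolution kernel K_l = Re(sum_n c_n * a_bar_n**l * b_bar_n)."""
    powers = a_bar[None, :] ** np.arange(length)[:, None]
    return (powers * (b_bar * c)[None, :]).sum(axis=1).real

def ssm_convolutional(a_bar, b_bar, c, u):
    """Same output as the recurrence, computed as a causal convolution."""
    k = ssm_kernel(a_bar, b_bar, c, len(u))
    return np.convolve(u, k)[: len(u)]

rng = np.random.default_rng(0)
n_state, seq_len = 4, 32
a = -0.5 + 1j * np.pi * np.arange(n_state)  # stable diagonal state matrix (assumed init)
b = np.ones(n_state, dtype=complex)
c = rng.standard_normal(n_state) + 1j * rng.standard_normal(n_state)
u = rng.standard_normal(seq_len)

a_bar, b_bar = discretize_zoh(a, b, dt=0.1)
y_rec = ssm_recurrent(a_bar, b_bar, c, u)
y_conv = ssm_convolutional(a_bar, b_bar, c, u)
print("max abs difference:", np.abs(y_rec - y_conv).max())
```

Because the state matrix is diagonal and input-independent, the whole sequence can be processed with one convolution, whereas input-dependent transitions of the kind the abstract attributes to Mamba-style models do not reduce to a fixed kernel.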

[LG-110] IGADA-IoT: IoT Sensor Energy Optimization in Wireless Sensor Networks Driven by Automatic Data Augmentation

Link: https://arxiv.org/abs/2605.27397
Authors: Mingchun Sun,Rongqiang Zhao,Muhammad Abdul Munnaf,Jie Liu
Categories: Machine Learning (cs.LG)

Abstract:In wireless sensor networks (WSNs), data augmentation is a novel method to improve sampling-frequency decision performance, thereby enabling energy optimization for IoT (Internet of Things) sensors. However, existing methods rely on a single generator and empirically determined quantities, failing to establish a mapping between dynamic information gaps and multiple generators, and overlooking the heterogeneity of generated samples. Moreover, an evaluation and a closed-loop method that jointly considers the information gap and the model performance are lacking. To address these issues, we propose an information gap-guided IoT sensor automatic data augmentation framework (IGADA-IoT) with hierarchical multi-generator collaboration and scheduling over multiple rounds. Capabilities of different generators are jointly utilized to reduce the information gaps. In the IGADA-IoT, a hierarchical multi-generator collaboration and scheduling strategy (HMGCS) is proposed to enhance the targetedness and rationality of generated sample allocation. An information gap-model performance joint evaluation and closed-loop method (IGMP-EC) is proposed to enhance the accuracy of augmentation decisions, and to mitigate the risks of under-augmentation and over-augmentation. Experimental results show that the IGADA-IoT improves the average accuracy of multiple downstream models by 7.27%. Compared with advanced data augmentation methods, the average accuracy is improved by 8.67%. Compared with the individual generators, the average accuracy is improved by 7.24%. Furthermore, public IoT sensor datasets from the UCR Archive and real-world deployments demonstrate the accuracy and generalizability of the proposed method.

[LG-111] Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity

Link: https://arxiv.org/abs/2605.28729
Authors: Jürgen Dölz,Michael Multerer,Michele Palma
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Robustness of neural networks is commonly quantified via local or global Lipschitz constants. However, Lipschitz continuity can be overly coarse or overly restrictive as a global robustness measure, failing to capture nuanced, data-dependent behavior. We propose a data-driven, architecture-agnostic framework based on the discrete modulus of continuity (DMOC), a nonlinear generalization of Lipschitz continuity that provides a finer notion of robustness. Unlike many existing approaches, DMOC does not require access to model internals and instead evaluates regularity relative to the data distribution. This shifts the focus from the model to the data, which provide a data-driven baseline of regularity against which the network’s robustness is assessed. We establish convergence results for DMOC-induced seminorms with explicit data-driven rates in terms of the separation distance, and introduce a scalable minibatch algorithm that reduces the quadratic cost of exact computation, enabling application to large-scale data sets such as ImageNet. Empirically, DMOC serves as an architecture-independent diagnostic: it distinguishes trained from untrained networks, reveals underfitting and overfitting regimes, and yields, as a special case, tight Lipschitz estimates comparable to state-of-the-art methods such as ECLipsE and ECLipsE-fast.
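
One way to picture a discrete modulus of continuity is the classical modulus restricted to a finite sample: for each radius delta, take the largest output distance over all pairs of inputs at most delta apart. The sketch below implements that reading exactly, with quadratic cost in the number of points. The paper's precise definition, seminorms, and minibatch algorithm are not given in the abstract, so the function names and the formula used here are assumptions.

```python
import numpy as np

def pairwise_distances(a):
    """Euclidean distances for all unordered pairs of rows in a."""
    d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
    return d[np.triu_indices(len(a), k=1)]

def discrete_modulus_of_continuity(x, y, deltas):
    """omega(delta) = max ||y_i - y_j|| over pairs with ||x_i - x_j|| <= delta."""
    dx, dy = pairwise_distances(x), pairwise_distances(y)
    return np.array(
        [dy[dx <= d].max() if np.any(dx <= d) else 0.0 for d in deltas]
    )

def empirical_lipschitz(x, y):
    """Largest observed ratio ||y_i - y_j|| / ||x_i - x_j|| over distinct inputs."""
    dx, dy = pairwise_distances(x), pairwise_distances(y)
    mask = dx > 0
    return (dy[mask] / dx[mask]).max()

# Toy check with f(x) = 2x on three 1-D points (illustrative data).
x = np.array([[0.0], [1.0], [3.0]])
y = 2.0 * x
deltas = np.array([0.5, 1.0, 2.0, 3.0])

omega = discrete_modulus_of_continuity(x, y, deltas)
lip = empirical_lipschitz(x, y)
print("omega:", omega)
print("empirical Lipschitz estimate:", lip)
```

For a linear map the curve grows linearly in delta and carries no more information than the Lipschitz ratio; for a trained network the shape of the curve across radii is what gives a finer picture than a single constant.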

[LG-112] Latent-Conditioned Parameterized Quantum Circuits as Universal Approximators for Distributions over Quantum States

Link: https://arxiv.org/abs/2605.28690
Authors: Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 16 pages, 11 figures

Abstract:Many applications in quantum simulation, quantum chemistry, and quantum machine learning require not a single quantum state but an ensemble of states characterizing the heterogeneity of a target system. Preparing such ensembles state-by-state is prohibitive in both variational and fault-tolerant settings, motivating a generative-modeling approach. We introduce latent-conditioned parameterized quantum circuits (LPQCs), a hybrid quantum-classical framework in which classical neural networks map a latent variable sampled from a prior distribution to the parameters of a parameterized quantum circuit. We prove that LPQCs are universal approximators for probability measures over density operators in the 1-Wasserstein distance, extending classical universal approximation theorems to the quantum-distribution setting. We additionally introduce a multimodal latent prior and a mixture-of-experts circuit architecture, and show that it empirically alleviates the barren plateau problem during optimization. Numerical experiments validate the framework on a synthetic multi-cluster ensemble of mixed quantum states and on a QM9-derived ensemble of 3-D molecular structures. In these tasks, LPQC outperforms recent quantum generative baselines while remaining competitive with typical classical baselines at substantially reduced output dimensionality. By leveraging classical expressivity in the latent space, LPQCs offer a tractable route to quantum generative modeling.

[LG-113] Implicit Regularization in Perturbed Deep Matrix Factorization: Spectral Conditions and Stability

Link: https://arxiv.org/abs/2605.28613
Authors: Jingzhe Wang, Hung-Hsu Chou
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper studies the stability of low-rank implicit regularization in perturbed deep matrix factorization, where the target matrix is corrupted by a noise matrix. We first derive sufficient spectral conditions under which gradient descent exhibits a low-rank phase in the noiseless setting. These conditions show how the target spectrum, initialization, and step size jointly determine the existence of a nonempty low-rank interval. We then analyze the perturbed gradient descent dynamics, proving convergence guarantees and quantifying how the perturbation affects iteration complexity and eigenvalue recovery. Finally, we show that the low-rank phase persists under perturbation, with explicit dependence on the perturbation size. Numerical experiments support the theoretical findings.

[LG-114] Dark Quest II: A Wide-Coverage Neural Network Emulator of the Nonlinear Matter Power Spectrum Across Extended Cosmologies

Link: https://arxiv.org/abs/2605.28596
Authors: Satoshi Tanaka, Takahiro Nishimichi, Yosuke Kobayashi
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG)
Comments: 53 pages, 44 figures, emulator code available at this https URL

Abstract:DarkEmulator2 is a neural network emulator of the nonlinear matter power spectrum in a nine-dimensional $w_0 w_a \nu o\mathrm{CDM}$ parameter space, developed as the emulator component of the Dark Quest II (DQ2) program. It is trained on simulations generated with the Ginkaku code, whose numerical implementation, accuracy tests, and post-processing pipeline are described in the companion paper. The design follows a unified strategy: in addition to the cosmological parameter vector, we supplement the neural network’s inputs with three families of physically motivated auxiliary quantities – the linear matter power spectrum, descriptors of the simulation resolution, and a low-dimensional summary of the initial Gaussian random field – that are expected to improve generalization across the parameter space. Training a single network jointly across three simulation resolution tiers allows the emulator to exploit a small number of high-resolution simulations while retaining broad coverage from lower-resolution simulations. For a $L_{\mathrm{box}} = 1\,h^{-1}\mathrm{Gpc}$ box with $N = 3000^3$ particles, the emulator reproduces the simulated matter power spectrum to subpercent accuracy up to the particle Nyquist scale, $k_{\mathrm{Ny}} \simeq 10\,h\,\mathrm{Mpc}^{-1}$. The emulator remains accurate over the calibrated wavenumber range, while its highest-$k$ predictions depend on the simulation resolution and shot noise. We validate the emulator on independent test suites and, through a cross-comparison with several public emulators and widely used fitting formulas, characterize the inter-model consistency and the parameter-dependent trends in their residuals.

[LG-115] Conservative neural posterior estimation via distributionally robust training

Link: https://arxiv.org/abs/2605.28516
Authors: William Laplante, Yuga Hikida, Charita Dellaporta, François-Xavier Briol, Ayush Bharti
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Simulation-based inference with neural posterior estimation (NPE) often yields overconfident and unreliable posteriors under limited simulation budgets. To address this, we propose DRO-NPE, a distributionally robust approach that replaces the standard NPE objective with a worst-case loss over a Wasserstein ambiguity set. We introduce KL-based metrics for miscoverage and miscalibration, and use these to show that the DRO-NPE objective controls overfitting and reduces posterior overconfidence. Our method is tractable, parallelisable, and readily integrates with standard normalising flows. Across benchmark SBI tasks, DRO-NPE consistently improves coverage and calibration, while narrowing the gap between empirical and population NPE loss, leading to more reliable inference in low-simulation regimes.

[LG-116] Bridging Maximum Likelihood and Optimal Transport for Efficient Inference and Model Selection in Stochastic Block Models

Link: https://arxiv.org/abs/2605.28488
Authors: Simon Queric, Cédric Vincent-Cuaz, Charles Bouveyron, Marco Corneli
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 10 pages, 8 figures

Abstract:We study inference in stochastic block models (SBMs) through the lens of optimal transport (OT). We first establish that maximum likelihood variational inference (MLVI) can be interpreted as a semi-relaxed Gromov-Wasserstein (srGW) projection with entropic regularization. While this formulation yields accurate clustering, the entropic regularization prevents transport plans from being sparse, hindering intrinsic model selection. Consequently, we investigate unregularized srGW estimators, and prove that they consistently recover both the SBM connectivity matrix and latent cluster assignments in the asymptotic regime. However, this asymptotic property does not translate into reliable model selection in finite samples, and calls for additional mechanisms to promote sparsity in the inferred cluster proportions. We empirically show that such a regularized formulation yields estimators that simultaneously recover model parameters and select the number of clusters in a single optimization problem, thereby avoiding costly grid search or heuristic model selection procedures.

[LG-117] Variance-Adaptive Optimal Algorithm for Reinforcement Learning with Multinomial Logit Function Approximation

Link: https://arxiv.org/abs/2605.28364
Authors: Wonyoung Kim, Min-Hwan Oh, Garud Iyengar, Assaf Zeevi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Reinforcement learning with multinomial logistic (MNL) function approximation has become an important framework due to its flexibility and broad applicability. While existing studies have established regret guarantees under worst-case analysis, they do not capture how performance depends on the variability of the interaction between the learner and the environment. In this paper, we develop a new theoretical analysis for MNL-based Markov decision processes that yields explicit variance-adaptive regret bounds. Our algorithm is computationally efficient and achieves the instance-wise optimal rate of regret, narrowing the gap between upper and lower bounds. Our numerical experiments validate that our method learns optimal policies more efficiently than conventional approaches.

[LG-118] Decision-focused learning for optimal PV-Battery scheduling

Link: https://arxiv.org/abs/2605.28340
Authors: Joris Depoortere, Hussain Kazmi, Johan Driesen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:The use of residential photovoltaics has increased dramatically in recent years. With battery systems becoming more affordable, the optimal operation of a photovoltaic-battery system can bring significant savings to households. Optimal control requires correct forecasts of underlying parameters, such as photovoltaic power generation, to schedule the battery. While forecasting models have become increasingly accurate due to algorithmic advances and data availability, accuracy is typically measured in generic metrics which might not align with the downstream application. This study proposes a decision-focused learning framework that integrates optimization and prediction by training a Long Short-Term Memory photovoltaic energy forecaster on the downstream optimal scheduling of a battery system. The proposed methodology is compared against a standard two-phase approach. Across a 14-month evaluation period, the decision-focused method reduced average electricity costs across twenty buildings by 3.6% when normalized against performance bounds defined by a perfect forecast and a baseline of no optimization. Critically, this financial improvement was achieved despite the model exhibiting a root mean squared error of 19.9%, significantly higher than the decoupled model’s 8.2%. Warm-starting the decision-focused model further improves results, lowering average cost by approximately 8%, while also mitigating the negative impact on statistical accuracy (root mean squared error of 13.7%). The findings are statistically significant at the 0.001 level across the twenty households and for each household individually. These results demonstrate that aligning forecast models with optimization goals is key for achieving cost advantages in PV-battery systems. Future research should replicate these findings on other datasets, alternate forecasting models and alternate optimization algorithms.

[LG-119] Insurance Pricing Optimization via Off-Policy Evaluation

Link: https://arxiv.org/abs/2605.28327
Authors: Sascha Günther, Dimitri Semenovich, Mario V. Wüthrich
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Risk Management (q-fin.RM); Applications (stat.AP)

Abstract:Traditional insurance pricing relies on risk-based principles that ensure actuarial fairness and solvency but do not explicitly account for policyholders’ price sensitivity. We formulate insurance pricing as a decision-making problem and study it using tools from off-policy evaluation and stochastic control. We propose a kernelized inverse propensity score estimator that exploits local structure in the action space and yields variance reduction compared to the classical inverse propensity score estimator. Building on these value estimates, we investigate policy optimization and present two practical approaches for computing optimal pricing rules: an interpretable data-shared Lasso formulation and a flexible policy parameterization based on neural networks. Using a controlled synthetic travel insurance environment, we empirically confirm the theoretical results and show that neural networks outperform existing techniques for policy optimization.
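
As a rough illustration of kernelized inverse propensity scoring for a continuous action such as a price, the sketch below uses one common form from the off-policy evaluation literature: each logged reward is weighted by a kernel centered on the target policy's action and divided by the logging density. This is an assumption about the general shape of such estimators, not the paper's exact construction; `kernelized_ips`, the Gaussian kernel, and the log layout `(context, action, reward, logging_density)` are all hypothetical choices.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernelized_ips(logs, policy, bandwidth):
    """Estimate the value of a deterministic pricing policy from logged data.

    logs: iterable of (context, action, reward, logging_density), where
    logging_density is the density of the logged action given the context.
    Logged actions close to policy(context) receive most of the weight.
    """
    total = 0.0
    count = 0
    for context, action, reward, density in logs:
        u = (policy(context) - action) / bandwidth
        total += gaussian_kernel(u) / (bandwidth * density) * reward
        count += 1
    return total / count


# Toy log: prices spread uniformly over [0, 2] (density 0.5), reward equal to the price.
toy_logs = [(None, i * 0.001, i * 0.001, 0.5) for i in range(2001)]
print(kernelized_ips(toy_logs, lambda context: 1.0, bandwidth=0.1))
```

The bandwidth trades bias for variance: a smaller value tracks the target action more closely but relies on fewer logged samples, which is where the variance reduction over exact-match inverse propensity scoring comes from.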

[LG-120] Counterfactually Fair Regression via Optimal Transport

Link: https://arxiv.org/abs/2605.28251
Authors: M. Generali Lince, S. Gaucher, J-J. Vie, P. Loiseau
Categories: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:We consider the problem of learning a counterfactually fair regressor. We adopt a causal uncertainty view in which counterfactual fairness is defined with resampled noise. We focus on obtaining theoretical fairness guarantees for a new post-processing estimator. We begin by showing that counterfactual fairness is equivalent to satisfying demographic parity conditional on the latent variable. This allows us to provide a closed-form expression of the optimal fair regressor via a barycentric quantile map. In order to handle continuous latent variables, we propose a discretized post-processing method. Then, under mild regularity assumptions, we prove high-probability finite-sample fairness guarantees for our estimator, providing an unfairness decay at rate $\tilde{O}(n^{-1/3})$, and establishing a matching risk bound of order $\tilde{O}(n^{-1/3})$. We provide a matching lower bound on the excess risk of almost fair predictions. Finally, we extend our results to the setting of relaxed counterfactual fairness. We validate our approach on real-world and synthetic data.
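
The barycentric quantile map can be sketched for a single bin of the discretized latent variable. The code below assumes the classical one-dimensional Wasserstein-barycenter post-processing for demographic parity: a score is sent to its within-group quantile level and then to the weighted average of all groups' quantiles at that level. Whether the paper's estimator takes exactly this form is an assumption; `barycentric_map` and the helper names are hypothetical.

```python
import math
from bisect import bisect_right


def empirical_cdf(sorted_vals, x):
    """Fraction of values that are <= x."""
    return bisect_right(sorted_vals, x) / len(sorted_vals)


def empirical_quantile(sorted_vals, q):
    """Smallest sample value whose empirical CDF is at least q."""
    idx = math.ceil(q * len(sorted_vals)) - 1
    idx = max(0, min(len(sorted_vals) - 1, idx))
    return sorted_vals[idx]


def barycentric_map(scores_by_group):
    """Build a post-processing map from unfair scores to fair scores.

    scores_by_group: dict mapping a sensitive-attribute value to the list of
    base-regressor scores observed for that group (within one latent bin).
    """
    n_total = sum(len(v) for v in scores_by_group.values())
    sorted_by_group = {g: sorted(v) for g, v in scores_by_group.items()}
    weights = {g: len(v) / n_total for g, v in scores_by_group.items()}

    def fair_score(score, group):
        q = empirical_cdf(sorted_by_group[group], score)
        return sum(
            weights[g] * empirical_quantile(sorted_by_group[g], q)
            for g in sorted_by_group
        )

    return fair_score


scores = {"A": [1.0, 2.0, 3.0, 4.0], "B": [11.0, 12.0, 13.0, 14.0]}
fair = barycentric_map(scores)
print([fair(s, "A") for s in scores["A"]])
print([fair(s, "B") for s in scores["B"]])
```

In this toy example both groups end up with the same distribution of post-processed scores while the ordering within each group is preserved, which is the demographic-parity property the map is meant to enforce within a bin.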

[LG-121] Geometry of Relaxed Fair Regression: A Unified Framework for Aware and Unaware Settings

Link: https://arxiv.org/abs/2605.28233
Authors: M. Generali Lince, V. Divol, R. Flamary, S. Gaucher, P. Loiseau
Categories: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)

Abstract:Fairness-accuracy trade-offs are a central concern in the deployment of fairness-aware machine learning methods. When sensitive attributes are unavailable at inference time (the so-called unawareness setting), principled methods for obtaining accurate predictions under relaxed fairness constraints are largely missing. In this work, we address this gap by formulating regression under a demographic parity penalty as an optimal transport problem. Our framework unifies both the aware and unaware settings and characterizes optimal prediction functions via optimal transport maps, under both squared Wasserstein-2 and Total Variation penalties. These results reveal that the choice of penalty reflects fundamentally different fairness philosophies: the Wasserstein penalty induces a smooth, population-wide compromise, while Total Variation enforces exact parity for a subset of individuals. Building on these theoretical characterizations, we propose an algorithm that is simple to implement, computationally efficient, and consistently matches or outperforms state-of-the-art baselines on real-world benchmarks.

[LG-122] Learning Logical Operations for Arbitrary Quantum Error Correction Codes

Link: https://arxiv.org/abs/2605.28162
Authors: Nico Meyer, Christopher Mutschler, Dominik Seuß, Andreas Maier, Daniel D. Scherer
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 23 pages, 12 figures, 5 tables

Abstract:Logical operations are essential for quantum computation within quantum error-correcting codes. However, discovering their physical realizations is challenging, especially for non-additive codes that lack a stabilizer description. We present a general learning-based framework that, given only an encoding circuit, constructs physical implementations of logical operations while enforcing structural properties such as transversality or shallow depth. Our approach is validated by rediscovering known logical operations of standard stabilizer codes. We then extend it to a co-design procedure, dubbed variational early fault-tolerant quantum computing (VarEFTQC), which tailors non-additive encodings to a given noise model and enforces desired logical gate sets, such as transversal IQP-type families or low-depth universal sets. A software library implements the complete learning pipeline, including loss-function variants, ansatz families, and optimization routines. Together, these results position VarEFTQC as a practical tool for discovering hardware-adapted logical gadgets for early fault-tolerant quantum computing.

[LG-123] Skillful high-resolution weather forecasting independent of physical models

Link: https://arxiv.org/abs/2605.28153
Authors: Pengcheng Zhao, Siqi Xiang, Weixin Jin, Zekun Ni, Jiang Bian, Zuliang Fang, Hongyu Sun, Bin Zhang, Richard E. Turner, Jonathan Weyn, Haiyu Dong, Kit Thambiratnam, Qi Zhang
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 26 pages, 10 figures

Abstract:Accurate and timely weather forecasts are critical for high-impact decisions in modern society. Machine-learning-based weather prediction is emerging as an alternative for producing initial conditions, forecasts, and even both in end-to-end systems. These methods deliver predictions faster and often with higher skill than traditional numerical weather prediction (NWP). However, even end-to-end models typically rely on NWP-generated reanalyses for supervision, thereby inheriting the biases and resolution limitations of those NWPs, and limiting adaptation to settings where suitable reanalysis products are unavailable, infrequently updated, or expensive to produce. Here we introduce ObsCast, a regional system that generates both analysis and predictions, without using any NWP-derived data in either training or inference, while still achieving state-of-the-art performance in short-term high-resolution regional modeling. Over the contiguous United States and Europe, ObsCast outperforms operational NWP for near-surface variables through 18 h and produces skillful precipitation forecasts. It provides a simpler and more adaptable route to build and refine regional forecasting services directly from local observations, without the need to develop complex and costly traditional forecasting pipelines.

[LG-124] Deep Neural Network Training as Random Effects: An Optimization-Inference Duality

Link: https://arxiv.org/abs/2605.27991
Authors: Minhao Yao, Ruoyu Wang, Xihong Lin, Lin Liu, Zhonghua Liu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Deep neural networks (DNNs) have achieved remarkable empirical success, yet their training dynamics remain understood mainly from optimization rather than statistical principles. Here we develop a statistical framework for DNN training in the over-parameterized regime by showing that the prediction induced by continuous-time neural tangent kernel (NTK) gradient flow is exactly equivalent to that from a classical random-effects model. In this framework, training time acts as a variance component, or equivalently an empirical Bayes covariance hyperparameter, governing the allocation of variation from noise to structured signal. This equivalence reveals an optimization-inference duality: the gradient-flow path is both an optimization trajectory and an empirical Bayes random-effects inference path. Conditional on training time, the network output is the posterior mean of the latent signal, and estimating training time by restricted maximum likelihood (REML) turns early stopping into likelihood-based empirical Bayes inference rather than external tuning. This perspective yields a two-stage inferential procedure. First, a variance-component test determines whether DNN training captures statistically significant structure beyond initialization. Second, conditional on training being warranted, REML provides a likelihood-based early stopping rule. The resulting stopping time admits a spectral interpretation in the NTK eigenbasis, where training proceeds until spectral loss decorrelation is achieved. We further establish that REML-guided early stopping achieves asymptotically optimal prediction error for fixed-design in-sample prediction and, under additional random-design regularity conditions, for out-of-sample prediction. This work reframes DNN training as statistical inference and provides a principled foundation for deciding whether and how long to train deep neural networks.

[LG-125] Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

Link: https://arxiv.org/abs/2605.27946
Authors: Yibo Jacky Zhang, Zeyu Tang, Sanmi Koyejo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Backpropagation is the default learning rule for artificial neural networks and is often treated as the settled approach whenever differentiability is available. In this work, we revisit this convention through a theoretical lens of sample efficiency. We introduce a unified vectorized feedback framework for loss-based and reward-based learning on computational graphs, in which synthetic gradients emerge as a natural alternative to backpropagation. We characterize the conditions under which synthetic gradients can achieve a lower gradient-estimation mean squared error than backpropagation. We construct examples illustrating that this sample efficiency advantage can be arbitrarily large. Experiments on contextual bandits and reinforcement learning tasks demonstrate the potential of our theoretical findings.

[LG-126] Quantum principal component analysis without eigenvector recovery

Link: https://arxiv.org/abs/2605.27942
Authors: Yewei Yuan, Michele Minervini, Mark M. Wilde, Nana Liu
Categories: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)

Abstract:Principal component analysis (PCA) is traditionally implemented through a covariance or kernel matrix, leading-eigenvector extraction, and hard rank-$k$ projection. These steps can be computationally costly in high-dimensional and quantum-data settings, sensitive to small eigengaps, and unnecessary when downstream tasks only require principal-subspace scores. Such score-based objectives are important in applications such as anomaly detection, spectral-energy profiling, and other postselection tasks. To address these needs, we introduce a measurement-based soft PCA framework replacing the hard top-$k$ projector with an entropy-regularized Fermi–Dirac filter. This filter is the unique optimizer of an entropy-regularized variational formulation of PCA and converges to the classical PCA projector in the zero-temperature limit. This filter has a direct interpretation as a quantum measurement, which naturally suggests a quantum approach. For centered covariance operators represented by quantum feature states, a single fixed circuit, together with threshold calibration, accesses all optimal filters for different rank budgets or retained-variance levels without rank-dependent circuit updates or eigenvector recovery. For new inputs, the same calibrated quantum circuit yields soft principal subspace scores, spectral energy profiles, and postselected filtered states. The required centering of both training and test data is performed coherently inside the quantum protocol, which is particularly important for quantum data where no classical feature vectors or centered Gram matrix are directly available. By reframing PCA as a calibrated measurement task, this framework bypasses the need for iterative eigenvector extraction and achieves a dimension-independent sample complexity $O(\eta^{-2})$ for normalized fractional-rank or retained variance scoring at additive accuracy $\eta$.
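
A classical toy version of the soft filter may help fix ideas. The sketch below assumes the standard Fermi–Dirac form $f(\lambda) = 1/(1 + e^{(\mu - \lambda)/T})$ applied to a list of eigenvalues, with the threshold $\mu$ calibrated by bisection so that the weights sum to the rank budget. The exact parameterization in the paper is not given in the abstract, so the functional form, the bisection calibration, and the names `fermi_dirac`, `calibrate_threshold`, and `soft_pca_weights` are all assumptions for illustration. It operates on an explicit spectrum and says nothing about the quantum circuit itself.

```python
import math


def fermi_dirac(lam, mu, temperature):
    """Soft indicator of lam > mu; approaches a hard step as temperature -> 0."""
    z = (mu - lam) / temperature
    if z > 700.0:   # avoid overflow in exp for strongly suppressed eigenvalues
        return 0.0
    if z < -700.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(z))


def calibrate_threshold(eigenvalues, rank_budget, temperature, iters=200):
    """Bisection for mu such that the filter weights sum to rank_budget."""
    lo = min(eigenvalues) - 50.0 * temperature
    hi = max(eigenvalues) + 50.0 * temperature
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        occupancy = sum(fermi_dirac(lam, mid, temperature) for lam in eigenvalues)
        if occupancy > rank_budget:
            lo = mid  # too much weight retained, so raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


def soft_pca_weights(eigenvalues, rank_budget, temperature):
    """Per-eigenvalue filter weights for a given rank budget."""
    mu = calibrate_threshold(eigenvalues, rank_budget, temperature)
    return [fermi_dirac(lam, mu, temperature) for lam in eigenvalues]


spectrum = [5.0, 3.0, 1.0, 0.5]
print(soft_pca_weights(spectrum, rank_budget=2, temperature=1.0))   # soft weights
print(soft_pca_weights(spectrum, rank_budget=2, temperature=0.01))  # near hard top-2
```

Lowering the temperature pushes the weights toward 0 or 1, which mirrors the abstract's statement that the filter converges to the classical PCA projector in the zero-temperature limit.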

[LG-127] Machine learning enables experimental access to photon-by-photon arrival times in scintillation detectors

Link: https://arxiv.org/abs/2605.27937
Authors: Yuya Onishi, Ryosuke Ota, Fumio Hashimoto, Kibo Ote, Go Akamatsu, Hideaki Tashima, Taiga Yamaya
Categories: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Medical Physics (physics.med-ph)

Abstract:Scintillation detectors with excellent timing resolution enable more precise localization of radiation sources in positron emission tomography, leading to substantial improvements in diagnostic capability for diseases such as cancer and dementia. At the extreme timing precision required for such applications at the picosecond scale, detector performance is governed by the microscopic dynamics of scintillation photons generated within the detector and their subsequent detection processes. However, detector signals have conventionally been treated only as collective responses of many photons due to structural constraints inherent to photodetectors. In this study, we overcome this fundamental limitation using deep learning, enabling direct access to the timing information of individual photons. The proposed method estimates photon-by-photon arrival times directly from detector waveforms without requiring any modification to the detector structure; the method operates on an event-by-event basis without ground-truth labels by integrating an unsupervised learning framework with a physically informed detector-response model. Through comprehensive validation combining Monte Carlo simulation and experimental measurements across various detector configurations, we experimentally demonstrate improved timing resolution, visualized depth-of-interaction-dependent photon transport, and classified Cherenkov and scintillation photons based on the estimated photon-level timing information using a unified deep learning-based framework. These results provide experimental access to photon dynamics, bridging the gap between theoretical modeling and experimental observation, and they open a new data-driven pathway for discovery in detector physics and optimization.

[LG-128] Exploratory Experience Shapes the Geometry of Predictive Representations

Link: https://arxiv.org/abs/2605.27929
Authors: Kseniia Shilova, Abdelrahman Sharafeldin, Advay Balakrishnan, Hannah Choi
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Active sensing links behavior and learning through an action-perception loop: actions determine the observations used to update internal predictive models of perception, which subsequently guide the next actions. Predictive-coding frameworks provide a natural way to model this process, since internal representations are continuously updated to predict future observations. Here, we ask how exploratory and exploitative behavioral strategies shape these internal predictive representations. We build an online learning agent in a tree-like maze with a controllable parameter regulating the balance between exploratory and exploitative regimes. The agent updates a predictive-coding-based perception model from experience generated by its own behavior. The model predicts both future maze states and reward probability, allowing the agent to select actions either by expected information gain during exploration or by predicted reward during exploitation. We show that the resulting internal predictive representations depend strongly on the agent’s behavioral regime. Exploratory agents develop representations that are more spatially organized and better preserve the structure of maze transitions in latent space. In contrast, exploitative agents learn less organized representations. We then train this predictive model on natural trajectories of water-deprived mice navigating the same maze and compare the resulting representations with those learned from agent trajectories. More exploratory mice show representational geometries that closely match those of exploratory agents, whereas mice with more restricted visitation patterns resemble reward-driven, exploitative agents. Together, these findings suggest that exploration enables predictive models to form generalized internal representations by organizing latent space around both spatial location and transition context in artificial agents and animals.

[LG-129] Learning to target with network interference

Link: https://arxiv.org/abs/2605.27794
Authors: Xiaomeng Wang, Hamsa Bastani, Osbert Bastani, Zhimei Ren
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:This paper studies adaptive targeting under network interference in a bandit setting, where treatments applied to one individual may affect others through spillover effects. We consider a linear model in a sparse regime, where each individual’s outcome can be affected by at most a few others. We first establish a regret lower bound showing that ignoring the network structure and reducing the problem to a standard linear bandit inevitably leads to inefficient learning, particularly in large populations. To understand how structural information can be leveraged, we analyze regimes with varying levels of knowledge of the interference structure: (1) full support knowledge, (2) knowledge of the column support sizes, and (3) no prior knowledge. For each regime, we establish regret lower bounds characterizing the fundamental limits of learning, and develop algorithms that achieve near-optimal regret. Together, our results provide a unified view of how knowledge of the interference structure governs the efficiency of online learning under interference, and offer practical adaptive targeting algorithms in each setting. Numerical experiments on synthetic and real-world data demonstrate the practical benefits of our algorithms.

[LG-130] Sparse POD Mode Selection and Manifold Dimensionality Reduction with Neural Networks

Link: https://arxiv.org/abs/2605.27756
Authors: Tomoki Koike, Prakash Mohan, Marc T. Henry de Frahan, Elizabeth Qian, Julie Bessac
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA)

Abstract:High-performance computing enables simulation of high-dimensional physical systems, but downstream analyses such as inverse problems and control remain computationally expensive, motivating model order reduction (MOR) to construct efficient low-dimensional surrogates. Proper Orthogonal Decomposition (POD), a widely adopted data-driven MOR method, projects dynamics onto linear subspaces spanned by the most energetic modes. However, POD struggles for problems with slowly decaying Kolmogorov $n$-widths, such as advection-dominated and turbulent flows, requiring many modes for accurate reconstruction. Moreover, energy-based selection can discard crucial low-energy modes needed to capture small-scale features. Recent nonlinear manifold methods using polynomial mappings with alternating or greedy mode selection achieve better reconstruction with fewer modes. However, these methods fix the nonlinear mapping form a priori, limiting expressivity. Conversely, neural network (NN) manifolds offer greater expressivity but employ energy-based selection. We present SparseModesNet, a dimensionality reduction framework that employs linear encoding via POD modes and nonlinear NN decoding. The decoder leverages LassoNet, a method enforcing hierarchical sparsity through residual connections with linear skip layers, to simultaneously select informative POD modes and learn a nonlinear mapping that minimizes reconstruction error. On benchmark advection-dominated and chaotic flows, SparseModesNet matches or exceeds state-of-the-art performance. For turbulent channel flow at friction Reynolds number $Re_\tau = 5200$, we reduce reconstruction error by 51–78% compared to existing polynomial manifold methods while maintaining interpretability through physically meaningful mode selection.

[LG-131] Soft Specialists: α-Rényi Ensembles for Uncertainty-Aware LLM Post-Training

Link: https://arxiv.org/abs/2605.27747
Authors: Paula Cordero-Encinar, Georgy Tyukin, Andrew B. Duncan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)

Abstract:Existing training approaches for large language models learn a single set of parameters from large volumes of data that are typically heterogeneous, conflicting, and often outright contradictory. As a result, the model is forced to compress conflicting goals and inherent uncertainties into a single, averaged pattern of behaviour. We propose an $\alpha$-Rényi variational framework for learning distributions over post-training parameters, offering an uncertainty-aware alternative to deep ensemble approaches. The resulting variational objective interpolates between classical variational Bayes and predictively oriented posterior learning, balancing globally plausible individual models against systems of complementary specialists. We identify local stability criteria, demonstrating how model misspecification can make non-degenerate posterior spread locally favourable, manifesting contradictory or conflicting data as epistemic uncertainty. We apply our framework to LLM post-training, learning an ensemble of LoRA adapters attached to a shared, frozen base model, providing a scalable training procedure for both supervised fine-tuning and preference optimisation. Our approach enables training examples to be softly routed across ensemble members, promoting model specialisation and providing actionable uncertainty estimates across different tasks.

[LG-132] CFDTwin: An open-source GUI and Python toolkit for POD-NN surrogate modeling of ANSYS Fluent simulations

Link: https://arxiv.org/abs/2605.27725
Authors: Daniel Curl, Han Hu
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 9 pages, 3 figures

Abstract:High-fidelity computational fluid dynamics (CFD) is widely used for thermal-fluid design, but repeated CFD solves remain expensive for design optimization, uncertainty analysis, and digital-twin workflows. Recently, our team has demonstrated that a proper orthogonal decomposition and neural-network (POD-NN) surrogate can predict two-dimensional thermal fields in an electronics-cooling cold plate with large inference speedups while preserving physically interpretable modal structure. Reproducing and extending such workflows, however, typically requires custom scripts for parameter sampling, Fluent automation, data extraction, reduced-order model construction, neural-network training, validation, and prediction. This paper introduces CFDTwin, an open-source Python package and optional desktop graphical user interface (GUI) that packages these steps into a reusable workflow for ANSYS Fluent simulations. CFDTwin allows users to define simulation inputs and output quantities, generate design-of-experiments samples, run and resume Fluent batch simulations, train POD-NN surrogate models for scalar, surface-field, and cell-zone outputs, inspect validation metrics, and evaluate trained models at new design points without re-running Fluent. The same workflow is exposed through a scriptable Python API and a GUI, supporting reproducible studies, user-facing model validation, and automated design exploration. CFDTwin extends the prior POD-NN modeling study from a case-specific research implementation to a reusable research-software platform for CFD surrogate modeling and digital-twin development.

[LG-133] Robust Moment-Based Estimation via Spectral Gradient Reweighting

Link: https://arxiv.org/abs/2605.27718
Authors: Liu Zhang, Amit Singer
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract: Moment-based estimation is a theoretically attractive approach to parametric inference, especially when likelihood-based estimation is unavailable, misspecified, or computationally inconvenient. However, the moment equations involve sample averages, which makes moment-based estimation sensitive to outliers. We propose the SGR-GMM algorithm, a robust generalized method of moments (GMM) procedure that uses a spectral gradient reweighting (SGR) primitive to soft-reweight the per-observation gradients during the moment-matching optimization. Our analysis has three layers. First, for a fixed center, the SGR primitive is formulated as an entropy-regularized spectral game between a sample-weight player and a density-matrix player, which is analyzed using classical multiplicative-weights and matrix-multiplicative-weights regret bounds. Second, we establish an explicit convergence radius and a finite termination bound for the fixed-center updates in the SGR primitive. Third, we prove a local finite-sample parameter estimation error bound with explicit dependence on the contamination fraction, inlier gradient stability, local GMM identification strength, and optimization accuracy. We further specialize the SGR-GMM algorithm to obtain a robust diagonally-weighted GMM (DGMM) estimator for estimating heteroscedastic low-rank Gaussian mixtures observed under additive Gaussian noise and strong contamination. In the numerical experiments, the SGR primitive produces near-oracle gradient estimation and the robust DGMM specialization substantially improves over non-robust moment baselines. The code and data are available at this https URL.

[LG-134] Unsupervised Identification and Removal of Spurious Correlations During Fine-Tuning

Link: https://arxiv.org/abs/2605.27676
Authors: Ciarán M. Gilligan-Lee, Joseph Egan, Yuchen Zhu, Michael O’Riordan
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 10 + 4 pages, comments welcome

Abstract: Fine-tuning a pretrained language model on a curated dataset can produce spurious correlations between the fine-tuning task and unintended latent factors – such as misaligned personas or political slant – that the curation procedure has entangled with the task. The model can latch onto these spurious correlations, leading to bias and reduced out-of-distribution generalisation. We prove that under reasonable assumptions on task complexity and the spurious correlation, such latent factors can be identified, without supervision, from the weights of a naive LoRA fine-tune. Existing approaches to removing bias, such as activation steering, remove identified factors from residual-stream activations, either at inference or during training. We argue, however, that the goal should be to remove the spurious correlation, not the latent factor itself, as the pretrained model may rely on it for genuine task signal. To enable this, we propose GRASP, GRadient projection of Associated Spurious Patterns, which prevents the model from acquiring new reliance on the identified latent factor while preserving any pretrained content along it. We validate on three fine-tuning tasks. The first two involve emergent misalignment, where fine-tuning on a narrow task – in our case, writing insecure code and giving bad medical advice – leads to misaligned responses on unrelated topics. Here our method completely removes misalignment in the insecure code case and reduces it by ~5x in the bad medical advice case, beating all baselines in the trade-off between misalignment-reduction and task-preservation. The last is a novel political-bias experiment, where fine-tuning on right-skewed Reddit financial-advice data causes political-lean drift on unrelated topics. Here our method reduces drift by more than half, while improving financial task performance, beating all baselines.

[LG-135] Evolving and Detecting Multi-Turn Deception using Geometric Signatures

Link: https://arxiv.org/abs/2605.27671
Authors: Surender Suresh Kumar, Mary L. Cummings
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Safety defenses for large language models (LLMs) are typically trained and evaluated on single-turn prompts, yet real attacks often unfold as indirect, multi-turn probing. To defend against this more nuanced form of deception, we present a unified pipeline that generates realistic multi-turn deceptive question sets via multi-objective genetic prompt optimization with co-evolving mutation operators. We validate this dataset through a human study, which also revealed that early generations yielded the most convincing deception and surfaced practical constraints such as adherence filtering and ordering effects. Using this data, we were able to detect deceptive attempts to access prohibited information using simple, explainable geometric signals in embedding space coupled with a lightweight feed-forward classifier. Three geometric features (angular coverage, distance ratio, and linearity) augmented with pairwise similarity statistics led to a compact predictive model that achieved consistently high recall (0.89) across base, reworded, and truncated (three-turn) scenarios, with test-time F1 ranging from 0.74 to 0.86. The results support a central hypothesis that multi-turn deceptive intent leaves a stable geometric footprint that enables lightweight, transparent screening without expensive end-to-end training. We further discuss responsible uses, limitations, and paths toward larger, more diverse human-evaluated datasets. The primary contribution to artificial intelligence is the multi-objective evolutionary framework for prompt generation, and the engineering application is the deployment of a lightweight geometric detection system for LLM safety infrastructure.

[LG-136] Accelerating Reinforcement Learning Training Using Simulation Surrogate Models

Link: https://arxiv.org/abs/2605.27556
Authors: Mohammadmahdi Ghasemloo, David J. Eckman, Yaxian Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:High-fidelity simulation models are widely used to analyze complex stochastic systems, but their high computational cost motivates the development of cheaper surrogate models that approximate the simulation model’s input-output relationship. In parallel, reinforcement learning (RL) has emerged as a powerful framework for making online decisions in stochastic environments, with increasing attention being given to the use of simulation models as training environments for RL models. We investigate a class of surrogate models suitable for accelerating RL training in settings where the reward structure, model parameters, or system dynamics change over time and explore their interactions with simulation models and RL models. Through numerical experiments on a stochastic service system modeled via discrete-event simulation, we demonstrate that leveraging surrogate models can substantially accelerate RL training and re-training.

[LG-137] Probabilistic Data-Driven Modelling of Astrophysical Transients: The Neural Process Family for Ultrafast and Class-Agnostic Light Curve Reconstruction with NightLANP

Link: https://arxiv.org/abs/2605.27527
Authors: Siddharth Chaini, Federica B. Bianco, Ashish Mahabal
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

Abstract: Astrophysical observations taken from Earth are subject to weather, environmental, and scientific constraints that lead to sparse, irregular light curves. On the eve of the Vera C. Rubin Observatory Legacy Survey of Space and Time, its massive dataset offers unprecedented opportunities for transient science. Yet, a key challenge remains its cadence, which will be sparse and irregular across six bands, limiting scientific inference. Interpolating light curves helps mitigate this, with Gaussian Processes being the standard, but they struggle with cross-band correlations, require an a priori kernel specification, and must be fit to each light curve individually and hence scale poorly. Here, we introduce the neural process family for light curve reconstruction, combining the probabilistic framework of Gaussian Processes with the scalability of deep learning. By meta-learning on diverse simulated transients, Attentive Neural Processes shift the bulk of the computational cost to training, enabling rapid, amortized inference with a single, class-agnostic model. Evaluated on realistic Rubin cadences across 15 transient classes, Attentive Neural Processes consistently outperform all benchmarks (a suite of Gaussian Processes and neural networks) on every tested metric, spanning regression quality, astrophysical feature recovery, and probabilistic calibration. Our model interpolates all bands simultaneously in microseconds, over four orders of magnitude faster than the next-best neural benchmark and five orders of magnitude faster than Gaussian Processes, making it suitable for the nightly LSST alert stream. Attentive Neural Processes avoid the overconfidence of standard neural networks and the underconfidence of Gaussian Processes, delivering sharp, well-calibrated uncertainties. This work establishes the neural process family as a scalable, probabilistic foundation for real-time transient science in the Rubin era.

[LG-138] Semiparametrically Efficient Inference for Kernel Measures of Noise Heterogeneity

Link: https://arxiv.org/abs/2605.27526
Authors: Jakub Wornbard, Zikai Shen, Dimitri Meunier, Arthur Gretton
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We develop semiparametrically efficient inference for kernel measures of noise heterogeneity in additive noise models. In many applications, the regression function is estimated using flexible machine learning methods. Downstream procedures based on the resulting residuals can then inherit first-stage bias: regression error may induce spurious dependence between covariates and residuals, invalidating the assumptions needed for standard analysis. We construct a novel Hilbert-valued one-step estimator of the kernel covariance operator between covariates and residuals. Our estimator yields bootstrap-calibrated tests for residual independence and goodness of fit in additive noise models, while also providing asymptotically efficient confidence intervals for the kernel dependence measure under noise heterogeneity. The framework extends to settings with additional covariates, enabling inference on distributional heterogeneity of residual noise across treatment groups. Simulations show improved calibration and power relative to naive plug-in residual methods.

[LG-139] Identifiable Bayesian Deep Generative Copulas with Unknown Layer Widths for Data with Arbitrary Marginal Distributions

Link: https://arxiv.org/abs/2605.27523
Authors: Joseph Feldman, Yuqi Gu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Deep generative models offer powerful tools for multivariate data analysis, but their black-box architectures are often unidentified and difficult to interpret. We introduce the Deep Discrete Encoder (DDE) Copula, an identifiable and interpretable generative model for multivariate data with arbitrary marginal distributions. The model places a hierarchical directed network of binary latent variables inside a copula framework, enabling flexible dependence modeling for mixed discrete and continuous data. Estimation is based on rank likelihoods, which decouple marginal modeling from posterior inference on the DDE parameters and avoid specifying the marginal distributions. We establish conditions for identification of the DDE copula parameters, ensuring that layer-specific parameters provide meaningful summaries of multivariate dependence. We also prove quotient-space posterior consistency for continuous margins under the exact rank likelihood and treat the extended rank likelihood for tied or mixed margins as a generalized likelihood, with concentration under an additional contrast condition. For computation, we propose a stochastic expectation-maximization algorithm for maximum a posteriori estimation, together with initialization strategies that improve convergence. To learn network dimension adaptively, we extend Bayesian rank-selection priors to infer layer-specific widths. Simulations show strong finite-sample performance, and a personality-survey analysis reveals interpretable hierarchical latent structure in complex multivariate data.

[LG-140] Triangular-Reference Schrödinger Bridges for Time Series Generation

Link: https://arxiv.org/abs/2605.27478
Authors: Gabriele Bocchi
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:We introduce Triangular-Reference Schrödinger Bridges for Time Series (TR-SBTS), a conservative extension of the SBTS framework in which the Brownian reference is replaced by an intervalwise frozen, possibly degenerate diffusion reference, triangular across a hierarchy of latent volatility levels. The construction is a single entropy projection on the augmented state space, with the variational constraint imposed jointly across time and the latent levels and unfolded hierarchically by the disintegration of relative entropy. The variational core of SBTS is preserved: the entropy minimiser is the h-transform of the reference, and on each frozen interval the optimal dynamics admit a logarithmic-gradient drift formula on the affine leaves of the active covariance directions, valid even when the frozen covariance is rank-deficient. We establish stability of the frozen approximation and convergence of the corresponding regularised kernel estimators. The construction is realised through a finite-dimensional conditioning map assembled from three complementary reductions of the past – a block PCR summary, a reference-aware Mahalanobis kernel on past increments induced by the runtime frozen covariance cumulants, and a past-window WLS drift regressor under the same reference metric – together with a coupled state-covariance bridge step in which each latent level produces a dynamic reference for the level above, summarised by a covariance descriptor; the construction is evaluated on numerical experiments.

[LG-141] Iterative Causal Discovery: Per-Edge Impossibility Certificates, Tier-Aware Oracle Queries, and the 1+K Lower Bound

Link: https://arxiv.org/abs/2605.27477
Authors: Eichi Uehara
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Contains 10 figures and 5 tables

Abstract: Causal-discovery algorithms return a directed graph, yet provide no principled means of distinguishing edge directions identified by the data from those assigned without an identifying assumption. Under the standard Markov and faithfulness conditions, the observational distribution identifies only a Markov equivalence class; orientations within that class are not determined by the joint distribution and cannot be recovered from additional samples alone, but require either a functional restriction or an intervention. We introduce a protocol for observational causal discovery on continuous data that attaches to each candidate edge a discrete impossibility certificate: a RESOLVED code records the identifiability theorem under which the direction was committed, while an IMPOSSIBLE code records the failure mode together with the specific question a domain expert must answer to resolve it. The bivariate cascade is extended with five gated identifiability tiers (LSNM, IGCI, Stein, MDL, and PEIT) that abstain when their precondition test rejects. Two oracle primitives, the meta-hub query and the node-children query, jointly establish an upper bound of 1+K expert interactions sufficient to recover any DAG, where K denotes the number of non-leaf vertices. Under an ideal-oracle assumption, the bound is met exactly on the asia, sachs, child, and alarm benchmarks.
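
The claim that orientations inside a Markov equivalence class cannot be read off the joint distribution can be checked by hand in the simplest case. The sketch below is a textbook linear-Gaussian example, not the paper's protocol, and every function name in it is hypothetical. It builds the covariance implied by X -> Y, derives matching parameters for Y -> X, and shows that both models give the same covariance, hence the same zero-mean Gaussian joint.

```python
import numpy as np


def cov_forward(b, noise_var, var_x=1.0):
    """Covariance of (X, Y) under X -> Y with Y = b * X + e, Var(e) = noise_var."""
    var_y = b**2 * var_x + noise_var
    return np.array([[var_x, b * var_x], [b * var_x, var_y]])


def reverse_parameters(b, noise_var, var_x=1.0):
    """Parameters of the reversed model Y -> X, X = c * Y + e', with the same joint."""
    var_y = b**2 * var_x + noise_var
    c = b * var_x / var_y                 # regression coefficient of X on Y
    rev_noise_var = var_x - c**2 * var_y  # residual variance of X given Y
    return var_y, c, rev_noise_var


def cov_backward(var_y, c, rev_noise_var):
    """Covariance of (X, Y) under Y -> X with X = c * Y + e'."""
    var_x = c**2 * var_y + rev_noise_var
    return np.array([[var_x, c * var_y], [c * var_y, var_y]])


forward = cov_forward(b=0.8, noise_var=0.5)
backward = cov_backward(*reverse_parameters(b=0.8, noise_var=0.5))
same_joint = np.allclose(forward, backward)
```

Because the two covariances agree, no amount of observational data separates the two directions in this setting. This is why the abstract points to functional restrictions or interventions as the only ways to orient such edges.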

[LG-142] Stop Suppressing the Tail: Causal Inference for Extreme Events

Link: https://arxiv.org/abs/2605.27474
Authors: Eichi Uehara
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 22 pages, 6 figures, 13 tables. Keywords: double machine learning, dose-response, heavy tails, extreme value theory, causal inference

Abstract: Estimating how an outcome responds to a continuous treatment (the Average Dose-Response Function, or ADRF) is a core causal-inference primitive. However, when outcomes possess heavy tails, standard robust double machine learning (DML) deliberately suppresses these extremes to stabilize the bulk average. In high-stakes settings, such as financial returns or climate losses, this omitted 1-in-1000 extreme event is the actual target quantity. Furthermore, current methods that read the tail from a model’s residuals suffer from circular dependence, causing tail shape inferences to shift drastically based solely on whether the core estimator is switched between Huber and this http URL research proposes an ADRF estimator that emits a structured tail-shape output alongside the standard point estimate. Its tail diagnostic (PDHTE+JK) evaluates the per-treatment tail shape from the outcome centered by a pilot median, successfully breaking the circular dependence and rendering the diagnostic invariant to the choice of core method. The output encompasses four treatment-conditional quantities: tail shape $\hat{\xi}(t)$, deep-tail return levels $\hat{Q}_\alpha(t)$, conditional shortfalls $\hat{S}_\alpha(t)$, and the recovered mean ADRF, along with an explicit refusal mechanism that declines extrapolation when extreme-value modeling is unsupported by the data. Compared to kernel-weighted quantile regression (QR), the proposed estimator reduces deep-tail ($\alpha = 0.001$) return-level MAE by 11% and conditional-shortfall MAE by 25.5% across a heavy-tailed panel. It also achieves a 20-29% MAE reduction in sample-scarce regimes ($n \le 2000$). On freMTPL2 motor-insurance claims, it successfully triggered an explicit extrapolation refusal on the log-claim scale, which neither QR nor loss-only DML can produce.

[LG-143] Calibrated Inference for the Conditional Average Treatment Effect in the Few-Placebo Regime via Gaussian Processes

Link: https://arxiv.org/abs/2605.27473
Authors: Eichi Uehara
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 14 pages, 1 figure, 5 tables

Abstract: Estimating how much an intervention helps a given individual, the conditional average treatment effect (CATE), is increasingly central to decision-making in medicine, economics, and policy, where an estimate is most useful when accompanied by a calibrated uncertainty interval. We study the few-placebo regime, in which one treatment arm is much smaller than the other, as arises in unequal-allocation trials and small-holdout A/B tests. The standard estimator in this setting is the X-Learner, and a natural way to obtain credible intervals is to make its second stage Bayesian. We show that these intervals under-cover: they contain the true effect less often than their nominal level. We trace this to a structural cause: the X-Learner’s regression target inherits the bias of a nuisance model fitted to the small arm, so the posterior is centered away from the true effect. We also find that the standard remedy, regressing an orthogonal doubly-robust score, is unreliable here, since the regime’s limited overlap leaves the estimator either highly variable or, once stabilized, biased once more. Both consequences reflect a pattern that extends beyond causal inference: a separately estimated variance is attached to a point estimate of a hard-to-learn quantity, and the point estimate’s bias is not captured by that variance. We propose GP-CATE, which models each arm’s outcome surface with a Gaussian process, so the scarce arm’s uncertainty enters the posterior directly rather than as an unmodelled bias. Across synthetic and semi-synthetic benchmarks, GP-CATE attains calibrated coverage where the estimators we compare against, including Causal Forest and BART, do not, at the cost of intervals that are appropriately wide when the data are uninformative.
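
To make concrete the idea of letting the scarce arm's uncertainty enter the posterior directly, here is a minimal sketch under strong assumptions. It is not the authors' GP-CATE implementation. It assumes a one-dimensional covariate, zero-mean GP priors that are independent across arms, an RBF kernel, and fixed hyperparameters; all names and values are hypothetical. Under those assumptions the CATE posterior mean is the difference of the two arm means, and its variance is the sum of the two arm variances.

```python
import numpy as np


def rbf(a, b, length=1.0, var=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / length**2)


def gp_posterior(x_train, y_train, x_test, noise=0.1, length=1.0, var=1.0):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = rbf(x_train, x_train, length, var) + noise**2 * np.eye(len(x_train))
    Ks = rbf(x_train, x_test, length, var)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    variance = var - np.sum(v**2, axis=0)  # prior variance minus explained part
    return mean, np.maximum(variance, 0.0)


def cate_posterior(x1, y1, x0, y0, x_test, **kernel_args):
    """CATE posterior from independent GPs on the treated (1) and control (0) arms."""
    m1, v1 = gp_posterior(x1, y1, x_test, **kernel_args)
    m0, v0 = gp_posterior(x0, y0, x_test, **kernel_args)
    return m1 - m0, v1 + v0  # independence: the variances add


# Hypothetical few-control example: 40 treated points, only 2 control points.
x_treated = np.linspace(-3.0, 3.0, 40)
y_treated = np.sin(x_treated) + 1.0
x_control = np.array([-2.5, -2.0])
y_control = np.sin(x_control)
x_query = np.array([-2.0, 2.5])
cate_mean, cate_var = cate_posterior(x_treated, y_treated, x_control, y_control, x_query)
```

In this toy setup the second query point lies far from both control observations, so its CATE variance is much larger than at the first query point. The interval widens where the small arm is uninformative instead of hiding that gap as bias.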

[LG-144] Zero-shot Quantum Neural Architecture Search

Link: https://arxiv.org/abs/2605.27410
Authors: Tung Dao, Son Tran, Huynh Thi Thanh Binh
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Abstract:Variational Quantum Algorithms (VQAs) are a leading approach to exploiting near-term quantum hardware, leveraging parameterized quantum circuits and classical optimization to achieve advantage. Despite their promise, the practical deployment of VQAs is challenged by the difficulty of designing quantum circuit architectures that balance expressivity, trainability, and hardware constraints. Existing evolutionary-based quantum neural architecture search methods address these challenges but suffer from high computational costs due to repeated training of candidate circuits. In this work, we identify a setting in which the Gram matrix of the Quantum Neural Tangent Kernel converges. Building on this observation, we design a zero-shot surrogate model to estimate candidate performance without full training, significantly accelerating the architecture search process. Using this surrogate, we propose MZeQAS, a Monte Carlo Tree Search (MCTS)-based Zero-Shot Quantum Neural Architecture Search framework for VQAs. By integrating proxy-based performance estimation with MCTS exploration, MZeQAS efficiently discovers high-performing architectures. Experimental results demonstrate that MZeQAS outperforms existing approaches in terms of both search efficiency and solution quality, providing a scalable and effective framework for advancing VQA deployment on noisy intermediate-scale quantum devices.

[LG-145] Neural Quantum Spectral Operator Learning for Solving Partial Differential Equations

Link: https://arxiv.org/abs/2605.27408
Authors: Chanyoung Kim, Myeonghwan Seong, Yujin Kim, Daniel K. Park, Youngjoon Hong
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 31 pages (main 9 pages), 17 figures, 8 tables

Abstract:Partial differential equations (PDEs) are central to modeling physical and engineering systems, but repeatedly solving parametric PDEs remains computationally expensive. Operator learning enables fast surrogate inference, yet typically requires large input-output paired datasets generated by costly high-fidelity PDE solvers. Unsupervised operator learning frameworks alleviate data dependency but remain hindered by computational bottlenecks. To address this, we propose Neural Variational Quantum Linear Solver (NVQLS), the first hybrid quantum-classical operator learning framework leveraging the Legendre–Galerkin weak formulation. We critically resolve the sign ambiguity in VQLS energy minimization, preventing erroneous solution representations. Additionally, we introduce a neural embedding, a novel encoding scheme to map varying forcings and PDE coefficients into parameterized quantum circuit representations. These structural innovations provide theoretical computational complexity advantages under efficient state preparation schemes, while achieving superior accuracy compared to a representative classical baseline. Validations on 1D and 2D parametric PDEs under diverse boundary conditions demonstrate NVQLS’s capability to simultaneously process varying inputs, offering a scalable unsupervised approach to quantum-enhanced operator learning.

[LG-146] Proper Calibeating

Link: https://arxiv.org/abs/2605.26703
Authors: Dean P. Foster, Sergiu Hart
Categories: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract: The classic concept of “calibrated forecasts” and its more recent refinement, “calibeating,” are defined with respect to the standard quadratic scoring rule. We extend these notions to the class of proper scoring rules (for which the best forecast is the true distribution) and define proper-calibration and proper-calibeating by requiring the errors to converge to zero uniformly over all bounded proper scoring rules. We first establish that calibration always implies proper-calibration, whereas calibeating need not imply proper-calibeating. Second, we show how to guarantee proper-calibeating and proper-multicalibeating. Finally, we demonstrate the equivalence between proper-calibration and universal no regret when best replying to forecasts in decision-making under uncertainty.
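
As a quick check of what "proper" means for the quadratic rule the abstract starts from, consider the binary case. The notation here is ours, not the paper's: the outcome is Y ~ Bernoulli(p), the forecast is q in [0, 1], and the score is the loss (q - Y)^2.

```latex
% Expected quadratic (Brier) loss of forecast q when the true probability is p.
\begin{align*}
\mathbb{E}_{Y \sim \mathrm{Bernoulli}(p)}\big[(q - Y)^2\big]
  &= p\,(q - 1)^2 + (1 - p)\,q^2 \\
  &= q^2 - 2pq + p \\
  &= (q - p)^2 + p\,(1 - p).
\end{align*}
```

The first term vanishes only at q = p and the second does not depend on q, so the expected loss is uniquely minimized by forecasting the true probability. This is the defining property of a proper scoring rule, here in its strict form.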

