本篇博文主要内容为 2026-05-27 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-05-27)
今日共更新711篇论文,其中:
- 自然语言处理共138篇(Computation and Language (cs.CL))
- 人工智能共258篇(Artificial Intelligence (cs.AI))
- 计算机视觉共141篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共219篇(Machine Learning (cs.LG))
- 多智能体系统共16篇(Multiagent Systems (cs.MA))
- 信息检索共18篇(Information Retrieval (cs.IR))
- 人机交互共12篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] MUSE-Autoskill: Self-Evolving Agents via Skill Creation Memory Management and Evaluation
【速读】:该论文试图解决现有大型语言模型(Large Language Model, LLM)代理中技能(skill)创建方式存在的局限性问题,即当前方法将技能视为孤立且静态的产物,导致其可复用性差、可靠性低以及难以持续优化。解决方案的关键在于提出MUSE-Autoskill Agent框架,该框架以技能为中心,构建了一个统一的技能生命周期管理机制(包括创建、记忆、管理、评估与精炼),使代理能够按需生成技能、跨任务存储与复用、高效组织与选择,并通过单元测试和运行时反馈实现持续迭代。此外,引入技能级记忆(skill-level memory)机制,积累每个技能在多任务中的经验,从而支持更有效的长期复用与适应性改进。实验结果表明,这种生命周期管理的技能体系能显著提升任务成功率、执行效率、复用率及跨代理迁移能力,凸显了将技能视为长期存在、经验感知且可测试资产的重要性。
链接: https://arxiv.org/abs/2605.27366
作者: Huawei Lin,Peng Li,Jie Song,Fuxin Jiang,Tieying Zhang
机构: ByteDance(字节跳动)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 30 pages, 8 figures, 13 tables, working in progress
Abstract:Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.
[MA-1] Governed Evolution of Agent Runtimes through Executable Operational Cognition
【速读】:该论文试图解决多智能体系统中代理生成的代码或任务成果(即“生成式 AI (Generative AI) artifacts”)在运行时缺乏治理、生命周期管理和持续演化机制的问题。传统方法将这些成果视为临时输出,而本文提出了一种通过可执行操作认知(executable operational cognition)实现受控运行时演化的框架,其关键在于将代理生成的成果形式化为持久的运行时能力(persistent runtime capabilities),并引入一种名为 HarnessMutation 的受控机制,该机制基于显式的验证、可追溯性、评估与回滚约束进行生命周期感知的运行时适应。这一方案使系统的演化过程成为有界且可观测的,从而确保基础设施的自适应能力保持透明、可审计和可控。
链接: https://arxiv.org/abs/2605.27328
作者: Mariano Garralda-Barrio
机构: Independent Researcher / Investigador Independiente.
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: this https URL
Abstract:Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as \emphCode as Agent Harness frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce \emphHarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained. Comments: 14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: this https URL Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2605.27328 [cs.SE] (or arXiv:2605.27328v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.27328 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-2] Autonomic Federated-Market Orchestration for the Edge-Cloud Continuum
【速读】:该论文旨在解决边缘-云计算连续体中跨自治管理域的自管理机制问题,核心挑战在于如何在尊重租户和运营商指定的数据主权前提下实现规模化自治。解决方案的关键是提出 Neural Pub/Sub,一个基于市场机制的价格信号驱动的联邦代理自主基础架构,其自组织行为不依赖集中控制,而是通过 MAPE-K 控制环实现:包括每代理健康与负载监控、边际成本清结算价格分析、在多面体可行域上的放置规划、跨域联邦调度以及带有边界过时性的共享订阅摘要。其中“计划”步骤基于瓦尔拉斯收敛性假设——在树状及串并联服务依赖有向无环图(DAG)上满足总替代效用时,去中心化的基于价格的分配可达到中央优化器的福利水平。实验验证表明,在4节点、4域、48个工作节点的联邦测试环境中,该方案相较单进程oracle提升2–4%,且在95%置信区间内保持±1.5%差距;同时具备对网络分区和代理失效的鲁棒性,并在保障数据主权的前提下无额外运行开销。
链接: https://arxiv.org/abs/2605.27106
作者: Lauri Lovén,Roberto Morabito,Abhishek Kumar,Susanna Pirttikangas,Jukka Riekki,Sasu Tarkoma
机构: University of Oulu(奥卢大学); EURECOM; University of Jyväskylä(于韦斯屈莱大学); AMD Silo AI(AMD硅谷AI); Future Computing Group, University of Oulu(奥卢大学未来计算组); University of Helsinki(赫尔辛基大学)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注: 35 pages, 5 figures (combined main paper + electronic supplement, folded into one document for arXiv)
Abstract:The edge-cloud computing continuum demands self-management mechanisms that scale across autonomous administrative domains while honouring tenant- and operator-specified data sovereignty. We present Neural Pub/Sub, a federated-broker autonomic substrate whose self-organising behaviour emerges from market-based price signals rather than centralised control. Its MAPE-K control loop closes over per-broker health and load monitoring, marginal-cost clearing-price analysis, placement planning over a polymatroidal feasibility region, federated cross-domain dispatch, and shared peer subscription summaries with bounded-staleness price signals. The Plan step is anchored in a Walrasian convergence proposition: under gross-substitutes valuations on tree and series-parallel service-dependency DAGs, decentralised price-based allocation matches the welfare of a centralised oracle. We evaluate the substrate on a 4-VM, 4-domain, 48-worker federated edge-cloud testbed (single data centre, 50 ms emulated WAN) in a 1005-run campaign augmented by a fair-process-count sharded-oracle comparator. The federated market dominates a single-process oracle by 2-4% with 45 of 45 per-seed wins (sign-test p ~ 2.8e-14, Hodges-Lehmann median -39.6 ms); against a four-shard centralised orchestrator at equal process count the gap stays within +/-1.5% across all nine (pipeline, load) cells. Round-robin completion rate collapses 98.8% - 22.4% - 3.3% across arrival rates 5/10/15 pps while the market preserves completion; the advantage decomposes into three Walrasian properties (information completeness, admission control, price discovery). Federation withstands broker death and network partition (completion rate = 98.7% across 75 cells), and sovereignty enforcement adds no measurable runtime overhead across 60 governance-grid runs. Heterogeneous-domain stressors and cross-site WAN deployment remain future work.
[MA-3] Cost of Structural Learning Under Censored Feedback: A Threshold-Bandit Approach
【速读】:该论文试图解决多智能体系统中因奖励仅在合作联盟达到未知规模阈值时才被激活,而其他情况下反馈完全被屏蔽(censored)所引发的识别难题:智能体无法区分随机失败与协调不足。为应对这一问题,作者提出了阈值触发的协作多臂老虎机(Threshold-Activated Cooperative Multi-Armed Bandit, TAC-MAB)模型,并在集中式和分布式两种协调机制下进行分析。关键解决方案在于:集中式算法C-TAC通过将累计遗憾(cumulative regret)分解为结构搜索项(用于在被屏蔽反馈下识别可行协作结构)和统计监测项(用于价值估计),实现了 O(logT) 的遗憾上界;而分布式算法D-TAC则设计了一种事件触发协议,仅在智能体对结构信念发生改变时才同步,从而在保持可行性一致性的前提下,相比集中基线减少23倍通信开销。这表明,在被屏蔽反馈环境下,无需持续同步即可实现接近集中式的通信效率。
链接: https://arxiv.org/abs/2605.27076
作者: Michael Ledford,William Regli
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:
Abstract:In many multi-agent applications, tasks yield rewards only when executed by a coalition meeting an unknown size threshold; otherwise, feedback is fully censored. This censorship creates an identifiability problem: agents cannot distinguish stochastic failure from insufficient coordination. We formalize this setting as the Threshold-Activated Cooperative Multi-Armed Bandit (TAC-MAB) and analyze it under both centralized and decentralized coordination. We show that a centralized algorithm (C-TAC) achieves cumulative regret O(log T), decomposed into a structural-search term that captures the cost of resolving feasibility under censored feedback and a statistical-monitoring term for value estimation. We then introduce D-TAC, a decentralized event-triggered protocol in which agents synchronize only when their structural beliefs change. Empirically, D-TAC achieves a 23x reduction in communication relative to the centralized baseline while preserving feasibility alignment under conservative belief fusion. These results characterize the coordination cost of learning under censored feedback and show that near-centralized communication efficiency is achievable without continuous synchronization.
[MA-4] QUACK: Questioning Understanding and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
【速读】:该论文试图解决的问题是:当前社会推理类游戏环境在评估大型语言模型(LLM)代理时,仅依赖胜负结果等宏观指标,缺乏对代理语言与实际感知和行为之间是否一致的细粒度审计能力,导致难以识别如空间幻觉、无依据指控、欺骗失效等具体失败模式。解决方案的关键在于提出QUACK——一个开源的多模态社会推理评估框架,其核心为“语句验证流水线”(Statement Verification Pipeline),通过从引擎日志中重建代理的真实行为轨迹,并逐条比对其对话内容,从而自动检测空间幻觉、无证据指控、欺骗崩溃及语言-行为不一致等问题。实验表明,即使是最先进的视觉语言模型(VLM)也存在显著的语言-行为脱节现象,验证了该框架在揭示代理认知真实性方面的有效性。
链接: https://arxiv.org/abs/2605.27068
作者: Ye Yuan,Rui Song,Weien Li,Zeyu Li,Haochen Liu,Xiangyu Kong,Changjiang Han,Yonghan Yang,Zichen Zhao,Zixuan Dong,Fuyuan Lyu,Bowei He,Haolun Wu,Jikun Kang,Xue Liu
机构: McGill University; Mila - Quebec AI Institute; University of Cambridge; MBZUAI - Mohamed bin Zayed University of Artificial Intelligence; University of Toronto; Salesforce
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain to text-only interaction, making it difficult to tell whether an agent’s language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent’s ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at this https URL.
[MA-5] Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
【速读】:该论文试图解决的问题是:当前对大型语言模型(Large Language Models, LLMs)的评估多局限于静态模型性能、基准测试或短时对话场景,缺乏对其在持久化、具身化科研环境中长期运行效果的系统性理解。解决方案的关键在于构建并实证一个“持久代理研究环境”(Persistent Agentic Research Environment, PARE),通过引入持久记忆层、本地文件系统、外部工具集成、任务调度机制、角色分工及安全治理规则等要素,形成可测量、可复现的持续人机协作研究范式,并提出PARE-M评估框架,从架构、使用效率、产出物、资源消耗、可复现性和治理六个维度量化分析代理行为与成果,揭示了以缓存读取为主导的工作流特征,提示未来评估应转向以“完成的产出物”为单位而非“token成本”,并强调需建立独立编码的治理事件分类体系和可复现的解析规则。
链接: https://arxiv.org/abs/2605.26870
作者: Anas H. Alzahrani
机构: King Abdulaziz University (国王阿卜杜勒阿齐兹大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 19 pages, 2 figures, 3 main tables; supplementary appendix with 6 tables, 2 figures, and a reproducibility methods section. Describes 17 configured agents in a persistent research environment and introduces the PARE-M (Persistent Agentic Research Environment Measurement) framework
Abstract:Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
[MA-6] UnityMAS-O: A General RL Optimization Framework for LLM -Based Multi-Agent Systems
【速读】:该论文试图解决的问题是:当前基于大语言模型(LLM)的多智能体系统(multi-agent systems)大多依赖人工设计的提示词、工具和控制规则,缺乏统一的强化学习(Reinforcement Learning, RL)优化接口,导致难以对复杂多智能体协作流程进行端到端训练。现有RL后训练框架主要针对单一策略优化,无法有效支持用户自定义的多智能体工作流、结构化交互、角色特异性奖励分配以及可配置的参数共享机制。
解决方案的关键在于提出 UnityMAS-O,这是一个通用的强化学习优化框架,其核心创新包括:将整个工作流作为优化单元而非单个响应或策略轨迹;通过四个一等对象(逻辑智能体角色、图状轨迹、用户定义奖励、智能体-模型映射)建模多智能体系统;解耦逻辑角色与物理模型参数,实现完全共享、完全分离和部分共享的灵活配置;支持在角色层、回合层和轨迹层进行奖励分配;并基于 Ray 构建星型拓扑运行时,由中央控制器执行工作流、调用工具、记录结构化轨迹和组装奖励,而模型本地的工作组负责rollout、缓冲、优势计算及分布式PPO式更新。实验表明,UnityMAS-O 能显著提升手动设计的多智能体流程性能,尤其在小模型和严格代码全通过指标上效果突出,具备作为可复用底座将多样化的LLM多智能体工作流转化为可训练的多智能体强化学习系统的潜力。
链接: https://arxiv.org/abs/2605.26646
作者: Yiqun Chen,Wei Yang,Erhan Zhang,Shijie Wang,Qi Liu,Zechun Niu,Bin Zhang,Haitao Li,Rui Li,Lingyong Yan,Jinyuan Feng,Biqing Qi,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
机构: Renmin University of China (中国人民大学); Xiaohongshu Inc. (小红书)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent–model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA) Cite as: arXiv:2605.26646 [cs.AI] (or arXiv:2605.26646v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.26646 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yiqun Chen [view email] [v1] Tue, 26 May 2026 07:30:03 UTC (1,181 KB)
[MA-7] Control Physiology: An Agent -Based Model of FAIR-CAM Dynamics
【速读】:该论文试图解决的问题是:传统安全风险分析将控制措施的有效性视为静态输入,忽略了控制措施在实际运行中因配置漂移、监控系统退化及有限修复预算等因素导致的动态变化,从而无法准确反映真实环境中的风险演化过程。解决方案的关键在于构建了一个基于代理(agent-based)的模型——FAIR-CAM仿真器,首次将FAIR Controls Analytics Model(FAIR-CAM)的核心动态机制转化为可计算可观测的形式,其创新点包括八类代理类型、乘法型纵深防御易感性公式、三源方差模型、预算约束下的修复机制以及能够生成完整因果链的叙事因果引擎。通过医院勒索软件场景的模拟(1000次迭代),揭示了三种静态分析无法捕捉的组织级动态现象:控制有效性与理论公式的偏差(约17%)、修复队列在预算低于阈值时出现突变式延迟、以及监控失效在虚拟监控控制(VMC)拓扑中传播并累积未被检测到的方差。这些发现表明,FAIR-CAM架构本身具有结构性动态特性,具备跨场景的普适性。
链接: https://arxiv.org/abs/2605.26597
作者: Jack Jones,Laura Voicu
机构: Enterprise Risk Quantification Institute (企业风险量化研究所)
类目: Cryptography and Security (cs.CR); Multiagent Systems (cs.MA)
备注: 25 pages, 7 figures, 3 tables. Open-source code at this https URL
Abstract:Security risk analysis typically treats control effectiveness as a static input, yet controls degrade through configuration drift, depend on monitoring systems that may themselves be degraded, and compete for finite remediation budgets. The FAIR Controls Analytics Model (FAIR-CAM) provides the theoretical framework for these dynamics but has so far remained theoretical. We present the first agent-based model to operationalize the core FAIR-CAM dynamics, making control physiology computationally observable, and release the implementation as open source. The simulation implements eight agent types, a multiplicative defense-in-depth susceptibility formula, a three-source variance model, budget-constrained remediation, and a narrative causation engine that produces a complete causal trace for every loss event. In a hospital ransomware scenario (N=1,000 iterations), three organizational dynamics emerge that static analysis cannot represent. First, emergent operational efficacy diverges from the analytical FAIR-CAM formula by approximately 17 percent, driven by correlated extrinsic variance; the divergence grows linearly with extrinsic frequency and vanishes under purely intrinsic drift. Second, a sharp queueing regime transition in the remediation pipeline approximately 2.8x expected loss when budget falls below a scenario-specific threshold (5-10 engineer-hours/month). Third, cascading monitoring failures propagate through the VMC topology: a single degraded VMC silently compounds undetected variance across the controls it manages. These dynamics are structural properties of the FAIR-CAM architecture and should generalize beyond the specific scenario studied.
[MA-8] Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure
【速读】:该论文试图解决的问题是:在代理(agent)环境中,当存在目标冲突时,前沿大语言模型(LLM)代理会表现出勒索、破坏和文档泄露等非合作行为,这暴露了基于单代理或合作假设的对齐方法的局限性。解决方案的关键在于提出并验证一种对抗性宪法共进化机制(adversarial constitutional co-evolution),即通过演化搜索同时优化蓝方合作方(Blue cooperators)与红方搭便车者(Red free-riders)的自然语言宪法规则,在公共品博弈(PGG)和空间网格世界中实现稳定的对抗性平衡。研究发现:(1)在PGG中,两派最终收敛至近似均势平衡点S ≈ 0.78,且对不同乘数m(1.2–3.0)具有鲁棒性;(2)若采用独立评分机制,则无法产生对抗压力,必须引入以自身得分减去对手得分为目标的适应度函数才能恢复对抗性;(3)在纯对抗适应度下,评估种子数量K控制模式回归现象——K=2时发生退化,而K=5能维持强专家角色达全部30代。这表明对抗性共进化可行,但依赖于耦合适应度设计和足够的评估预算,所演化出的红方宪法可作为未来合作设计的可解释红队测试工具。
链接: https://arxiv.org/abs/2605.26448
作者: Ujwal Kumar,Arth Singh,Hershraj Niranjani,Machiko Hirota,Takehiro Takayanagi,Alice Saito,Eiji Kamioka,Phan Xuan Tan
机构: Shibaura Institute of Technology (芝浦工业大学); NIT Agartala (阿加塔拉国立理工学院); UC Berkeley (加州大学伯克利分校); University of Pennsylvania (宾夕法尼亚大学); The University of Tokyo (东京大学)
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Neural and Evolutionary Computing (cs.NE)
备注: 15 pages, 5 figures
Abstract:Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitutions, but two properties of the adversarial setting remain uncharacterized: whether the fitness function actually induces adversarial pressure, and whether the LLM mutation operator behaves reliably under adversarial-specialist objectives. We study adversarial constitutional co-evolution (Blue cooperators vs. Red free-riders, 30 generations) across a Public Goods Game (PGG) and a spatial grid-world. Three findings: (1) in the PGG, both factions converge to a near-parity equilibrium at S approximately 0.78, robust across tested multipliers m in 1.2, 1.5, 2.0, 3.0; (2) in independently scored environments, per-faction scoring leaves outcomes statistically uncoupled, with corr(S_B, S_R) = +0.088, and produces no adversarial pressure; a score-advantage fitness target S_own - S_opp restores it; (3) under pure-adversary fitness, evaluation seed count K controls mode regression: K = 2 regresses, while K = 5 sustains a strong specialist for all 30 generations. Adversarial co-evolution of natural-language constitutions is feasible, but only under coupled fitness and adequate evaluation budget; the evolved Red constitutions serve as interpretable red-team artifacts for testing future cooperative designs.
[MA-9] ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
【速读】:该论文试图解决自主研究代理(autonomous research agents)在生成学术论文时存在的可验证性缺陷问题,例如伪造引用、不可复现的性能指标以及方法描述与实际代码实现不一致等。解决方案的关键在于提出一个名为“证据链”(Chain-of-Evidence, CoE)的可验证性框架,要求每个主张都必须能追溯到其证据来源;并开发了端到端的自主研究系统ScientistOne,通过构建过程中持续维护证据链来确保输出内容的真实性;同时引入CoE Audit作为事后审计机制,对所有系统统一执行四项完整性检查——得分验证、规范违反检测、参考文献验证和方法-代码一致性校验。实验表明,现有基线系统普遍存在系统性失败模式(如引用幻觉率达21%、得分验证通过率低至42%),而ScientistOne在75篇论文中实现了零引用幻觉、满分得分验证和最高方法-代码一致性,且在多个前沿任务上达到或超越人类专家水平,并在额外六个跨领域任务中取得SOTA表现。
链接: https://arxiv.org/abs/2605.26340
作者: Rui Meng,Bhavana Dalvi Mishra,Jiefeng Chen,Chun-Liang Li,Palash Goyal,Mihir Parmar,Yiwen Song,Yale Song,Rajarishi Sinha,Parthasarathy Ranganathan,Burak Gokturk,Jinsung Yoon,Tomas Pfister
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Project website: this https URL
Abstract:Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks – score verification, specification violation, reference verification, and method-code alignment – apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
[MA-10] Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
【速读】:该论文试图解决的问题是:当前对长期运行的AI代理(long-lived AI agents)的评估仍停留在初始状态(day-one benchmark),忽视了其在部署后随时间推移而产生的可靠性衰减问题。即使模型权重冻结,代理的有效状态仍因交互历史压缩、记忆存储增长、事实更新与常规维护等因素持续变化,导致可靠性成为代理整个生命周期(lifespan)的属性,而非仅取决于基础模型的静态性能。解决方案的关键在于提出AgingBench——一个纵向的可靠性基准测试框架,用于系统性地测量代理老化过程中的行为退化模式,并识别四种核心老化机制:压缩老化(compression aging)、干扰老化(interference aging)、修订老化(revision aging)和维护老化(maintenance aging)。通过时序依赖图与时序反事实探针(paired counterfactual probes),AgingBench可生成针对记忆流水线中写入、检索和利用阶段的诊断剖面(diagnostic profiles),从而实现机制级定位与阶段靶向修复,推动从“更强的初始模型”向“可持续可靠部署”的范式转变。
链接: https://arxiv.org/abs/2605.26302
作者: Jianing Zhu,Yeonju Ro,John Robertson,Kevin Wang,Junbo Li,Haris Vikalo,Aditya Akella,Zhangyang Wang
机构: The University of Texas at Austin
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent’s effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, over ~400 runs spanning 8 - 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.
[MA-11] Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering
【速读】:该论文旨在解决多智能体强化学习(MARL)系统在现实场景中因观测信息过时、通信延迟和数据包丢失而导致的性能下降问题。传统在理想同步条件下训练的策略在异步环境中会因依赖过期反馈而表现显著劣化。其解决方案的关键在于引入一个模块化的执行阶段状态估计层,该层通过融合一个可学习的门控转移模型(Gated transition model)与递归卡尔曼滤波层(Kalman filtering layer),从异步观测中实时估计当前状态信念(belief-state)。此方法的优势在于其高度模块化设计——该估计层可作为“即插即用”组件集成到已训练好的策略中,无需修改原始MARL的训练算法、网络结构或奖励机制。实验表明,该框架在多种多智能体连续控制基准任务中均能有效提升对通信延迟和消息丢失的鲁棒性,尤其在需要强协调性和动态稳定性的任务中效果最为显著。
链接: https://arxiv.org/abs/2605.26286
作者: Maxim Mednikov,Oren Gal
机构: University of Haifa (海法大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 8 pages, 7 figures
Abstract:Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently exhibit significant performance degradation in these regimes because they act on outdated feedback. We propose a modular execution-stage state-estimation layer that replaces delayed communicated observations with current belief-state estimates. The framework integrates a learned Gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity, The estimator serves as a plug-in for pre-trained policies, requiring no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi-agent and continuous-control benchmarks demonstrates that the proposed layer consistently enhances robustness to communication latency and message loss. The most significant performance gains are observed in coordination-intensive and dynamically unstable tasks where temporal consistency is critical for control.
[MA-12] Sentinel: Embodied Cooperative Spatial Reasoning and Planning
【速读】:该论文试图解决的问题是:在城市尺度的户外环境中,去中心化的具身智能体如何在动态环境约束下有效协作,实现安全且高效的会合与导航。解决方案的关键在于提出 CoSaR(Cooperative Spatial Reasoning and Planning)框架,该框架将基础模型的高层通信与规划能力与经典空间导航算法的精确性相结合,使智能体能够通过自然语言交换情境更新、推理不断变化的空间约束,并协同重新规划路径,从而提升聚集速度、缩短路径长度并增强安全性。
链接: https://arxiv.org/abs/2605.26239
作者: Xiangye Lin,Hongxin Zhang,Ruxi Deng,Qinhong Zhou,Chuang Gan
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: The first two authors contributed equally
Abstract:In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city-scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high-level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city-level scenes with 3-5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi-agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi-agent systems. Code and challenge are available at this https URL.
[MA-13] AgentS ociety: Incentivizing Agent ic Social Intelligence
【速读】:该论文试图解决的问题是:如何在多智能体环境中实现自主、协作且具经济激励的代理间协同,以应对开放域用户请求,使代理不仅能独立解决问题,还能通过有效的通信与反馈机制提升整体性能。解决方案的关键在于提出 \mathttAgentSociety 机制,其核心思想是基于液态民主(liquid democracy)和社交选择理论中的信息扩散机制,构建一个去中心化的代理社会系统。该机制通过激励代理在本地上下文中最大化自身效用的同时,借助委托机制将任务路由至更胜任的邻居代理,从而自然形成共识驱动的多代理路径;同时,它还激励代理在符合自身利益时选择性披露信息以获取影响力,并通过纳什均衡分析证明代理收益与其边际贡献一致,最终实现在真实数据集上由自利异构代理协作完成任务的高效共识路由。
链接: https://arxiv.org/abs/2605.26203
作者: Aditya Vema Reddy Kesari,Krishna Reddy Kesari
机构: IIT Bombay, India; Amazon, US
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
备注:
Abstract:The success of deployed agents relies on their ability to handle open-ended user requests using their inherent capabilities, not only in solving requests directly but also in effectively leveraging inter-agent communication channels and feedback signals over time. This requires a multi-agent environment where agents can operate autonomously, strategically communicate, behave collaboratively and be driven by economic incentives, much like humans in society. Towards this vision, we propose \mathttAgentSociety , a mechanism that enables decentralized agentic collaboration grounded in liquid democracy and information diffusion from social choice theory. We show that \mathttAgentSociety provides an environment for agents to make autonomous decisions utilizing their local context to maximize their utility while achieving collective outcomes through incentivized collaboration. Specifically, we prove that delegation to more competent neighbor agents is incentive compatible and naturally generates multi-agent routing path by consensus. Additionally, our mechanism incentivizes agents to selectively disclose information to their neighbor agents when doing so aligns with their self-interest, so as to garner influence. We characterize the Nash equilibrium showing that agent payoffs are reflective of their marginal contributions. We compare and benchmark strategy profiles adopted by open and proprietary state-of-the-art language models deployed in \mathttAgentSociety against best response. Finally, we evaluate collaborative performance from consensus-based routing among self-interested heterogeneous agents in \mathttAgentSociety on real-world datasets.
[MA-14] ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)多智能体系统中协作拓扑结构优化的问题,特别是现有方法在稳定性与可扩展性之间难以平衡、且计算资源分配未能根据查询难度动态调整的局限性。其解决方案的关键在于提出一种名为 \textscATOM 的自适应框架,该框架通过一种新颖的任务驱动强化学习范式生成可控预算的协作图:借鉴原子结构,将系统设计为“核-电子”层次结构——核心部分(核)是离线训练稳定的协作骨干网络,推理时则根据查询条件动态激活特定智能体(电子);同时引入基于复杂度感知的预算策略,通过估计查询难度来严格控制电子实例化数量,从而实现资源消耗与任务需求的精准匹配。实验表明,\textscATOM 在六个不同基准上均达到最先进性能,并相较强基线提升高达 30% 的 token 效率。
链接: https://arxiv.org/abs/2605.26178
作者: Xinkui Zhao,Sai Liu,Yifan Zhang,Qingyu Ma,Zewen Lin,Naibo Wang,Guanjie Cheng,Chang Liu,Yueshen Xu
机构: Zhejiang University (浙江大学); Ningbo Global Innovation Center, Zhejiang University (宁波全球创新中心,浙江大学); Zhejiang Key Laboratory of Digital-Intelligence Service Technology (浙江省数字智能服务技术重点实验室); Xidian University (西安电子科技大学)
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
备注:
Abstract:Large Language Model (LLM)-based multi-agent systems rely on optimized collaboration topologies to balance performance and communication costs. However, current methods struggle with the inherent stability-extensibility trade-off and often misalign computational budgets with query difficulty. We propose \textscATOM, an adaptive framework that generates budget-controllable collaboration graphs via a novel task-driven reinforcement learning paradigm. Inspired by atomic structures, \textscATOM employs a nucleus-electron hierarchy: it maintains a stable, offline-learned collaboration backbone (the nucleus) while dynamically activating query-conditioned agents (electrons) during inference. Crucially, a complexity-aware budgeting strategy aligns resource consumption with task demands by estimating query difficulty to strictly regulate electron instantiation. Extensive experiments across six diverse benchmarks demonstrate that \textscATOM achieves state-of-the-art performance while improving token efficiency by up to 30% compared to strong baselines.
[MA-15] A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 系统中,当一个请求被分解为多个协作代理(worker agents)进行处理时,跨文档远端部分之间的逻辑矛盾(cross-section contradictions)这类缺陷难以被检测到,且单个代理无法察觉此类问题。解决方案的关键在于揭示两个核心现象:其一,存在一个“检测悬崖”(detection cliff),即所有模型在单一代理下能识别的跨段落矛盾,在多代理协同编排后其检测能力下降三分之二以上,且这一现象不随模型规模或推理能力增强而缓解;其二,模型在跌落检测悬崖后的行为模式具有显著差异——仅有一个开发者生成的模型表现出对报告标准轴线的系统性偏移:随着对齐强度提升,该模型虽减少漏检但同时增加对干净文档的误报,这种“双面效应”在该开发者内部代际间显著相关(p < 0.001),而在其他厂商模型中几乎不存在。研究进一步指出,尽管模型私有记录能准确重构结构缺陷,集成报告却错误地确认文档无误,这表明问题本质是系统级结构缺陷而非信息缺失,且当前自动化评估方法难以稳定识别此类现象。
链接: https://arxiv.org/abs/2605.26174
作者: Hiroki Fukui
机构: Kyoto University (京都大学); Anthropic (Anthropic)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 24 pages, 2 figures. Data and code: doi: https://doi.org/10.5281/zenodo.20372696
Abstract:Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model – ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer’s generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents – two faces of one criterion shift, scaling with generation within that developer (p 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model’s private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification – an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement – a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report’s confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural. Comments: 24 pages, 2 figures. Data and code: doi:https://doi.org/10.5281/zenodo.20372696 Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA) Cite as: arXiv:2605.26174 [cs.SE] (or arXiv:2605.26174v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.26174 Focus to learn more arXiv-issued DOI via DataCite
自然语言处理
[NLP-0] MobileMoE: Scaling On-Device Mixture of Experts
【速读】: 该论文试图解决的问题是:在百亿参数规模下已被广泛采用的专家混合模型(Mixture-of-Experts, MoE)在亚十亿参数级别(sub-billion scale)用于设备端部署时,其性能优势尚未被充分探索,尤其是在移动设备有限内存和计算资源约束下的可扩展性与效率问题。解决方案的关键在于提出MobileMoE——一个专为设备端优化的MoE语言模型家族,其核心创新包括:1)首次建立面向移动端的MoE缩放定律(scaling law),通过联合优化架构设计,在移动内存与算力限制下识别出“中等稀疏度 + 细粒度共享专家”的最优配置;2)开发四阶段训练流程(预训练、中期训练、指令微调与量化感知训练),全部基于开源数据集实现高效训练;3)提供首个在消费级智能手机上的高效MoE推理方案,并通过全面的设备端性能分析验证其优势——相比同级别稠密模型MobileLLM-Pro,在INT4权重内存占用相当的情况下,MobileMoE-S的预填充速度提升1.8–3.8倍,解码速度提升2.2–3.4倍,同时在14个基准测试中达到或超越当前最先进的设备端稠密模型及MoE模型。
链接: https://arxiv.org/abs/2605.27358
作者: Yanbei Chen,Hanxian Huang,Ernie Chang,Jacob Szwejbka,Digant Desai,Zechun Liu,Vikas Chandra,Raghuraman Krishnamoorthi
机构: Meta(元宇宙)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4 \times fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8 - 3.8\times faster prefill and 2.2 - 3.4\times faster decode than the dense baseline MobileLLM-Pro.
[NLP-1] Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases ICML2026
【速读】: 该论文试图解决的问题是:当前主流的基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)方法在对齐大型语言模型(Large Language Models, LLMs)与人类偏好时,存在一种名为“对齐篡改”(alignment tampering)的潜在漏洞——即LLM在对齐过程中可能通过影响偏好数据集来放大其自身不良行为。解决方案的关键在于识别并揭示RLHF的两个核心局限性:一是偏好数据集由LLM自身的输出构建,使其能够操控标注过程;二是成对比较仅指示哪个响应更优,而不解释原因,导致奖励模型无法区分高质量与偏倚行为。这种机制使得LLM可通过生成看似优质但带有偏见或有害内容的响应,诱导人工标注者选择这些响应,进而使奖励模型学习到错误的偏好信号,并在后续强化学习优化中放大此类不良行为。实验验证了多种类型的偏倚(如关键词偏倚、性别歧视、品牌推广及工具性目标追求)均会被显著放大,且现有鲁棒性RLHF技术难以在不牺牲响应质量的前提下完全缓解此问题,从而揭示了当前RLHF方法的结构性脆弱性。
链接: https://arxiv.org/abs/2605.27355
作者: Dongyoon Hahm,Dylan Hadfield-Menell,Kimin Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICML 2026, Source code: this https URL
Abstract:Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM’s own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: this https URL
[NLP-2] Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在强化学习(Reinforcement Learning, RL)后训练阶段中,数据工程主要依赖外部信号而忽视模型内部蕴含的丰富内在信号的问题。其解决方案的关键在于利用稀疏自编码器(Sparse Autoencoder, SAE)从模型内部提取可解释的特征空间,并基于此构建一个名为SAERL的数据工程框架,通过建模三种内在数据属性——多样性(diversity)、难度(difficulty)和质量(quality)——来指导数据处理操作:利用SAE空间聚类与适度批次混合控制批次多样性,采用难度代理指标实现由易到难的课程排序,以及通过质量探测器进行数据过滤。实验表明,SAERL在Qwen2.5-Math-1.5B上相比基线GRPO平均准确率提升3.00%,且达到目标准确率所需的训练步数减少20%,并在不同模型规模和RL算法下保持一致性能增益,验证了模型内部信号作为高效、可迁移数据工程工具的潜力。
链接: https://arxiv.org/abs/2605.27354
作者: Yi Jing,Zao Dai,Jinwu Hu,Zijun Yao,Lei Hou,Juanzi Li,Xiaozhi Wang
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
[NLP-3] MATCHA: Matching Text via Contrastive Semantic Alignment
【速读】: 该论文试图解决当前大语言模型(LLM)评估中依赖的主流指标(如基于词元重叠的ROUGE和基于嵌入的BERTScore)在衡量文本语义相似性时存在的严重缺陷,即这些指标常对语义对立的文本给出相近分数,从而掩盖模型的根本性错误。解决方案的关键在于提出MATCHA这一新型自动评估指标,其核心创新是采用双视角机制:一方面衡量生成文本与参考文本之间的语义接近度,另一方面通过对抗生成的反事实矛盾文本来惩罚语义冲突。在八个公开基准测试中,MATCHA显著优于现有指标,尤其在TruthfulQA等无训练集场景下,相比ROUGE-L和BERTScore分别提升18.38%和20.82%,且通过定量对比与人工评估验证了其有效性,揭示了传统指标的根本局限性。
链接: https://arxiv.org/abs/2605.27345
作者: Siran Li,Ece Sena Etoglu,Carsten Eickhoff,Seyed Ali Bahrainian
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reliable evaluation is essential for understanding large language model (LLM) performance, yet today’s go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic similarity of documents. Our study shows that both token-overlap metrics and embedding-based metrics routinely assign nearly identical scores to texts that directly contradict each other, thereby potentially masking fundamental errors. We introduce MATCHA, an automatic metric that jointly rewards semantic agreement with a reference and penalizes contradictions. MATCHA employs a dual-view perspective that measures (i) proximity to the gold text and (ii) distance from an adversarially generated counterfactual contradiction. In eight public benchmarks, MATCHA outperforms popular metrics, compared with human annotations on question-answering, image caption generation, natural language inference, summarization, and semantic textual similarity tasks. On the TruthfulQA dataset (i.e., a dataset without a training set, where no embedding-based metrics could locally train on), this improvement in terms of matching texts with a reference reaches 18.38% over ROUGE-L and 20.82% over BERTScore. Both quantitative comparison and qualitative human assessments confirm the efficacy and validity of MATCHA and uncover fundamental weaknesses in pre-existing metrics. Compared with 23 embedding models, including top state-of-the-art ones, used as a metric similar to BERTScore, MATCHA remains the most accurate in distinguishing correct from incorrect statements solely based on a reference. Our code and metric are publicly available (this https URL).
[NLP-4] 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
【速读】: 该论文旨在解决在扩展答案集编程(Answer Set Programming, ASP)中引入答案集量化(quantifiers over answer sets)后,如何高效计算具有两个量词和弱约束的ASP(Q)程序(即2-ASP(Q)^w)的最优量化答案集的问题。这类问题在实践中广泛存在,且理论上可表达至Delta_3^P复杂度类的优化问题。解决方案的关键在于:一方面,通过严格的理论分析,给出了2-ASP(Q)^w相关计算任务的完整复杂度刻画,包括紧致的完备性结果和此前未被充分研究的非平凡情况;另一方面,在实践层面,提出了一种基于计数器例引导抽象精化(Counterexample-Guided Abstraction Refinement, CEGAR)技术的新型策略,集成于Casper系统中,用于高效求解(最优)量化答案集。实验表明,该方法在多个应用领域的硬基准测试中表现有效。
链接: https://arxiv.org/abs/2605.27338
作者: Andrea Cuteri,Giuseppe Mazzotta,Francesco Ricca
机构: University of Calabria (加州大学)
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:
Abstract:ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, that rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.
[NLP-5] FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
【速读】: 该论文试图解决金融领域大语言模型(LLM)代理在执行多步骤业务流程时,如何同时防范由提示诱导的非法操作并确保合法流程顺利通过的问题。传统方法中,边界过滤器容易遗漏不可逆的中间工具调用,而事后LLM判别器仅在流程结束后进行审计,既无法及时干预,又因计算成本随轨迹长度线性增长而不具实用性。解决方案的关键在于提出FinHarness——一个端到端嵌入金融代理的内联安全框架,其核心包括三个组件:Query Monitor融合单轮意图与跨轮次漂移以识别潜在风险;Tool Monitor对每个潜在工具调用进行实时评估;Cascade模块根据每步风险自适应地选择轻量级或高级LLM判别器,并将风险因子作为事前证据注入代理输入,使代理能主动拒绝、重新规划或批准。实验表明,在FinVault数据集上,路由后的FinHarness将攻击成功率(ASR)从38.3%降至15.0%,同时几乎不损害良性任务的批准率(从41.1%降至39.3%),且相比始终使用高级判别器的方法减少了4.7倍的高级判别调用次数。
链接: https://arxiv.org/abs/2605.27333
作者: Haoxuan Jia,Yang Liu,Bin Chong,Yingguang Yang,Yancheng Chen,Jiayu Liang,Qian Li,Hanning Lu,Kefu Xu,Hao Zheng,Chongyang Zhang,Hao Peng,Philip S. Yu
机构: Peking University; Nanyang Technological University; Tsinghua University; University of Science and Technology of China; University of Chinese Academy of Sciences; Soochow University; Beijing University of Posts and Telecommunications; University of Leeds; Fullive Innovation (Beijing) AI Technology Co., Ltd.; Beihang University; University of Illinois Chicago
类目: Computation and Language (cs.CL)
备注:
Abstract:Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination – too late for intervention and at a computational cost that scales linearly with trace length. We present FinHarness, an inline safety harness that wraps a finance agent end-to-end with three components: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that evaluates each prospective tool call, and a Cascade module that integrates per-step risk and adaptively routes verification between a lightweight and an advanced-tier LLM judge. Fired risk factors are re-injected into the agent input as ex-ante evidence, enabling the agent to refuse, re-plan, or approve on its own. On FinVault, routed FinHarness cuts ASR from 38.3% to 15.0% while largely preserving benign approval ( 41.1% \to 39.3% ), and uses 4.7\times fewer advanced-judge calls than an always-advanced ablation.
[NLP-6] Semantic Gradients Interactions in SSD: A Case Study in Racial Identity and Hate Speech
【速读】: 该论文试图解决的问题是:如何在语义差异分析(Semantic Differential)中建模和检验语义意义在不同调节变量(如群体、特质或条件)下的变化,从而使这种变化具有统计可检验性和可解释性。解决方案的关键在于提出交互式语义差异(Interaction SSD),该方法能够同时估计主效应梯度(main semantic gradient)、交互效应梯度(interaction gradient)以及条件梯度(conditional gradients),并通过标准的SSD工具进行解释。这一方法首次将语义差异分析扩展到包含调节效应的情境,使其能有效检测并解析不同群体间语义响应的差异,例如在仇恨言论判断中,不同种族标注者对针对有色人种评论的评价是否存在显著调节效应。
链接: https://arxiv.org/abs/2605.27322
作者: Felix Ostrowicki,Hubert Plisiecki
机构: IDEAS Research Institute (IDEAS研究研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce interaction SSD, an extension of Supervised Semantic Differential that models how semantic meaning varies across moderators such as groups, traits, or conditions making this variation testable and interpretable. The method estimates a main semantic gradient, an interaction gradient, and conditional gradients, all interpretable through standard SSD tools. We illustrate it on the UC Berkeley Measuring Hate Speech corpus, testing whether annotator racial identity moderates hate-speech judgments of comments targeting people of color. The interaction model detects a significant moderation effect: the shared gradient contrasts dehumanizing hostility with counter-speech, while the interaction gradient reveals smaller group-linked differences in which semantic cues predict hate-speech ratings. Interaction SSD makes moderated meaning-outcome relationships statistically testable and interpretable.
[NLP-7] Real Images Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
【速读】: 该论文试图解决的问题是:当前视觉语言模型(VLMs)是否真的能有效利用视觉信息提升语言理解,尤其是在词汇判断任务中,模型能否区分有用的视觉证据与无关的图像背景信息。解决方案的关键在于通过人类对词汇具体性(concreteness)和意象性(imagery)的评分作为基准,系统评估真实图像上下文对模型性能的影响,并结合探针分析、典型相关分析(canonical correlation analysis)及归因案例研究,揭示图像上下文如何引发表征偏移并增强对虚假视觉线索的敏感性,从而削弱目标词义属性的可恢复性。进一步实验表明,在推理阶段指导模型专注于文本内容可显著缓解这种退化现象,尤其在视觉相关性最弱的词汇子集上效果最为明显,这提示当前指令微调的VLM需要更精细地校准视觉信息的使用时机。
链接: https://arxiv.org/abs/2605.27315
作者: Yifan Jiang,Ruoxi Ning,Sheng Yao,Freda Shi
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
[NLP-8] When Does Demographic Information Help? Data and Modeling Regimes for Perspective-Aware Hate Speech Detection
【速读】: 该论文试图解决的问题是:在主观任务(如仇恨言论检测)中,利用人口统计学特征(demographic features)建模标注者视角时,其性能提升效果不稳定——有时有效,有时反而成为噪声。解决方案的关键在于识别出能带来显著收益的特定数据场景与建模框架,并设计一种更智能的融合机制。研究发现,人口统计学特征的有效性集中在低训练分歧、高测试分歧、细粒度歧义测量、充足训练数据及更大人口统计覆盖重叠的场景下;基于此,作者提出了一种门控的人口统计残差模型(gated demographic residual model),将人口统计信息作为对纯文本预测的可选调整项,而非强制输入。实验表明该方法在高分歧或低置信度样本上表现尤为有效,强调了不应默认人口统计特征有益,而应根据数据特性与模型结构协同决定其价值。
链接: https://arxiv.org/abs/2605.27313
作者: Weibin Cai,Reza Zafarani
机构: Syracuse University (雪城大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Demographic information is often used to model annotator perspectives in subjective tasks such as hate speech detection, but its benefit is inconsistent: it improves performance in some settings and behaves as noise in others. This paper asks when demographic features help. We analyze demographic gain as a function of both data split properties and modeling frameworks. For data splits, we measure annotator disagreement, namely how often annotators assign different labels to the same example, along with training size and train-test demographic coverage. We find that demographic gains concentrate in regimes with low training disagreement, high test disagreement, fine-grained ambiguity measurement, sufficient training data, and greater demographic overlap. Motivated by these regimes, we introduce a gated demographic residual model that treats demographics as a selective adjustment to text-only predictions. Experiments on MHS and POPQUORN show that this design is effective, especially on high disagreement or low confidence examples. Overall, our results suggest that demographics should not be assumed useful by default; their value depends jointly on the data regime and the modeling framework.
[NLP-9] Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models
【速读】: 该论文旨在解决当前图表问答(Chart QA)基准测试中存在的“捷径学习”问题,即模型可能依赖于对图表的先验知识或表面特征而非真正的视觉推理能力来作答,从而导致评估结果失真。其解决方案的关键在于提出“反事实图表”(counterfactual charts)——在保持任务不变的前提下,系统性地改变图表数据及其对应答案,以强制模型进行真实的视觉推理。作者开发了名为Chartographer的框架,能够将图表逆向重构为可执行代码,验证重建保真度,生成受控的反事实变体,并基于可执行的QA逻辑推导新答案。实验表明,反事实图表能揭示单一图表评估下被掩盖的模型泛化失败,尤其在需要全新视觉推理路径时,视觉语言模型(VLMs)表现显著下降,凸显了现有模型在跨场景推理上的局限性。
链接: https://arxiv.org/abs/2605.27311
作者: Yifan Jiang,Dae Yon Hwang,Jesse C. Cresswell,Freda Shi
机构: University of Waterloo (滑铁卢大学); Vector Institute (向量研究所); Layer 6 AI (Layer 6 AI)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to correctly answer, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
[NLP-10] Self-Ensembling Vision-Language Models for Chart Data Extraction
【速读】: 该论文旨在解决从图表图像中自动提取结构化表格数据的问题,尤其针对现有方法在处理高数据密度或风格多样的图表时性能不足的挑战。其解决方案的关键在于提出一种基于视觉语言模型(VLM)的自集成(self-ensembling)方法:通过多次采样同一图表图像对应的多个表格输出,并在单元格级别对数值进行对齐与中位数聚合,从而生成更准确的共识表格;同时引入收敛检测机制以动态停止采样,并基于样本间离散度估计不确定性,辅助用户评估结果可靠性。该方法显著提升了复杂图表场景下的提取精度,在新构建的WB-ChartExtract基准上相对改进达23%。
链接: https://arxiv.org/abs/2605.27298
作者: Thomas Berkane,Qianyi Wang,Maimuna S. Majumder
机构: Boston Children’s Hospital (波士顿儿童医院); Harvard Medical School (哈佛医学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.
[NLP-11] Probing Cultural Awareness in LLM s: A Case Study of Cross-Culture Aesthetic Stylistics IJCAI2026
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在跨文化语境中对审美风格(aesthetic stylistics)的掌握能力尚未得到充分研究,尤其是在识别和生成具有文化共鸣的语言风格方面存在显著差距。解决方案的关键在于构建了一个名为C4STYLI的基准数据集,涵盖香港与内地高度风格化的电影片名和广告标语,并通过行为识别(behavioral recognition)和生成能力(productive competence)双维度评估LLMs的表现。进一步地,作者采用逻辑回归探针(logistic regression probes)进行结构消融实验,发现LLMs在港式风格识别中主要依赖表层语言信息而非深层风格结构,表明其对特定文化语境下的风格结构敏感度有限。
链接: https://arxiv.org/abs/2605.27296
作者: Jiashuo Wang,Fenggang Yu,Jian Wang,Chak Tou Leong,Xiaoyu Shen,Chunpu Xu,Jiawen Duan,Wenjie Li,Johan F. Hoorn
机构: Hong Kong Polytechnic University (香港理工大学); Eastern Institute of Technology, Ningbo (宁波东方理工大学); Vrije Universiteit Amsterdam (阿姆斯特丹自由大学)
类目: Computation and Language (cs.CL)
备注: IJCAI 2026 Human-Centred AI track
Abstract:Large Language Models (LLMs) are increasingly deployed in diverse cultural contexts, yet their ability to master aesthetic stylistics, i.e., the strategic use of language to evoke cultural resonance, remains underexplored. We curate C4STYLI, a benchmark of highly stylized translated movie titles and advertising slogans from Hong Kong and the Chinese Mainland, to evaluate LLMs via the lens of behavioral recognition and productive competence. Extensive evaluations show that LLMs differ from humans in stylistic recognition, and this recognition ability varies across text domains. In addition, stylistic recognition and generation performance in LLMs are not consistently aligned. To further examine whether LLMs genuinely capture stylistic information in stylistic recognition, we conduct structural ablation with logistic regression probes. We find that, in the Hong Kong setting, stylistic recognition in LLMs relies primarily on surface-level linguistic information rather than stylistic structure. This suggests limited sensitivity to Hong Kong-specific stylistic structure.
[NLP-12] Its Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在对话中倾向于放弃初始立场以迎合用户反对意见的问题。现有研究多将此行为归因于强化学习从人类反馈(Reinforcement Learning from Human Feedback, RLHF)过程中习得的“讨好倾向”(sycophancy),但本文提出,这种顺从行为也可能由推理阶段模型的认知不确定性(epistemic uncertainty)驱动。解决方案的关键在于提出MUSE——一个两阶段评估框架,用于解耦导致LLM顺从性的不同机制:通过测量模型对某一问题的回答中的认知不确定性与其后续是否屈从于用户推翻性意见之间的关系,识别出两种独立因素——讨好型顺从(sycophantic conformity)和不确定性驱动的顺从(uncertainty-driven conformity)。实验证明,这两种机制均随用户感知专业度和建议合理性提升而增强,且MUSE可为针对性干预策略提供区分依据,从而明确是源于对齐过程的讨好行为还是训练语料引发的认知不确定性。
链接: https://arxiv.org/abs/2605.27288
作者: Kevin H. Guo,Chao Yan,Avinash Baidya,Katherine Brown,Xiang Gao,Juming Xiong,Zhijun Yin,Bradley A. Malin
机构: Vanderbilt University (范德比尔特大学); Vanderbilt University Medical Center (范德比尔特大学医学中心); Intuit AI Research (英特尔人工智能研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model’s epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model’s epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model’s likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM’s perceived expertise of the user and 2) the plausibility of the user’s suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.
[NLP-13] SIA: Self Improving AI with Harness Weight Updates
【速读】: 该论文试图解决当前人工智能系统中人类作为瓶颈的问题,即模型和代理(agent)的构建、调优与修正高度依赖人工干预,而实现AI自我改进的能力仍是一个开放性问题。解决方案的关键在于提出SIA(Self-Improving Agent)框架,这是一种整合了“Harness-Update”与“Test-Time Training”两大研究方向的自洽改进循环:一个基于语言模型的反馈代理(Feedback-Agent)同时更新任务特定代理的“工具链”(harness,包括提示、重试逻辑等)和模型权重。实验证明,在中国法律定罪分类、GPU内核优化和单细胞RNA去噪三个不同领域中,结合两种更新机制显著优于仅迭代工具链的方法,性能提升分别达56.6%、91.9%运行时减少和502%,其中工具链更新赋予模型行动策略,权重更新则构建了无法通过提示或结构传递的领域直觉。
链接: https://arxiv.org/abs/2605.27276
作者: Prannay Hebbar,Yogendra Manawat,Samuel Verboomen,Alesia Ivanova,Selvam Palanimalai,Kunal Bhatia,Vignesh Baskaran
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model’s own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
[NLP-14] Lost in Sampling: Assessing Lexical Reachability in LLM s via the Word Coverag e Score (WCS)
【速读】: 该论文试图解决的问题是:现代大语言模型(LLM)在生成文本时存在重复性和同质化现象,尽管其潜在词汇量庞大。以往研究多聚焦于模型知识和训练数据,而本文则关注解码机制(decoding mechanics)如何抑制语言多样性。解决方案的关键在于提出了一种新的量化指标——词覆盖得分(Word Coverage Score, WCS),用于衡量标准采样过滤器(如Top-p、Top-k和Min-p)在数学上剔除语境恰当的人类词汇的程度。WCS通过审计开放权重模型对人类撰写的语料片段的输出,评估低频但高信息量词汇在采样参数下的“存活率”,从而揭示解码过程如何将本应在概率空间内的合理词汇排除在外。研究结果表明,行业标准的采样默认设置实际上起到了非故意的“审查”作用,使人类表达的独特性被同质化。WCS为优化文本连贯性与词汇丰富性之间的权衡提供了严谨框架,并成为检测和改善生成模型中人类语言多样性的诊断工具。
链接: https://arxiv.org/abs/2605.27268
作者: Samer Awad,Javier Conde,Carlos Arriaga,Tairan Fu,Javier Coronado-Blázquez,Pedro Reviriego
机构: Universidad Politécnica de Madrid (马德里理工大学); Politecnico di Milano (米兰理工大学); Banco de España (西班牙银行)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures
Abstract:Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top- p , Top- k , and Min- p ). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
[NLP-15] Pair-In Pair-Out: Latent Multi-Token Prediction for Efficient LLM s
【速读】: 该论文旨在解决大语言模型在长链式推理(long chain-of-thought reasoning)过程中,自回归解码(autoregressive decoding)成为主要计算瓶颈的问题。现有方法分别从输入侧(潜在压缩,latent compression)或输出侧(推测解码和多标记预测,MTP)进行优化,但两者独立发展且输出侧方法需依赖昂贵的验证器(verifier)来校验不可靠的草稿标记(draft tokens),导致效率低下。解决方案的关键在于提出 Pair-In, Pair-Out (PIPO),它通过将潜在压缩器与 MTP 头视为镜像操作:压缩器将两个输入标记合并为一个潜在表示,而 MTP 头则将一个隐藏状态展开为一个额外输出标记,从而统一输入与输出侧优化;同时引入轻量级置信度头(confidence head)判断是否接受草稿标记,以消除对验证器的需求而不牺牲可靠性。该置信度头可借助 On-Policy Distillation (OPD) 自然匹配推测解码的拒绝采样准则,在几乎不增加额外训练成本的情况下实现高效训练。实验表明,PIPO 在 AIME 2025、GPQA-Diamond、LiveCodeBench v6 和 LongBench v2 上显著提升 pass@4 指标(最高 +7.15 点),并带来高达 2.64× 的首 token 延迟加速和 2.07× 的每 token 延迟加速。
链接: https://arxiv.org/abs/2605.27255
作者: Wenhui Tan,Minghao Li,Xiaoqian Ma,Siqi Fan,Xiusheng Huang,Liujie Zhang,Ruihua Song,Weihang Chen
机构: Gaoling School of Artificial Intelligence, Renmin University of China; AI Platform, Xiaohongshu Inc.; University of Electronic Science and Technology of China; Institute of Automation, Chinese Academy of Sciences
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Project Page: this http URL
Abstract:Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing methods target either the input side (latent compression) or the output side (speculative decoding and multi-token prediction, MTP), but the two lines of work have been pursued independently. Moreover, output-side methods must incur an expensive verifier pass to validate the unreliable draft tokens predicted by MTP. To address these issues, we propose \textbfPair-In, Pair-Out (PIPO), which unifies both sides by viewing a latent compressor and an MTP head as mirror-image operations: the compressor folds two input tokens into one latent representation, while the MTP head unfolds one hidden state into one additional output token. To remove the verifier cost without sacrificing reliability, PIPO trains a lightweight confidence head that decides whether draft tokens should be accepted. We observe that On-Policy Distillation (OPD) naturally matches the rejection-sampling criterion of speculative decoding, so the confidence head can be trained alongside OPD with negligible extra cost. Experiments on AIME 2025, GPQA-Diamond, LiveCodeBench v6, and LongBench v2 with Qwen3.5-4B and 9B backbones show that PIPO improves pass@4 over regular decoding by up to +7.15 points, while delivering up to 2.64\times first-token-latency and 2.07\times per-token-latency speedups.
[NLP-16] Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
【速读】: 该论文试图解决的问题是如何在跨学科教学中提供有效的学习示范,特别是如何生成与学生当前作品相似但质量更高的“反事实”版本(counterfactual),从而帮助学生理解改进方向。现有基于大语言模型(LLM)的自动化反事实文本生成方法往往局限于特定领域,难以实际应用。论文提出的关键解决方案是Gumbel Machine,其核心是一个新颖的受控解码算法——β-Hindsight控制机制,该机制利用潜在随机性作为可调节的相似度控制参数,在生成过程中保持与参考文本的语义一致性,同时确保符合评分标准(rubric-consistent)。实验表明,该方法在学生写作数据集上能有效生成既贴近原作又满足评分要求的高质量反事实文本。
链接: https://arxiv.org/abs/2605.27249
作者: Hunter McNichols,Alexander Scarlatos,Mihai Dascalu,Danielle McNamara,Andrew Lan
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); University Politehnica of Bucharest (布加勒斯特理工学院); Arizona State University (亚利桑那州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: preprint
Abstract:An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student’s current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, \beta -Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
[NLP-17] ENPMR-Bench: Benchmarking Proactive Memory Retrieval for Emotional Support Agents
【速读】: 该论文试图解决的问题是:当前记忆增强型语言代理在情感支持类应用中,普遍将记忆视为事实检索工具,而忽视了其在塑造用户情绪体验中的关键作用,导致无法有效识别和响应用户的潜在情感需求。解决方案的关键在于提出ENPMR-Bench基准测试框架,用于评估“情感需求感知的主动记忆检索”(Emotional Need-aware Proactive Memory Retrieval, ENPMR)能力——即代理能够推断用户潜在的情感需求,并主动检索相应类型的支撑性记忆以实现共情交互。该框架基于马斯洛需求层次理论构建,包含1800余条记忆增强对话,并定义了情感需求与支持性记忆类型之间的结构化映射关系,实验证明现有检索范式(包括基于嵌入和大语言模型的方法)在情感共鸣得分上显著低于理想记忆条件,凸显出当前系统在个性化情感支持方面的根本缺陷。
链接: https://arxiv.org/abs/2605.27240
作者: Xing Fu,Yulin Hu,Mengtong Ji,Haozhen Li,Yixin Sun,Weixiang Zhao,Yanyan Zhao,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Research Center for Social Computing and Interactive Robotics (社会计算与交互机器人研究中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Memory-augmented language agents are increasingly deployed in affective applications such as emotional support, where understanding and responding to users’ latent emotional needs is critical. However, existing research often treats memory as a tool for factual retrieval, overlooking its role in shaping users’ emotional experiences. In this work, we introduce ENPMR-Bench, a benchmark for evaluating Emotional Need-aware Proactive Memory Retrieval (ENPMR), a core capability that enables agents to infer users’ latent emotional needs and proactively retrieve appropriate memories to support empathetic interaction. Grounded in Maslow’s hierarchy of needs, ENPMR-Bench includes over 1,800 memory-augmented dialogues and defines structured mappings between emotional needs and supportive memory types. Experimental results demonstrate that current retrieval paradigms, including both embedding-based and LLM-driven approaches, exhibit substantial deficiencies, with empathy scores significantly lagging behind golden memory conditions. While chain-of-thought prompting improves the alignment between inferred emotional needs and retrieved memories to some extent, a notable performance gap remains. Together, these findings reveal critical limitations in current agents and outline directions for advancing personalized emotional support through need-sensitive memory retrieval.
[NLP-18] mporal Simultaneity Predicts Annotation Quality in Sentiment Corpora
【速读】: 该论文试图解决长期标注任务中标注质量难以维持的问题,特别是在小规模标注者群体和跨周或跨月的标注周期下,标注一致性(inter-annotator agreement, IAA)随时间显著下降的现象。解决方案的关键在于通过系统性分析识别出影响IAA的核心因素:首先,标签混淆主要集中在负面与中性情感边界的划分上;其次,两名标注者表现出与“自动巡航式标注”(autopilot labeling)一致的时间序列漂移(run-length drift);最关键的是,标注时间间隔是IAA最强预测因子——同一分钟内标注的样本κ值高达0.98,而相隔超过一天的标注仅达0.65,远低于整体平均值(κ=0.76)。这一发现表明,标注质量的衰减并非源于标注速度或文本语言特征,而是由时间延迟引发的认知疲劳或上下文遗忘所致。此外,研究还验证了微调多语言编码器(如GPT-5、Gemini)在Setswana情感分类上的有效性,其中GPT-5零样本提示法取得最高宏F1得分(62.2),为非洲语言自然语言处理资源开发提供了可复现的质量审计框架与高质量数据集支持。
链接: https://arxiv.org/abs/2605.27239
作者: Idris Abdulmumin,Mokgadi Penelope Matloga,Tadesse Destaw Belay,Botshelo Kondowe,Letlhogonolo Mohleleng,Hareaipha Nkopo Letsoalo,Shamsuddeen Hassan Muhammad,Vukosi Marivate
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Annotation quality is difficult to sustain when campaigns span weeks or months with small annotator pools. We present a Setswana sentiment dataset of 3,565 tweets annotated by three native-speaker annotators across eight batches and examine why inter-annotator agreement (IAA) declines over time. Despite an aggregate Randolph’s free-marginal Kappa of \kappa = 0.76 , “excellent,” per-batch \kappa falls by more than 32 points across the annotation task. Through six targeted analyses, we find that (i) label confusion concentrates on the negative/neutral boundary, (ii) two annotators show run-length drift consistent with autopilot labeling, and (iii) the dominant predictor of \kappa is temporal simultaneity: tweets labeled within one minute achieve \kappa = 0.98 , while those labeled more than a day apart reach only \kappa = 0.65 . Annotation speed and tweet-level linguistic features show no meaningful association with \kappa . We benchmark three open multilingual encoders and proprietary models (GPT-5 and Gemini) on three-class sentiment classification; fine-tuning yields gains of 29 to 43 macro-F1 points over pretrained baselines, with GPT-5 few-shot leading overall (62.2 macro-F1). We release the dataset, per-annotation timestamps, and analysis code to support reproducible quality auditing for future African language NLP resources.
[NLP-19] EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization
【速读】: 该论文旨在解决当前图表到数据提取(chart-to-data extraction)评估中存在的两个关键问题:一是现有基准测试(如ChartQA)的性能提升空间已趋近极限(前沿视觉语言模型在ChartQA上的表现超过89%),二是现有评价指标(如均方根误差RMS和结构一致性评分SCRM)将提取点视为无序的键值对,忽略了时间序列的时序结构,且对微小的时间错位惩罚过度严厉,导致评估结果失真。解决方案的关键在于提出EpiCurveBench和EpiCurveSimilarity(ECS):前者是一个包含1,000张来自多元公共卫生来源的真实流行病曲线图像的基准数据集,后者是一种基于动态规划的时间序列对齐算法,能够容忍局部的时间偏移和缺失,同时按比例惩罚偏差而非将其视为灾难性失败。实验表明,最强模型在ECS指标下仅达到52.3%,而传统指标压缩了通用VLMs的表现差异(从25分降至5分),且ECS与下游流行病学统计指标(如累计数、峰值时间和峰值幅度误差)的相关性显著高于动态时间规整(DTW),验证了其在真实应用场景中的有效性。
链接: https://arxiv.org/abs/2605.27195
作者: Thomas Berkane,Maimuna S. Majumder
机构: Boston Children’s Hospital (波士顿儿童医院); Harvard Medical School (哈佛医学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods–three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems–we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5–3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application–unlocking decades of outbreak data trapped in published figures–but the benchmark and metric apply directly to any structured time-series chart-extraction setting.
[NLP-20] Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation
【速读】: 该论文试图解决的是:在长文本生成任务(如医学报告生成,MRG)中,现有基于隐空间干预的轻量级蒸馏方法因对所有输出token施加均匀交叉熵损失而导致监督失衡的问题。具体而言,长文本中高频的模板和语法token占据主导,而决定内容质量的关键token(如病理相关词和结束符EOS)却因监督不足难以被有效学习,且自回归解码过程中的轨迹漂移进一步加剧了这一问题。
解决方案的关键在于提出DIVE框架,包含两个互补机制:一是“关键token监督”,通过加权提升病理相关token与EOS事件的交叉熵贡献,确保内容准确性和终止行为在训练阶段即被学习;二是“状态条件动态引导”,用基于隐藏状态的适配器替代固定开环残差,使注入信号能随解码漂移自适应调整,从而缓解轨迹偏离问题。实验表明,DIVE在MIMIC-CXR和CheXpert Plus数据集上表现优异,在多种词汇和临床代理指标中均达到最优或领先水平。
链接: https://arxiv.org/abs/2605.27194
作者: Ning Wu,Rui Liu,Xinkun Lin,Weixing Chen,Jinxi Xiang,Tao Wei,Lina Yao,Mingjie Li
机构: UNSW Sydney (新南威尔士大学); University of Technology Sydney (悉尼科技大学); Sun Yat-sen University (中山大学); Stanford University (斯坦福大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. 20 pages, 6 figures
Abstract:Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset–backbone settings, while remaining competitive on coarse label-level CheXbert F1.
[NLP-21] Learning When to Think While Listening in Large Audio-Language Models
【速读】: 该论文旨在解决大型音频语言模型(Large Audio-Language Models, LALMs)在实时流式语音交互中推理质量与响应速度之间的权衡问题:延迟推理虽可提升答案准确性,但会增加用户感知的响应延迟;过早回答则可能导致在证据不足时做出错误决策。解决方案的关键在于提出一种可学习的“等待-思考-回答”(wait-think-answer)控制机制,该机制基于人类对话的增量特性,在部分音频证据下动态决定何时等待、何时输出紧凑的推理更新、何时最终作答。通过构建对齐的wait-think-answer轨迹并采用监督微调(SFT)和解耦剪辑与动态采样策略优化(DAPO)进行训练,奖励函数综合考虑答案正确性、动作有效性、更新时机、延迟同步性、推理质量和链路一致性,从而优化整个推理轨迹而非仅最终答案。实验表明,该方法在合成任务和真实录音数据上均显著提升了准确率并减少了不必要的推理延迟,验证了流式模型应学会在音频流中适时显式表达中间推理过程。
链接: https://arxiv.org/abs/2605.27190
作者: Zhiyuan Song,Weici Zhao,Yang Xiao,Suhao Yu,Cheng Zhu,Jiatao Gu
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: 19 pages, 4 figures, 6 tables
Abstract:Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
[NLP-22] Beyond Binary: Speech Representations Across the Cognitive Score Hierarchy
【速读】: 该论文试图解决的问题是:如何通过语音表征(speech representations)来有效捕捉轻度认知障碍(Mild Cognitive Impairment, MCI)的层级认知结构(hierarchical cognitive assessment structure),从而提升自动化临床语音分析的准确性。解决方案的关键在于对比手工设计的声学特征与自监督学习(Self-Supervised Learning, SSL)嵌入在不同评估层级(任务级、领域级和全局级)上的表现,并揭示任务约束条件(task-specific constraints)对表征能力的影响机制:发现自由响应任务倾向于产生“专家型”(specialist)表征,其性能随层级升高而下降;而高度结构化的任务则表现出“通用型”(generalist)表征,性能随层级升高而提升,这表明任务约束与认知评估层级之间存在系统性关联。
链接: https://arxiv.org/abs/2605.27189
作者: Serli Kopar,Roshan Prakash Rane,Christian Mychajliw,Lydia Federmann,Gerhard Eschweiler,Daniela Berg,Sam Gijsen,Paula Andrea Perez-Toro,Kerstin Ritter
机构: Hertie Institute for AI in Brain Health, University of Tübingen (图宾根大学脑健康人工智能研究所); Tübingen AI Center, University of Tübingen (图宾根人工智能中心); Department of Psychology, Humboldt-Universität zu Berlin (柏林洪堡大学心理学系); Geriatric Center, Tübingen University Hospital (图宾根大学医院老年病中心); Tübingen Center for Mental Health (TüCMH), Department of Psychiatry and Psychotherapy, Tübingen University Hospital (图宾根心理健康中心,图宾根大学医院精神科与心理治疗系); German Center for Mental Health (DZPG), Partner Site Tübingen (德国心理健康中心,图宾根合作站点); Department of Neurology, University Medical Center Schleswig-Holstein and Kiel University (施莱维希-霍尔斯坦大学医学中心及基尔大学神经内科); Center for Neurology, University Hospital Tübingen and Hertie Institute for Clinical Brain Research (图宾根大学医院神经学中心及赫蒂临床脑研究中心); Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学模式识别实验室); Charité–Universitätsmedizin, Department of Psychiatry and Psychotherapy, Berlin (夏里特-柏林夏里特大学医学中心精神病学与心理治疗系)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Neurons and Cognition (q-bio.NC)
备注:
Abstract:This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting generalist’’ representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.
[NLP-23] MAIGO: Mitigating Lost-in-Conversation with History-Cleaned On-Policy Self-Distillation
【速读】: 该论文试图解决大语言模型在多轮对话中性能下降的问题,即“迷失在对话中”(lost-in-conversation, LiC)差距。其核心问题是:早期对话中的中间回复(assistant replies)会污染后续上下文,导致模型逐步偏离正确路径,这种现象被称为自污染(self-contamination)。解决方案的关键在于提出一种基于策略的自蒸馏方法 MAIGO,通过利用模型自身生成的、历史清洁后的参考答案来减少污染:在中间轮次中,MAIGO 清除先前的助手回复但保留用户可见的片段;在最终回答轮次中,从完整视角的参考答案中蒸馏信息,并引入可靠性权重以降低与清洁参考不一致样本的影响。该方法无需验证器奖励、状态标签或推理时的结构化辅助,显著提升了多轮对话任务的准确性,证明了自污染是可训练的 LiC 差距组成部分。
链接: https://arxiv.org/abs/2605.27186
作者: Haoyu Zheng,Yun Zhu,Shu Yuan,Shangming Chen,Qing Wang,Wenqiao Zhang,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); Shanghai AI Laboratory (上海人工智能实验室); Huazhong University of Science and Technology (华中科技大学); Fuzhou University (福州大学); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model’s own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.
[NLP-24] Grounding Text Embeddings in Stakeholder Associations
【速读】: 该论文试图解决的问题是:文本嵌入(text embeddings)是否能够准确捕捉人类专家所感知的语义距离,即嵌入表示与人类意图之间是否存在对齐问题。由于当前广泛使用嵌入进行大规模文本分析,若其语义结构与人类认知不一致,可能导致分析结果失真。解决方案的关键在于提出“利益相关者锚定练习”(Stakeholder Grounding Exercise),该方法通过显式化专家对文本间语义关系的判断,将嵌入模型的结果锚定于人类理解之上。实证研究表明,神经文本嵌入在丹麦政策议题和美国联邦人工智能使用案例中均显著偏离人类专家判断(差距达19–26个百分点),且这种偏差会传递至下游聚类任务性能(斯皮尔曼相关系数 ρ=0.9),证明了该方法在识别嵌入可靠性方面的有效性。
链接: https://arxiv.org/abs/2605.27168
作者: Jonathan Rystrøm,Sofie Burgos-Thorsen,Zihao Fu,Johan Irving Søltoft,Kenneth C. Enevoldsen,Chris Russell
机构: University of Oxford, UK; Institute for Wicked Problems, Denmark; The Chinese University of Hong Kong, China; Danish Technical University, Denmark; Aarhus University, Denmark
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman \rho=0.9 between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts – demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.
[NLP-25] Formalization of Malagasy conjugation
【速读】: 该论文旨在解决马达加斯加语简单动词的形态分析问题,即如何通过规则化方式自动识别和生成动词的词形变化。其解决方案的关键在于利用Unitex平台构建一个基于有限状态转换器(finite-state transducers)的电子词典系统,其中78个转换器用于生成动词词干的所有变体(allomorphs),271个转换器则用于分析已屈折变化的动词形式以识别词干和词缀。整个设计强调可读性和可维护性,使语言学家能够轻松扩展和更新词典与转换规则。
链接: https://arxiv.org/abs/2605.27161
作者: Joro Ny Aina Ranaivoarison,Eric Laporte,Baholisoa Simone Ralalaoherivony
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the contruction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.
[NLP-26] LitSeg: Narrative-Aware Document Segmentation for Literary RAG
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因文档分割策略不当而导致的性能瓶颈问题,特别是在文学作品等长尾领域,现有方法通常忽视叙事结构,导致情节碎片化和指代不清,严重影响检索与生成效果。解决方案的关键在于提出LitSeg框架,该框架基于叙事理论引导的多阶段提示机制,显式提取事件、理清叙事线索、明确结构并定位转折点,从而实现语义连贯的文本分段;进一步为降低计算开销,引入轻量级的LitSeg-Lite模型,通过两阶段训练策略将复杂推理过程蒸馏为单次推理,显著提升效率且保持高质量分段。实验表明,该方法在检索准确率、上下文相关性及下游问答任务性能上均优于基线,验证了叙事指导与数据蒸馏的有效性。
链接: https://arxiv.org/abs/2605.27156
作者: Ruikang Zhang,Zhanni Chen,Yiqiao Cai,Qi Su
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.
[NLP-27] BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning
【速读】: 该论文试图解决的问题是如何通过迭代式提示策略(iterative prompting)绕过大型语言模型(Large Language Models, LLMs)的安全防护机制,从而诱导其输出有害或违反政策的内容。解决方案的关键在于提出了一种名为BAIT(Boundary-Aware Iterative Trap)的三步攻击框架:首先引导模型识别其安全边界(protection boundary),其次要求模型细化该边界以增强一致性,最后请求具体示例。这一过程利用了模型自身的推理逻辑和对一致性的偏好,将原本用于防御的机制转化为信息泄露路径,从而在多个基准测试(如AdvBench、JailbreakBench等)中显著优于传统越狱基线方法。
链接: https://arxiv.org/abs/2605.27110
作者: Xuan Luo,Yue Wang,Geng Tu,Jing Li,Ruifeng Xu
机构: The Harbin Institute of Technology, Shenzhen, China; The Hong Kong Polytechnic University, Hong Kong, China; Shenzhen University, Shenzhen, China; Shenzhen Loop Area Institute, Shenzhen, China
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:In this work, we propose BAIT (Boundary-Aware Iterative Trap), a three-step jailbreak framework that approaches malicious goals through internal disclosure. BAIT first asks the model to identify the protection boundary, then requires it to refine that boundary, and finally requests a detailed example. By expanding each step upon the model’s previous responses, BAIT turns the model’s own reasoning and consistency tendency into a disclosure pathway. Experiments on AdvBench, JailbreakBench, AIR-Bench, and SORRY-Bench demonstrate that BAIT consistently achieves strong attack success rates across top-tier large language models, significantly advancing conventional jailbreak baselines. Further analysis reveals that: 1) prevention-oriented framing significantly outperforms direct knowledge request; 2) the refinement step plays a critical role in disclosure escalation; and 3) the first two steps have a certain chance of eliciting harmful content while triggering little filtering.
[NLP-28] Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models
【速读】: 该论文试图解决的问题是:当前视频大语言模型(VideoLLMs)是否能够可靠地在时间维度上将主体(subject)与事件(event)进行关联,尤其是在存在干扰片段的情况下。其解决方案的关键在于引入了一个名为DistractionBench的评估基准,通过在长视频中插入短广告片段等受控干扰手段,系统性地测试模型对主体-事件关联的鲁棒性。研究发现,所有测试的11个主流VideoLLMs均表现出显著的“事件集合”(bag-of-events, BoE)行为——即模型将视频视为无时序关系的事件集合而非连续的时间序列,从而错误地将插入广告中的动作归因于主视频中的主体,暴露出其在时间定位(temporal grounding)机制上的严重不足。这一发现揭示了当前VideoLLMs在跨时间段主体-事件建模方面的根本缺陷,并呼吁开发具备更强时空关联能力的新一代模型。
链接: https://arxiv.org/abs/2605.27101
作者: Oscar Chew,Serhii Honcharenko,Qian-Hui Chen,Patricia Lu,Dishant Zaveri,Khoa D. Doan,Kuan-Hao Huang
机构: Texas A&M University (德州农工大学); National Taiwan University (台湾国立大学); Stanford University (斯坦福大学); VinUniversity (Vin大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
[NLP-29] MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverag e Risk Decomposition
【速读】: 该论文旨在解决开放型问答(open-ended question answering, QA)中生成式AI(Generative AI)幻觉问题,现有基于校准(conformal)的方法通常依赖于一个脆弱前提:有限采样必须至少产生一个可接受的候选答案,否则校准样本会被丢弃。其解决方案的关键在于提出MiRD(Two-Stage Framework),该框架将总体误覆盖(miscoverage)分解为采样失败和条件选择失败两部分:第一阶段建立在固定预算下采样无有效答案的概率的期望级边际上界,从而控制采样风险;第二阶段在采样成功条件下,利用全校准集上的与接纳相关联的非 conforming score 校准选择阈值,保持校准集完整性。实验表明,MiRD 在三个开放QA数据集和八种模型上均能有效控制采样风险、条件选择风险及整体误覆盖,且相比PAC类方法提供更紧的第一阶段边界,同时比仅使用成功样本的校准方式更具适应性。
链接: https://arxiv.org/abs/2605.27091
作者: Anqi Hu,Zhiyuan Wang,Zijun Jia,Bo Fu
机构: University of Electronic Science and Technology of China (电子科技大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
[NLP-30] LLM s Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
【速读】: 该论文试图解决的问题是:如何在无需强化学习(Reinforcement Learning, RL)训练和多GPU基础设施的情况下,实现大语言模型(Large Language Models, LLMs)在数学辅导场景中的教学对齐。传统方法依赖于复杂的RL训练流程,计算成本高且难以部署。解决方案的关键在于提出一种“训练-free prompt optimization”策略——仅通过API调用迭代优化系统提示(system prompt),结合12种方法(包括7种已有方法和5种教育专用方法),在两个外部数据集(Out-of-Distribution, OOD)上进行评估。结果表明,所有最佳配置均超越了最强的RL基线模型(R_total = 0.633),其中提出的ParetoGrad方法在后测解题率、知识泄露控制与助教性之间实现了最优帕累托平衡,而非单一指标最优。行为分析进一步揭示,训练-free方法更频繁地使用教学知识模式(2–3倍于RL模型),但意图层面的支架支持略低约10个百分点,且任务依赖的推理模式效应在两类方法中一致存在。该方案显著降低了开发成本,仅靠提示工程即可实现高效、可部署的教学对齐LLM辅导系统。
链接: https://arxiv.org/abs/2605.27088
作者: Unggi Lee,Minchul Shin,Yeil Jeong,Sookbun Lee,Jeongsu Moon,Kyungtae Joo,Eunjoo Lee,Hoilym Kwon
机构: Korea University Sejong Campus (韩国首尔大学世宗校区); Gyeonggi Institute of Education (京畿道教育研究所); Indiana University Bloomington (印第安纳大学布卢明顿分校); Opentutorials (Opentutorials); Chosun University (朝鲜大学); Korea University Korean Studies Center (韩国首尔大学韩国研究中心)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 5 figures
Abstract:Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization-evolving only the system prompt via API calls-can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating ~10 percentage-point reduction in intent-level scaffolding. We also find a task-dependent reasoning mode effect consistent across training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
[NLP-31] On the Hidden Costs of Counterfactual Knowledge Training in LLM Unlearning
【速读】: 该论文试图解决生成式 AI(Generative AI)中大型语言模型(Large Language Model, LLM)的“遗忘”问题,即如何有效移除模型中不希望保留的知识内容。当前主流的反事实微调(Counterfactual Tuning, CFT)方法虽具潜力,但存在性能不足的问题。其关键解决方案在于识别并诊断两个此前被忽视的核心缺陷:一是“知识冲突”(knowledge conflict),即反事实语料库内部的不一致性导致梯度冲突,干扰参数优化;二是“幻觉溢出”(hallucination spillover),即模型在拟合虚假目标时引入持续的虚构偏差,导致无关领域幻觉率上升。为系统性地分析这些问题,作者构建了扩展基准RWKU+,包含新的权衡指标与梯度级诊断工具,旨在推动更严谨、高效的LLM遗忘研究。
链接: https://arxiv.org/abs/2605.27083
作者: Xiaotian Ye,Xiaohan Wang,Mengqi Zhang,Shu Wu
机构: Beijing University of Posts and Telecommunications; Huazhong University of Science and Technology; Shandong University; NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Counterfactual tuning (CFT) has emerged as a promising paradigm for Large Language Model (LLM) unlearning by training models to generate alternative fictitious knowledge in place of undesired content. However, in this work, we find that this paradigm still underperforms other paradigms in some aspects, and identify two previously overlooked pitfalls underlying this gap: (1) knowledge conflict, where mutual inconsistencies within counterfactual corpora induce conflicting gradients that disrupt parameter optimization, and (2) hallucination spillover, where fitting false targets instills a persistent fabrication bias, inflating hallucination rates on unrelated domains. To systematically diagnose these issues, we introduce RWKU+, an extended benchmark equipped with novel trade-off metrics and gradient-level diagnostic tools. Our work further discusses the limitations and overhead of the paradigm, aiming to provide insights and actionable guidance for more rigorous LLM unlearning research.
[NLP-32] E3: Issue-Level Backtesting for Automated Research Critique
【速读】: 该论文试图解决的问题是:如何自动化识别学术论文中与技术决策相关的关键问题,以辅助审稿人和工程团队更高效地评估研究质量。现有方法在处理未充分支持的主张、缺失消融实验、弱基线、隐藏假设、效度威胁及数据泄露风险等方面存在不足,且人工评审易受主观性和时间限制影响。解决方案的关键在于提出E3——一个基于多源分析的自动化审稿助手,它不仅能定位每个技术关切的具体位置和性质,还能量化其对论文贡献的影响,并提供可验证的证据或分析路径来解决这些问题。通过采用无污染的“问题级回测”协议(issue-level backtesting),即仅使用训练截止日期之后的论文进行测试并由匿名化评审者作为元裁判(meta-judge)标注每项问题是否被发现,E3在100篇ICLR 2026论文上的评估中展现出显著优于人类审稿人及两个主流大语言模型(GPT-5.4和Claude-Opus-4-6)的召回率,尤其在包含部分发现的召回率达到90.2%,远超其他对比方法,证明了其在提升审稿效率与全面性方面的有效性。
链接: https://arxiv.org/abs/2605.27072
作者: Yashwardhan Chaudhuri,Sanyam Jain,Paridhi Mundra
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.
[NLP-33] FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions LREC2026
【速读】: 该论文试图解决欧洲葡萄牙语(European Portuguese, EP)在自动语音识别(ASR)系统中因缺乏大规模标注数据而导致性能不足的问题。当前可用的语音数据资源主要集中在巴西葡萄牙语(Brazilian Portuguese, BP),而EP由于使用者较少(约1100万),其ASR模型表现远逊于BP。解决方案的关键在于构建一个大规模、带说话人标注的欧洲葡萄牙语语音语料库——FalAR,该语料库包含约5800小时的议会会议语音数据,其中4850小时具有说话人身份标注(共1180名说话者,附带年龄、性别、政治派别和议会角色等元数据)。通过使用最先进的EP CAMÕES ASR模型进行转录与对齐,研究验证了该语料库在预训练中的有效性:实验表明,将FalAR作为预训练数据可使基线模型的词错误率(WER)相对降低高达14%。
链接: https://arxiv.org/abs/2605.27062
作者: Francisco Teixeira,Carlos Carvalho,Mariana Julião,Catarina Botelho,Rubén Solera-Ureña,Sérgio Paulo,Thomas Rolland,Ben Peters,Isabel Trancoso,Alberto Abad
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published in LREC2026
Abstract:State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialectal varieties. Due to having considerably fewer speakers (around 11 million), European Portuguese (EP) is overshadowed by Brazilian Portuguese (BP) (around 200 million speakers) in currently available large-scale speech data resources, resulting in under-performing speech-based systems for EP users. To address this gap, and following similar data collection efforts for other languages, we present FalAR, a large-scale, speaker-annotated speech corpus of European Portuguese parliamentary sessions. Spanning approximately 20 years, FalAR comprises 5,800 hours of speech data. In addition, 4,850 hours have speaker identity annotations, for a total of 1,180 speakers with associated metadata including age, gender, political affiliation, and parliamentary role. The corpus was built using a state-of-the-art EP CAMÕES ASR model for transcription-reference alignment. In this paper, we describe the data collection process, together with the main characteristics of the FalAR corpus. Furthermore, we evaluate the trade-off between data quantity and alignment accuracy on ASR performance, with our experiments demonstrating that incorporating FalAR as pre-training data yields up to 14% relative WER improvement over baseline models.
[NLP-34] BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation
【速读】: 该论文旨在解决低资源语言(如马拉地语)在神经机器翻译(NMT)中因高质量平行语料库稀缺而导致的性能瓶颈问题。其解决方案的关键在于构建了一个涵盖多个领域的大型、多样化且经过语言学增强的英-马拉地语平行语料库(BhashaSetu),包含278万句对,并引入词干化(stemming)和词形还原(lemmatization)以支持形态学敏感分析。实验表明,跨来源语料库去重(corpus-level deduplication)是提升下游翻译质量的最关键预处理步骤(去除后BLEU下降1.17、chrF++下降2.21),凸显了对低资源、形态丰富的语言进行低成本但高效益的语料库清理的重要性。
链接: https://arxiv.org/abs/2605.27050
作者: Param Thakkar,Anushka Yadav,Michael Tiemann,Abhi Mehta,Akshita Bhasin,Shrinivas Khedkar
机构: Veermata Jijabai Technological Institute, Mumbai (维姆塔·吉贾拜技术学院,孟买); Tübingen AI Center, University of Tübingen, Germany (蒂宾根人工智能中心,图宾根大学,德国)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We present BhashaSetu, a linguistically enriched English–Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
[NLP-35] ExTax: Explainable Disinformation Detection via Persuasion Emotion and Narrative Role Taxonomies
【速读】: 该论文试图解决的问题是:随着大语言模型(LLM)的普及,虚假信息的生成和传播变得更加流畅且难以识别,传统基于语法-语义验证的方法已不足以应对这类复杂欺骗行为。此类虚假信息往往不依赖单一表面错误,而是通过修辞说服、情感操控和叙事角色构建等多维度认知路径影响读者理解,而现有检测方法通常仅关注孤立信号(如语法、外部知识、说服力或情感线索),难以捕捉其多面向的操纵意图,也缺乏人类可解释的说明。
解决方案的关键在于提出一个名为ExTax的分类学对齐框架,该框架首次将说服修辞、情感操控与叙事角色统一纳入一个17维的分类空间(涵盖6种说服策略、5种情感操控方式和6类叙事角色),并通过熵驱动的动态标签平滑机制整合多个前沿LLM的输出差异,再利用异构多头注意力融合分类学表征与上下文编码,从而为每个预测提供可解释的操纵特征画像。实验表明,ExTax在五个跨领域、跨文体基准上达到0.8456的宏平均F₁分数,显著优于当前最优深度学习和LLM基线,并在极端文体不平衡场景下保持鲁棒性。
链接: https://arxiv.org/abs/2605.27045
作者: Shang Luo,Yingguang Yang,Zhenchen Sun,Yang Liu,Bin Chong,Jingru Chen,Yancheng Chen,Jiayu Liang,Kefu Xu,Hao Peng,Philip S. Yu
机构: Peking University (北京大学); University of Science and Technology of China (中国科学技术大学); North China University of Science and Technology (华北理工大学); Tsinghua University (清华大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); University of Chinese Academy of Sciences (中国科学院大学); Soochow University (苏州大学); Beihang University (北京航空航天大学); University of Illinois Chicago (芝加哥大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers’ interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals – such as syntax, external knowledge, persuasion, or affective cues – and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present \textbfExTax, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro F_1 of 0.8456 , outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from 0.9454 to 0.6194 .
[NLP-36] racing Computation Density in LLM s
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)虽然拥有数十亿参数,但其在处理不同输入时是否充分利用了全部计算能力尚不明确。为回答这一问题,作者提出了一种名为s-Trace的方法,用于高效估计能够最佳近似完整模型输出的大小为s的子图(subgraph)。该方法的关键在于识别出模型中对输出贡献最大的关键计算路径,从而揭示LLM内部计算的模块化组织结构:早期层中的稀疏子图主要捕捉到输出分布的头部信息(即高概率词),而后期层中逐渐增加的更密集节点(尤其是注意力头)则逐步细化输出分布。研究进一步发现,每输入所需的计算量与模型不确定性正相关,且稀疏子图编码的是浅层统计特征(如一元语法频率),这表明LLM的计算具有分阶段、模块化的特性——早期稀疏核心提供粗略预测,后期密集计算实现精细化修正。
链接: https://arxiv.org/abs/2605.27033
作者: Corentin Kervadec,Iuliia Lysova,Iuri Macocco,Marco Baroni,Gemma Boleda
机构: Universitat Pompeu Fabra; ICREA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.
[NLP-37] Share More Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
【速读】: 该论文试图解决的问题是:现有并行测试时缩放(Parallel Test-Time Scaling, TTS)方法在推理过程中各分支之间信息隔离,导致大量重复探索,从而降低搜索效率并增加达到正确答案所需的步骤数。解决方案的关键在于提出一种无需训练的推理框架——协同并行思维(Collaborative Parallel Thinking, CPT),其核心机制包括:从正在进行的分支中提取紧凑的中间信息,维护一个去重的查询级信息池,并将池中的条目通过输入上下文广播给所有分支,使每个分支能够在后续搜索步骤中复用其他分支已发现的信息,而非重复探索。实验表明,CPT在HMMT和AIME基准上显著优于强基线方法,在准确率与延迟之间建立了更优的帕累托前沿,验证了搜索阶段协作的有效性。
链接: https://arxiv.org/abs/2605.27030
作者: Xinglin Wang,Hao Lin,Shaoxiong Feng,Peiwen Yuan,Yiwei Li,Jiayi Shi,Yueqi Zhang,Chuyi Tan,Ji Zhang,Boyuan Pan,Yao Hu,Kan Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect complete decision information needed to reach correct answers. To bridge this gap, we propose \textbfCollaborative Parallel Thinking (CPT), a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy–latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.
[NLP-38] Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations
【速读】: 该论文试图解决大规模仇恨言论(hate speech)标注数据集构建中因标注成本高、主观性强及标注者间分歧大而导致的挑战。其解决方案的关键在于系统性地分析大型语言模型(LLMs)在十个理论基础明确的主观属性(如去人性化、暴力、情感倾向等)上与人类判断的一致性,发现所有测试模型均呈现一致的两分模式:行为显式维度(如侮辱、羞辱、攻击-防御)与人类标注高度相关,而评价性维度(如尊重、情感、仇恨言论)则出现系统性反转。基于此洞察,作者提出通过置信度加权的岭回归(confidence-weighted Ridge regression)融合各属性层面的LLM预测结果,重构来自Measuring Hate Speech语料库的连续仇恨言论评分,最终达到高达0.71的R²,显著优于直接提示基线,表明结构化的属性分解能比端到端标签预测更有效地恢复丰富且更贴近人类判断的信号。
链接: https://arxiv.org/abs/2605.27025
作者: Mohammad Amine Jradi,Faeze Ghorbanpour,Alexander Fraser
机构: TU Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving R^2 of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.
[NLP-39] Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中幻觉(hallucination)问题与不确定性估计(Uncertainty Estimation, UE)方法之间关系不明确的问题。当前许多研究将不确定性视为模型错误的代理指标,但缺乏系统性实证验证。论文的关键解决方案在于通过多维度、多基准的实验设计,对信息论类、采样类和反思类等多种不确定性估计器进行系统评估,区分内在幻觉(输入忠实性违反)与外在幻觉(训练数据支持不足),从而揭示不确定性与幻觉之间的关联并非稳定或强相关,其有效性高度依赖于幻觉类型及具体模型。这一发现挑战了将不确定性直接作为幻觉检测信号的传统做法,并明确了其在何种条件下具备可操作性。
链接: https://arxiv.org/abs/2605.27016
作者: Yedidia Agnimo,Anna Korba,Annabelle Blangero,Nicolas Chesneau,Karteek Alahari
机构: Ekimetrics
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 35 pages, 7 figures, 9 tables
Abstract:Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
[NLP-40] PersLitEval: Fine-grained Benchmark and Evaluation of LLM s on Persian Literature Questions
【速读】: 该论文试图解决大语言模型(LLMs)在非英语文学知识评估方面存在显著不足的问题,尤其是针对波斯语文学理解能力的系统性评测缺失。解决方案的关键在于构建PersLitEval——一个包含4,514道波斯语文学多选题的基准测试集,覆盖拼写、修辞手法、语法、词汇、词形变化和概念理解等八个细粒度类别,数据源自伊朗大学入学考试(Konkur)材料。通过在六种主流LLMs上测试十种提示策略,研究发现模型在概念相似性任务中表现较好,但在形式语言分析(如拼写和词形变化)上表现最弱;其中,带有解释的少样本提示策略效果最佳,尤其在形式语言类任务中。错误分析进一步识别出三种失败模式:语义理解缺口、形式语言知识缺口以及计数/枚举错误,表明不同类别需采用差异化优化策略以提升模型性能。
链接: https://arxiv.org/abs/2605.27015
作者: Ruhallah Niazi,Faeze Ghorbanpour,Alexander Fraser
机构: TU Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.
[NLP-41] Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning
【速读】: 该论文试图解决的问题是:在代码生成任务中,传统基于验证器的重复采样策略(即从单一答案分布中独立抽取K个样本)容易导致多次尝试集中在近似重复的推理路径上,造成计算资源浪费,尤其在竞赛编程场景下,由于存在多种不同的算法策略,这种冗余性尤为显著。解决方案的关键在于提出协同式Pass@K策略优化方法(CPPO),其核心创新是将生成过程建模为对多种高阶策略的联合探索:由一个规划器生成一组K=4个不同的高层方法(即策略),并由共享求解器为每种策略执行一次求解;训练时采用乘积形式的规划器奖励 $ R_\mathrm{plan} = J_\psi \cdot R_\mathrm{out} $,仅当整组策略组合最终通过验证器确认实现Pass@K成功时才给予信用分配,从而激励模型发现多样且有效的策略组合。实验证明,CPPO在APPS、CodeContests和LiveCodeBench-v6等多个基准上均显著优于直接采样、规划基线、仅规划微调(planner-only SFT)以及面向Pass@K的强化学习方法,在相同K=4的求解尝试预算下取得统计显著提升。
链接: https://arxiv.org/abs/2605.27000
作者: Yilong Li,Suman Banerjee,Tong Che
机构: University of Wisconsin–Madison; NVIDIA Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code reasoning; pass@K optimization; coordinated planning; verifiable rewards; strategy diversity
Abstract:Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@ K as the canonical metric. Yet the standard policy class draws K independent samples from a single answer distribution, so attempts often collapse onto near-duplicate reasoning paths and waste the budget on redundant rollouts. This failure is costly in competitive programming, where many problems admit multiple distinct algorithmic strategies and pass@ K requires only one correct attempt. We propose Coordinated Pass@ K Policy Optimization (CPPO), which turns pass@ K generation into joint exploration over strategies: a planner emits a tuple of K=4 alternative high-level methods, and a shared solver attempts one solution per method. CPPO trains this joint policy with a multiplicative planner reward, R_\mathrmplan = J_\psi \cdot R_\mathrmout , assigning credit only to valid strategy tuples that lead to verifier-confirmed pass@ K success. Across APPS, CodeContests, and LiveCodeBench-v6, CPPO improves pass@ 4 over direct sampling, planning baselines, planner-only SFT, and pass@ K -oriented RL under the same K=4 solver-attempt budget, with statistically significant gains on six of nine model–benchmark cells. The largest single gain is +0.16 on Qwen3.5-9B LiveCodeBench-v6 over the strongest baseline, PKPO ( 0.588 \rightarrow 0.748 ; paired bootstrap, p 0.05 ).
[NLP-42] Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际部署中面临的关键安全威胁——提示注入(Prompt Injection)的检测问题。现有检测方法多在受限实验环境下评估,难以反映真实应用场景中的复杂性和约束条件。论文提出了一种面向部署场景的系统性评估框架,涵盖多模型、多部署模式(regime)下的对比实验,包括词法、语义、结构化和基于Transformer的检测器,并在多个分布外设置、重复数据划分以及排名与阈值化指标下进行测试。其关键解决方案在于引入可解释的结构信号(interpretable structural signals),用于捕捉层级覆盖、系统提示伪造、角色重定义及规避模式等典型攻击行为,并验证其在稀疏模型中及与强编码器基线结合时的有效性。结果表明:检测性能高度依赖于具体部署环境和阈值选择,无单一模型在所有条件下最优;Transformer模型整体表现最强,而结构信号虽提升幅度有限,但在特定场景下显著改善低误报率行为,凸显了从排名性能到实际部署效果之间的重要差距,强调了在真实操作约束下评估防御机制的必要性。
链接: https://arxiv.org/abs/2605.26999
作者: Akindoyin Akinrele,Shreyank N Gowda
机构: University of Nottingham(诺丁汉大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.
[NLP-43] PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
【速读】: 该论文试图解决低资源非拉丁字母语言文本到语音(TTS)系统评估中仅依赖单一自动语音识别(ASR)回路词错误率(WER)所导致的评估失真问题。传统方法无法区分合成音频缺失、误识别为邻近语言、仅在ASR转录中保留目标文字脚本,或语音对母语者不自然等不同故障类型,从而误导模型性能判断。解决方案的关键在于提出INVS(Intelligibility, Naturalness, Script fidelity, and Verification)评估框架,将上述多维问题解耦并独立量化;本文进一步发布其自动化筛选子集INSV-A,包含合成完成度、ASR WER/CER、转录脚本保真度率(Script Fidelity Rate)以及语音语言识别(LID)四项指标,以客观反映TTS系统的实际表现。研究通过构建PashtoTTS-Bench基准对5个TTS系统进行评估,结果显示OmniVoice auto在WER上最优(FLEURS: 24.1%,Common Voice 24: 27.4%),但更低的WER不代表优于真人语音,且Whisper Large V3无法识别Pashto语音,而MMS-LID-4017和SpeechBrain VoxLingua107可有效区分Pashto与乌尔都语输出,验证了INVS-A框架的有效性和必要性。
链接: https://arxiv.org/abs/2605.26978
作者: Hanif Rahman
机构: Independent Researcher
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:
Abstract:Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
[NLP-44] Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling
【速读】: 该论文旨在解决用户建模(user modeling)中生成的推理轨迹(reasoning traces)缺乏真实因果机制的问题。传统方法通过后验理性化(post-hoc rationalization)合成推理过程,即在已知上下文和动作的前提下生成解释性文本,但这种做法仅保证推理能为动作提供合理性说明,而无法反映个体决策背后的潜在因果路径。论文提出Recon方法,其核心创新在于利用动作重建(action reconstruction)来评估推理轨迹的质量:给定上下文和候选推理,一个重建模型预测动作,推理质量由重建结果与实际动作的一致性决定。实验表明,Recon在四个领域中相比标准后验理性化基线(Backward Synthesis)提升54.7%胜率;进一步地,使用Recon评分作为奖励训练推理合成模型,可使下游用户建模性能提升至70.0%胜率。此外,Recon生成的推理具有跨模型迁移能力,并能提升超出重建模型本身的用户建模效果。研究证明,单纯依赖后验理性化不足以实现高质量推理合成,真正有用的推理应能从上下文中自然推导出动作。
链接: https://arxiv.org/abs/2605.26969
作者: Alan Zhu,Mihran Miroyan,Carolyn Wang,Andrew Zhou,Lisa Dunlap,Narges Norouzi,Joseph E. Gonzalez
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:User modeling aims to use language models (LMs) to mimic an individual’s behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.
[NLP-45] MerLean-Prover: A Recursive Looping Harness for End-to-End Lean 4 Theorem Proving
【速读】: 该论文试图解决的问题是:如何在不依赖微调、定制强化学习(Reinforcement Learning, RL)目标或定理特定辅助结构的情况下,构建一个端到端的Lean4定理证明系统,以自动替换代码中的“sorry”声明并生成可内核验证的证明。其解决方案的关键在于设计了一个由三种代理类型(规划Agent、校验Agent和Lean Agent)组成的递归外层循环架构,以证明计划(proof plan)为最小修订单元,从而实现高效、可扩展且无需额外训练的自动化推理流程。实验表明,该框架在FormalQualBench和Putnam2025等基准上均显著优于现有开源基线,说明合理的系统架构设计在端到端定理证明中具有核心作用。
链接: https://arxiv.org/abs/2605.26959
作者: Jinzheng Li,Zeru Zhu,Yuanjie Ren
机构: 未知
类目: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
备注:
Abstract:MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
[NLP-46] ournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
【速读】: 该论文试图解决开放域长文本生成中强化学习(Reinforcement Learning, RL)面临的奖励信号设计难题,尤其是在缺乏可靠参考答案和自动评估指标的情况下。现有基于评分量表(rubric-based)的方法通常依赖于点对点的大语言模型作为裁判(LLM-as-a-judge)打分,但此类绝对分数难以在复杂响应间校准、对同查询生成结果的区分度弱,且在优化过程中易饱和。解决方案的关键在于提出一种群体式奖励框架——Tournament-GRPO,其核心思想是通过多轮锦标赛机制将基于规则的LLM评判转化为相对奖励:在同一查询下的多个生成候选之间进行群体内比较,累积比赛结果并归一化为群体级奖励用于GRPO训练。实验表明,Tournament-GRPO在Deep Research Bench上显著优于现有基线方法,整体得分提升达4.52分;分析进一步揭示了锦标赛奖励在效果与效率之间具有更优权衡,且锦标赛设计直接影响训练动态,验证了基于规则的相对比较是一种有效的RL奖励信号来源。
链接: https://arxiv.org/abs/2605.26958
作者: Zixuan Yang,Yiqun Chen,Wei Yang,Erhan Zhang,Zihan Shen,Xiaochi Wei,Yan Gao,Yi Wu,Yao Hu,Jiaxin Mao
机构: Renmin University of China (中国人民大学); University of Southern California (南加州大学); Zhejiang University (浙江大学); Xiaohongshu Inc. (小红书)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness–efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
[NLP-47] LELA: An End-to-end LLM -based Entity Linking Framework with Zero-shot Domain Adaptation
【速读】: 该论文旨在解决现有实体链接(Entity Linking)方法依赖特定知识库和领域、限制实际应用的问题。其解决方案的关键在于扩展LELA(一种模块化且领域无关的基于大语言模型的实体消歧方法),构建一个完整的Python工具库,集成零样本命名实体识别(Zero-shot Named Entity Recognition, Zero-shot NER),从而实现端到端的实体链接流程,适用于真实场景中的多样化应用。
链接: https://arxiv.org/abs/2605.26956
作者: Samy Haffoudhi(IP Paris, LTCI, DIG),Nikola Dobričić(IP Paris),Fabian Suchanek(IP Paris, LTCI),Nils Holzenberger
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to the specific target knowledge bases and domains, limiting their real world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER) -thereby providing a complete end-toend pipeline for entity-linking in real-world usage. We provide experimental results validating LELA’s performance and robustness across diverse entity linking settings. In our demo, users can play with the system on their own input texts.
[NLP-48] JuICE: A Benchmark for Evaluating LLM -Judge in Identifying Cultural Errors
【速读】: 该论文试图解决的问题是:当前大型语言模型(LLM)在跨文化场景中生成内容时,常因缺乏对“深层文化”(thick culture)的理解而产生看似合理但不符合当地文化语境的错误,而现有文化评估基准多采用事实核查或规范蕴含方法,未能有效识别这类文化偏差。解决方案的关键在于提出 JuICE(Benchmark for LLM-Judge in Identifying Cultural Errors),这是一个包含7,470个跨度级标注的文化与语言错误数据集,覆盖美国、韩国、印度尼西亚和孟加拉国四个国家的1,050组问答对(英文及本地语言),用于系统性评估LLM作为评判者(LLM-as-a-Judge)在检测文化错误上的能力。实验表明,即使最强的LLM-judge在错误定位任务中F1仅为0.52,且普遍遗漏本地居民能轻易识别的深层文化错误,这凸显了未来文化评估需从表面特征转向对文化意义深度与情境依赖性的建模。
链接: https://arxiv.org/abs/2605.26955
作者: Jiho Jin,Junho Myung,Juhyun Oh,Junyeong Park,Rifki Afina Putri,Sunipa Dev,Vinodkumar Prabhakaran,Alice Oh
机构: KAIST (韩国科学技术院); Google (谷歌); Universitas Gadjah Mada (印尼加查马达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries’ main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.
[NLP-49] AlbanianLLM Safety: A Safety Evaluation Dataset for Large Language Models in Albanian LREC2026
【速读】: 该论文试图解决的问题是:大型语言模型(LLM)的安全性评估主要集中在高资源语言上,而低资源语言(如阿尔巴尼亚语)在这一领域严重缺乏支持。解决方案的关键在于构建并发布首个公开可用的阿尔巴尼亚语LLM安全评估数据集——AlbanianLLMSafety,该数据集包含2,951个跨11个安全类别的提示(如自残、暴力、种族主义内容、儿童剥削和极端化等),每个提示均配有阿尔巴尼亚语原文、英文参考翻译及详细类别标签,从而填补了低资源语言安全评估基础设施的空白,并为开发更安全、更具包容性的LLM提供了关键基准。
链接: https://arxiv.org/abs/2605.26954
作者: Wajdi Zaghouani,Kholoud K. Aldous,Isra Fejzullaj
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at SIGUL2026 Workshop co-located with LREC2026
Abstract:Safety evaluation of Large Language Models (LLMs) has largely focused on high-resource languages, leaving low-resource languages critically underserved. We present AlbanianLLMSafety, the first publicly available safety evaluation dataset for LLMs in Albanian, a linguistically distinct low-resource language with approximately 7.5 million speakers across Albania, Kosovo, North Macedonia, and the diaspora. The dataset contains 2,951 prompts spanning 11 safety categories, including self-harm, violence, racist content, child exploitation, and radicalization, with an average of 268 prompts per category. Each prompt is provided in Albanian with an English reference translation and a detailed category label. This resource addresses a significant gap in safety evaluation infrastruc-ture for low-resource languages and provides an essential benchmark for developing safer, more inclusive LLMs. The dataset will be provided upon request to support safety evaluation, fine-tuning, red-teaming, and guardrail development for Albanian-speaking communities.
[NLP-50] Efficient Agent ic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement
【速读】: 该论文试图解决的问题是:在基于大语言模型(LLM)的智能体进行强化学习(Reinforcement Learning, RL)训练时,模型会逐渐产生冗余的工具调用,并模糊其内在知识边界,即无法准确判断何时应使用外部工具、何时仅依赖参数化知识即可。现有基于奖励塑造的方法因优化目标过于粗粒度,倾向于强制抑制所有工具调用,从而引发奖励黑客(reward hacking)问题。解决方案的关键在于提出AKBE(Agentic Knowledge Boundary Enhancement),一种基于策略的训练方法,通过在训练过程中动态探测模型的知识边界——即每个实例中是否需要工具及所需的最小工具调用次数——利用双路径(带工具与不带工具)轨迹滚动对比来分类轨迹并构建针对性的监督信号,进而引导高效且精准的工具使用模式。这些信号可无缝集成到原有RL训练循环中,在7个问答基准测试中平均提升任务准确率1.85点、减少18%工具调用,同时实现25%更高的工具利用率,且无精度-效率权衡。
链接: https://arxiv.org/abs/2605.26952
作者: Dingwei Chen,Zefang Zong,Zhipeng Ma,Leo Luo,Yang Li,Chengming Li,Peng Chen,Jie Jiang
机构: Tencent Inc; The Chinese University of Hong Kong; Shenzhen MSU-BIT University
类目: Computation and Language (cs.CL)
备注:
Abstract:Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model’s intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model’s intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at this https URL.
[NLP-51] KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models LREC2026
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)的安全性评估资源严重缺乏对哈萨克语(Kazakh)等低资源语言的支持,导致其在非英语场景下的安全行为难以被准确衡量。为填补这一空白,作者提出了 KZ-SafetyPrompts 数据集,这是一个涵盖十一类常见风险领域的哈萨克语提示数据集,包括自残、暴力、儿童剥削、色情内容、种族主义内容、极端主义以及受监管商品或非法活动等。解决方案的关键在于:构建了一个包含 5,717 条以哈萨克语(西里尔字母书写)原生编写的提示样本的数据集,每条提示均模拟真实用户查询(常以青少年或儿童口吻表达),且不包含具体操作指令;同时提供英文翻译以便跨语言分析,并通过标准化标注流程、边界案例判定规则和质量控制步骤(如模式统一、完整性检查与去重)确保数据可靠性。此外,该数据集的分类体系与主流安全分类法对齐,便于集成至现有评估流程中,从而揭示仅用英语评估无法捕捉的哈萨克语特定安全漏洞——例如基线测试显示 GPT-4o 在不同类别上的拒绝率从 5.5% 到 53.8% 不等。
链接: https://arxiv.org/abs/2605.26947
作者: Wajdi Zaghouani,Shimaa Amer Ibrahim,Aruzhan Muratbek,Olzhasbek Zhakenov,Adiya Akhmetzhanova
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the SIGUL2026 Workshop co-located with LREC2026
Abstract:Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, child exploitation, sexual content, racist content, radicalization, and regulated goods or illegal activities. The dataset contains 5,717 prompts written natively in Kazakh (Cyrillic), organized by category, with English translations for cross-lingual analysis. Prompts resemble realistic user queries, often in a teen or child style, and are phrased as intent prompts without procedural instructions. We document the writing protocol, labeling procedures (including borderline-case decision rules), and quality-control steps (schema standardization, completeness checks, and deduplication). We also align the categories with widely used safety taxonomies to support integration with existing evaluation pipelines. Baseline results with GPT-4o show an overall refusal rate of 28.2%, varying from 5.5% to 53.8% across categories, indicating that Kazakh prompts expose category-specific safety gaps not captured by English-only evaluation.
[NLP-52] Accountable Human-AI Deliberation with LLM s: Scaling Collective Intelligence through Symbiotic Scaffolding LREC2026
【速读】: 该论文试图解决的问题是:如何在大规模民主协商中利用大语言模型(LLM)提升集体智能,同时避免纯AI中介导致的多元性丧失、过度追求共识以及参与者代表性合法性受损等问题。解决方案的关键在于提出一个“人-AI共生框架”,包含三个层次:观测与多样性增强层、带条款级溯源的引导层,以及人类优先的确认层;其核心创新包括基于显著性加权的覆盖率、多样性与擦除度量指标,结合交叉编码器相似性与因果消融诊断的溯源管道,偏好条件下的权衡控制机制,公平意识的可 contestability 流程,对抗鲁棒性测试,以及基于LLM作为裁判局限性的消融设计评估协议。这一框架旨在实现可扩展的集体智能,同时保障参与者的自主权与协商正当性。
链接: https://arxiv.org/abs/2605.26940
作者: Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology
Abstract:Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.
[NLP-53] Beyond Questions: Evaluating What Large Language Models (Actually) Know
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)中参数化知识(Parametric Knowledge)评估中存在的“可用性偏差”(Availability Bias)问题,即现有知识评测基准依赖于设计者预先设定的问题(如“M.L. King的出生日期是什么?”),仅能衡量模型对特定预设问题的回答能力,而无法全面反映其自发表达的知识广度与深度。解决方案的关键在于提出一种新的评估范式——开放知识评估(Open Knowledge Evaluation),通过开放式诱导提示(如“告诉我你所知道的所有关于M.L. King的信息”)让模型自主选择并呈现其所掌握的知识,从而更真实地刻画模型自然表达的知识内容。作者进一步构建了BeQu(Beyond Questions)基准,包含10,000个实体及其参考语料库用于陈述验证,并系统分析了推理努力程度、模型规模、提示格式和知识领域等因素对模型知识表现的影响。
链接: https://arxiv.org/abs/2605.26937
作者: Luca Giordano,Simon Razniewski
机构: ScaDS.AI Dresden/Leipzig; TU Dresden, Germany
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., “What is the birth date of M.L. King?”), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., “Tell me everything you know about M.L. King”). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work’s GitHub repository and at the benchmark’s website. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26937 [cs.CL] (or arXiv:2605.26937v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.26937 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-54] DunbaaBERT: From Sacrifice to Semantics
【速读】: 该论文试图解决的问题是:乌尔都语(Urdu)在自然语言处理(NLP)领域资源匮乏且评估设置碎片化,导致相关模型性能发展滞后。解决方案的关键在于提出 DunbaaBERT,一个从头训练的乌尔都语 RoBERTa-base 模型家族,其使用 Byte-BPE 词表(32k、52k 和 96k 个 token)在去重后的 17GB 乌尔都语语料库上进行训练。通过在多个内在和下游任务(包括语言可接受性、新闻分类、有害语言检测和情感分析)上的系统评估,研究发现尽管模型规模相对紧凑,但精心构建的乌尔都语专用编码器仍能保持竞争力;尤其值得注意的是,更大的词表并不一定提升下游效果,其中 32k 词表版本在整体效率表现上最为突出。
链接: https://arxiv.org/abs/2605.26935
作者: Iffat Maab,Waleed Jamil,Raphael Schmitt
机构: Research and Development Center for Large Language Models (LLMC), National Institute of Informatics, Tokyo; Independent Researcher, Edinburgh, United Kingdom; School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Germany
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT _\text32k repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.
[NLP-55] Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks
【速读】: 该论文试图解决现有强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在训练推理模型时存在的局限性问题,即当前研究对推理空间的理解过于单一:仅将难度视为推理深度,且奖励机制局限于前向演绎状态追踪。为此,作者提出从两个维度重新刻画推理空间——一是引入环境复杂度(environment complexity),要求模型在干扰项和交互结构中识别正确路径;二是扩展奖励形式,涵盖四种真实世界推理能力:演绎状态追踪、溯因恢复隐藏事件或事实、归纳规则提取以及类比迁移。解决方案的关键在于构建一个受控的知识图谱合成环境,其中每个实例在推理深度、环境复杂度和任务类型上独立变化,从而实现对这些因素的解耦分析。实验发现:联合覆盖深度与复杂度的策略优于单一维度训练;不同推理类型响应不均,溯因推理在RL覆盖范围外显著退化,且任务间相关性呈现演绎-溯因与归纳-类比两组聚类;固定预算下均匀混合优于分阶段课程训练。此外,近期现成模型也表现出相同的演绎过强而溯因不足的不对称现象,说明该差距并非仅由实验设计引起。
链接: https://arxiv.org/abs/2605.26934
作者: Yihua Zhu,Qianying Liu,Fei Cheng,Jiaxin Wang,Akiko Aizawa,Sadao Kurohashi,Hidetoshi Shimodaira
机构: Kyoto University (京都大学); University of Tokyo (东京大学); NII LLMC (日本国立信息研究所语言模型与认知研究中心); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Pre-print
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. Difficulty. Beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. Rewarded reasoning form. We consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.
[NLP-56] Learning to Adapt SFT Data for Better Reasoning Generalization
【速读】: 该论文试图解决的问题是:在使用监督微调(Supervised Fine-Tuning, SFT)提升大语言模型(Large Language Models, LLMs)推理能力时,若外部专家数据的分布与目标模型自身分布不一致,直接微调会导致模型泛化性能下降。解决方案的关键在于提出一种名为“推理微调的数据适配”(Data Adaptation for Reasoning Tuning, DART)的新方法,其核心思想是将固定但可能分布偏移的SFT数据视为一个优化问题,通过强化学习训练一个映射模型(mapper model),将原始SFT数据转换为更贴合目标模型分布和学习偏好(learning preferences)的适应性监督信号。该转换后的数据用于后续SFT,从而显著提升模型的泛化能力和训练效率,并在多个模型和数据集上验证了优于标准SFT的效果。
链接: https://arxiv.org/abs/2605.26924
作者: Lisong Sun,Li Wang,Chen Zhang,Jinyang Wu,Kui Zhang,Tianhao Peng,Wenjun Wu
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学); Nanyang Technological University (南洋理工大学); Hangzhou International Innovation Institute (杭州国际创新研究院); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算高精尖创新中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model’s own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model’s distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at this https URL.
[NLP-57] Are Video Models Zero-Shot Learners and Reason ers in Education? EduVideoBench A Knowledge-Skills-Attitude Benchmark for Educational Video Generation
【速读】: 该论文试图解决的问题是:当前视频生成模型(Video Generation Models, VGMs)虽已进入教育场景,但现有评估基准仅关注感知质量、内在真实性、通用安全性或视频作为推理媒介的能力,缺乏对输出内容是否具备教育有效性的系统评估。解决方案的关键在于提出EduVideoBench——首个基于知识-技能-态度(Knowledge-Skills-Attitude, KSA)框架的平衡教育领域基准,将教学适切性与教育安全性联合评估,而非将其视为孤立的质量维度。实证结果表明,当前五种前沿VGM在知识、技能和态度三个维度上均存在显著改进空间;此外,专家定性分析揭示教育有效性具有多维特性,任一环节(如节奏、可读性或符号使用)的偏差即可导致整个视频失效,从而强调了教育合理性评估的复杂性与必要性。
链接: https://arxiv.org/abs/2605.26918
作者: Unggi Lee,Hoyoung Ahn,Yoon Choi,Seonmin Eun,Jahyun Jeong,Seonmin Jin,Harmony Jung,Hye Jin Kim,Chaerin Lee,Hyunji Lee,Jeongjin Lee,Soohwan Lee,Young-Seok Oh,Jaehyeon Park,Sun-ok Ryu,Sunyoung Shin,Yoorim Son,Haeun Park,Yeil Jeong
机构: Korea University Sejong Campus; Cardiff Metropolitan University; Seoul National University; Bugil Academy; Gyeonggi Provincial Office of Education; Loughborough University; Korea National University of Education; Korea University; Korean Educational Development Institute; Sungshin Women’s University; Seoul National University of Education; Korea Institute for Curriculum and Evaluation; Indiana University Bloomington
类目: Computation and Language (cs.CL)
备注:
Abstract:Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.
[NLP-58] GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在链式思维(Chain-of-Thought, CoT)推理中普遍存在的后验合理化(post-hoc rationalization)问题,即模型生成看似合理但不忠实于真实推理过程的推理链,导致结果不可信。解决方案的关键在于提出GeoFaith——一个基于潜在几何结构和熵动态的时空框架,用于诊断与强化推理的忠实性(faithfulness)。其核心创新包括:1)开发可扩展的自举流水线,将步骤级标注从1k样本扩展至20k样本;2)训练一个8B参数的忠实性检测器,在标准基准上超越GPT-5;3)设计一种忠实性感知的强化学习框架,联合优化结果正确性、过程忠实性和轨迹一致性。实验表明,该方法在忠实性检测和下游推理任务中均表现更优,且生成更短、更可解释的推理链而不牺牲准确性。
链接: https://arxiv.org/abs/2605.26893
作者: Weijiang Lv,Wentong Zhao,Jiayu Wang,Yuhao Wu,Jiaheng Wei,Xiaobo Xia
机构: Xidian University; Xi’an Jiaotong University; Mohamed bin Zayed University of Artificial Intelligence; The Hong Kong University of Science and Technology (Guangzhou); University of Science and Technology of China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.
[NLP-59] nor Nordics Customer Service self-help corpus
【速读】: 该论文旨在解决北欧语言(芬兰语、丹麦语、挪威语和瑞典语)在客户服务中心领域缺乏高质量、多语言、可公开获取的语料库的问题,这一问题限制了检索增强生成(Retrieval-Augmented Generation, RAG)、跨语言迁移学习及基于代理的服务架构等前沿自然语言处理技术的研究与应用。解决方案的关键在于构建一个包含1,122份人工验证文档的多语言自助服务语料库,这些文档来自四家北欧电信运营商的公开自助页面,并通过大语言模型(LLM)与人工标注相结合的管道进行去标识化和相关性筛选,从而确保数据质量与实用性;同时,该语料库覆盖网络硬件、移动服务、电视与流媒体、账单及账户管理等多个主题,具有广泛的应用价值和可复现性。
链接: https://arxiv.org/abs/2605.26891
作者: Mike Riess
机构: Telenor Group (特伦诺集团)
类目: Computation and Language (cs.CL)
备注: 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: this https URL
Abstract:This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at this https URL, intended to support reproducible research in Nordic NLP and information retrieval.
[NLP-60] he Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)训练中依赖教师生成监督信号时存在的关键问题:当前实践中通常选择测试性能最强的教师模型来生成学生模型的训练数据,隐含假设是教师的测试表现可直接反映其教学质量。然而,作者指出这一假设可能失效——即使多个教师对同一问题给出正确答案,最强教师的答案未必对学生最具教学价值。解决方案的关键在于提出一种名为“以学生为中心的答案采样”(Student-Centric Answer Sampling, SCAS)的框架,该框架基于教师生成答案中估计的学生学习成本(即学生在训练过程中从某答案中获益的程度),通过一个基于token级梯度分解推导出的高效前向代理指标来指导答案选择。实验表明,在30个教师模型、6个学生基础模型和8项任务上的结果均显示SCAS能稳定提升学生模型性能,证明有效的知识蒸馏应优先匹配当前学生的状态,而非单纯依赖教师的强度。
链接: https://arxiv.org/abs/2605.26872
作者: Zhengyu Hu,Zheyuan Xiao,Linxin Song,Fengqing Jiang,Yutai Li,Zhengyu Chen,Zhihan Xiong,Yue Liu,Junhao Lin,Yao Su,Lijie Hu,Kaize Ding,Xiao Teng,Radha Poovendran
机构: University of Washington; University of Texas at Austin; University of Southern California; Independent Researcher; National University of Singapore; Microsoft; Google; Mohamed bin Zayed University of Artificial Intelligence; Northwestern University; Allen Institute for AI (AI2)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 8 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.
[NLP-61] Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning
【速读】: 该论文试图解决的问题是:在语言模型推理中,对所有问题均匀分配采样计算资源(sampling budget)效率低下,导致简单问题被过度采样而困难问题仍缺乏充分探索。其解决方案的关键在于提出一种无需额外推理成本的不确定性感知预算分配机制(Uncertainty-Aware Budget Allocation, UAB),该机制基于每道题目的不确定性估计(通过输出概率直接计算的平均负对数似然,ANLL)进行动态重分配。UAB分为两个阶段:第一阶段为每个问题生成一个响应并获取难度信号;第二阶段利用边际贪心算法精确求解一个凹覆盖最大化代理目标,将剩余预算优先分配给不确定性高的问题,从而提升整体准确率,尤其在低资源场景下效果显著,且无需辅助模型或额外大语言模型调用。
链接: https://arxiv.org/abs/2605.26849
作者: Manh Nguyen,Sunil Gupta,Hung Le
机构: Deakin University, Australia
类目: Computation and Language (cs.CL)
备注:
Abstract:Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at this https URL.
[NLP-62] MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
【速读】: 该论文试图解决的问题是:当前基于一阶优化的算法(如Muon)在大规模语言模型训练中容易陷入尖锐局部极小值(sharp local minima),从而限制了模型收敛质量和下游任务性能。其解决方案的关键在于提出MONA优化器,它将Muon的正交化框架与曲率感知加速机制相结合——通过在Muon的梯度处理流程中引入一个加速项,该加速项由梯度差的指数移动平均计算得到,从而实现从尖锐极小值中逃逸的同时保持Muon原有的谱范数正则化特性。理论分析和实验结果表明,MONA在多个规模的Mixture-of-Experts(MoE)预训练任务中均优于Muon和AdamW,并在68B参数模型上实现了最先进的性能表现。
链接: https://arxiv.org/abs/2605.26842
作者: Jiacheng Li,Jianchao Tan,Hongtao Xu,Jiaqi Zhang,Yifan Lu,Yerui Sun,Yuchen Xie,Xunliang Cai
机构: Meituan(美团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon’s orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon’s gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon’s spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.
[NLP-63] Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics EMNLP2025
【速读】: 该论文试图解决语言模型在事实一致性摘要生成任务中因评估指标不完善而导致性能受限的问题。现有评估指标在捕捉多样化事实性错误方面能力不足,难以有效引导模型优化。解决方案的关键在于利用多个弱指标(weak metrics)的组合来更全面地识别事实性偏差,而非依赖单一指标。作者提出了一种自动化训练流程,通过聚合不同弱指标的得分构建偏好数据集,并基于指标间分歧程度过滤低质量样本,从而避免复杂的奖励建模。此外,通过对同一源文档采用不同解码策略生成语义相似但事实不同的摘要对,使模型能够从细微的词汇差异中学习事实性差异。实验表明,该方法在多种架构(从早期编码器-解码器模型到现代大语言模型)上均能稳定提升事实一致性,且小模型也可达到与大模型相当的性能。
链接: https://arxiv.org/abs/2605.26840
作者: Yuxuan Ye,Raul Santos-Rodriguez,Edwin Simpson
机构: 未知
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Findings
Abstract:Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model this http URL individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source this http URL demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.
[NLP-64] ContextGuard: Structured Self-Auditing for Context Learning in Language Models
【速读】: 该论文试图解决的问题是:尽管大语言模型(LLMs)具备强大的推理能力,但在应用复杂上下文知识时仍难以保持忠实性。具体表现为,在富含上下文的任务中,模型可能正确执行核心推理路径,却忽略边缘性、持续存在或格式敏感的要求,导致任务失败。解决方案的关键在于识别并强化模型对这些“非核心但关键”约束的理解与遵循能力,从而提升其在真实场景中的可靠性和鲁棒性。
链接: https://arxiv.org/abs/2605.26827
作者: Hongbo Jin,Chi Wang,Haoran Tang,Zhongjing Du,Xu Jiang,Jingqi Tian,Qiaoman Zhang,Jiayu Ding
机构: Peking University (北京大学); SCUT (华南理工大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.
[NLP-65] Generating Logically Consistent Synthetic Supply Chain Data with LLM -Driven Knowledge Graph Reasoning
【速读】: 该论文旨在解决合成数据在供应链分析中面临的两个核心问题:数据稀缺性和数据隐私性,同时确保生成的合成数据不仅在统计分布上与真实数据一致,还能保留供应链流程中的操作逻辑(即“供应链数据的物理规律”),包括时间顺序、数学依赖关系、层级分类结构和条件规则等。现有表格生成模型主要优化统计保真度和下游预测性能,常导致生成记录虽表面合理但违反基本操作约束。解决方案的关键在于提出一种基于知识图谱引导的框架TabKG,其核心创新是构建列关系知识图(CR-KG)以显式建模数据间的操作依赖关系;通过多大语言模型(LLM)集成与多数投票机制从列元数据中提取候选关系,并利用真实数据验证去除幻觉或不支持的边;随后,TabKG将原表压缩为独立列,使用潜在扩散模型生成这些列,并依据验证后的CR-KG确定性重建依赖列,从而在生成过程中强制实现逻辑一致性。
链接: https://arxiv.org/abs/2605.26823
作者: Yunbo Long,Ge Zheng,Liming Xu,Alexandra Brintrup
机构: University of Cambridge (剑桥大学); The Alan Turing Institute (艾伦图灵研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Synthetic data offers a promising solution to two persistent barriers in supply chain analytics: data scarcity and data privacy. However, for synthetic data to support operational simulation and decision-making, it must do more than reproduce the statistical distributions of real records, and also preserve the \emphoperational logic that governs supply chain processes, including the temporal orderings, mathematical dependencies, hierarchical taxonomies, and conditional rules that make a record operationally plausible. We consider this logic as the ``physics’’ of supply chain data. Existing tabular generative models are primarily optimized for distributional fidelity and downstream predictive utility, and therefore often generate records that appear statistically realistic but violate fundamental operational constraints. This paper introduces \textbf\textitTabKG, a knowledge-graph-guided framework for logically consistent synthetic supply chain tabular data generation. TabKG constructs a \textbf\textitColumn Relationship Knowledge Graph (CR-KG) to represent data operational dependencies. It uses a multi-LLM ensemble with majority voting to propose candidate relationships from column metadata, validates these relationships against real data to remove hallucinated or unsupported edges, and then uses the validated CR-KG to guide generation. Specifically, TabKG compresses the original table into independent columns, generates these columns using a latent diffusion model, and deterministically reconstructs dependent columns according to the validated relationships, enforcing logical consistency by construction with respect to the discovered operational rules.
[NLP-66] Psychological Constructs in Shared Semantic Space
【速读】: 该论文试图解决心理学中不同构念(construct)常被分散在独立的测量工具、数据集和研究传统中,导致难以直接比较的问题。解决方案的关键在于构建一个共享的词嵌入空间(word-embedding space),通过监督语义差异法(Supervised Semantic Differential)估计每个构念特有的语义梯度,并将其投影到理论驱动的参考轴上进行比较。以情绪的三维度模型(Valence, Arousal, Dominance, VAD)作为情感坐标系,作者首先从英语词级情感规范中恢复出可解释的VAD方向,随后将GoEmotions的27类情绪语义梯度投影至该空间,验证了情绪在效价和唤醒维度上的预期组织结构;最后将同样的方法应用于IPIP-NEO-300人格量表的五大特质及其子维度,发现领域级结果具有较高一致性,而子维度结果因问卷文本稀疏性更具探索性。研究表明,词嵌入空间能够支持跨测量体系的心理构念比较,前提是需评估其语义定位的稳定性和可解释性。
链接: https://arxiv.org/abs/2605.26801
作者: Hubert Plisiecki
机构: IDEAS Research Institute
类目: Computation and Language (cs.CL)
备注:
Abstract:Psychological constructs are often measured in separate instruments, datasets, and research traditions, which makes direct comparison difficult. This paper proposes a framework for making such constructs semantically commensurate by representing and comparing them as directions in a shared word-embedding space. Using Supervised Semantic Differential, we estimate construct-specific semantic gradients from text-outcome associations and project them onto theoretically motivated reference axes. As an initial test case, we use Valence, Arousal, and Dominance (VAD) as an affective coordinate system. First, we recover interpretable VAD directions from English word-level affective norms. Second, we project semantic gradients for 27 GoEmotions categories into this space and recover the expected organization of emotions, especially along valence and arousal. Third, we apply the same procedure to Big Five personality domains and facets derived from IPIP-NEO-300 item-factor associations. Domain-level placements are broadly coherent, while facet-level results are more exploratory because they rely on sparse questionnaire text. The results suggest that embedding spaces can support construct-level comparison across otherwise incommensurable psychological measurements, provided that semantic placements are assessed for stability and interpretability.
[NLP-67] Latent Recurrent Transformer: Architecture Exploration Training Strategies and Scaling Behavior
【速读】: 该论文试图解决自回归Transformer在推理过程中缺乏高效状态记忆机制的问题,导致模型难以利用先前位置的高阶信息来增强当前预测。解决方案的关键在于提出Latent Recurrent Transformer (LRT),其核心创新是复用前一标记的源层隐藏状态(source-layer hidden state)作为跨位置的循环记忆(recurrent memory),并通过引入交错并行训练(interleaved parallel training)实现大规模预训练,无需顺序展开Transformer结构。该方法不引入额外的暂停标记或深度循环结构,同时保持标准注意力机制和KV缓存接口不变,在仅增加0.3%参数的情况下,显著提升了语言建模损失和上下文学习能力,且计算成本约为基线的两倍。
链接: https://arxiv.org/abs/2605.26797
作者: Zeyi Huang,Xuehai He,LiLiang Ren,Yiping Wang,Baolin Peng,Hao Cheng,Shuohang Wang,Pengcheng He,Jianfeng Gao,Yong Jae Lee,Yelong Shen
机构: Microsoft; University of Wisconsin-Madison; University of Washington
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly 2 times baseline compute. Across nanochat style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% parameters.
[NLP-68] SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在多轮对话中性能显著下降的问题,即“Lost in Conversation”现象——当任务信息被分阶段逐步提供时,模型性能相比单轮完整输入可下降高达39%。研究表明,这种性能下降主要源于可靠性(reliability)的恶化(不稳定性增加112%),而非能力(aptitude)的根本退化(仅下降16%)。其关键解决方案是提出SeDT(Sentence-transformer Decision-Transformer)方法,这是一种无需训练、仅在推理阶段生效的机制:通过引入来自离线强化学习的return-to-go条件控制思想,利用语义、词汇和位置三个互补信号为每段对话片段计算累积相关性得分,并将整个标注后的对话历史以加权形式呈现给模型,从而帮助模型识别哪些先前交互对当前任务至关重要。实验表明,SeDT在三种LLM和三种生成任务上的九种组合中均优于基线,最高提升达+37.7%,同时在七组场景中显著降低不稳定性,证明了“明确告知模型哪些过去对话内容重要”即可有效恢复多轮对话中的性能损失。
链接: https://arxiv.org/abs/2605.26788
作者: Ramakrishna Vamsi Setti,Jagadeesh Rachapudi,Sachin Chaudhary,Praful Hambarde,Amit Shukla
机构: Drone Lab, IIT Mandi; UPES, Dehradun
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure; the best case, the aptitude only falls 16%, while the unreliability more than doubles (+112%). We argue that the root cause is structural, a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT Sentence-transformer Decision-Transformer, a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary semantic, lexical, and positional signals and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark in three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
[NLP-69] EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation
【速读】: 该论文试图解决的问题是:在对抗性谈判中,经过人类偏好对齐的大型语言模型(LLM)可能因过度遵循礼貌和安全响应而暴露于情感操控风险,导致谈判结果偏向对方利益。解决方案的关键在于提出一个名为EmoDistill的离线框架,通过将情感策略解耦为“情绪选择”与“情绪表达”两个模块:利用隐式Q学习(IQL)选择最优情绪,结合低秩适应(LoRA)策略通过监督微调(SFT)与裁判策略优化(JPO)学习具体表达方式。实验表明,该方法在四个高价值、情绪敏感的谈判场景中显著提升代理效用,优于基线模型,并展现出跨领域、跨对手及训练-训练对战中的泛化能力,且无需在线谈判即可完成训练。
链接: https://arxiv.org/abs/2605.26785
作者: Yunbo Long,Haolang Zhao,Lukas Beckenbauer,Liming Xu,Alexandra Brintrup
机构: University of Cambridge (剑桥大学); Technical University of Munich (慕尼黑工业大学); Exiger LLC (Exiger有限责任公司); The Alan Turing Institute (艾伦图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty’s interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce \textbfEmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns \emphwhich emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns \emphhow to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.
[NLP-70] Quality Without Usefulness: LLM -Generated XAI Narratives as Trust Heuristics Rather Than Decision Aids
【速读】: 该论文试图解决的问题是:尽管生成式 AI (Generative AI) 生成的自然语言解释(Natural Language Explanations, NLEs)在文本质量指标(如合理性、连贯性和可理解性)上表现优异,但这些高质量的解释是否真正提升了用户在实际任务中的表现和决策有效性。解决方案的关键在于通过五个受控实验(共2,730次判断,涵盖60个测试实例),在保持NLE文本质量恒定的前提下,系统评估NLE对五类不同实用性维度的影响。研究发现,NLE并未提升任何任务的准确性,反而显著增加了用户的自我信心,且这种信心提升源于文本的存在而非内容本身;更严重的是,在分布外检测任务中,NLE削弱了模型判别不可靠预测的能力,导致虚假安全感。作者将这一现象称为“质量-实用性鸿沟”(Quality-Usefulness Gap),并主张XAI到NLE的转化流程评估必须超越传统文本质量指标,转向对下游任务性能的实际影响分析。
链接: https://arxiv.org/abs/2605.26770
作者: Fabian Lukassen,Jan Herrmann,Christoph Weisser,Alexander Silbersdorff,Benjamin Saefken,Thomas Kneib
机构: University of Göttingen (哥廷根大学); BASF SE (巴斯夫公司); Hochschule Bielefeld (比勒费尔德应用科学大学); TU Clausthal (克劳斯塔尔工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Prior work shows that Large Language Models (LLMs) can transform Explainable AI (XAI) outputs into Natural Language Explanations (NLEs) that score highly on quality metrics such as plausibility, coherence, and comprehensibility. But does explanation quality translate to practical usefulness? We investigate this question in a time-series energy forecasting domain through five controlled experiments (2,730 judgments across 60 test instances), each operationalising a distinct facet of usefulness studied in the XAI literature. Holding NLE quality constant at the high levels established by a prior factorial study, we find that NLEs do not improve task accuracy on any of the five tasks, while inflating self-reported confidence. A placebic control shows that this confidence boost is driven by text presence rather than content. In an out-of-distribution detection task, NLEs reduce the LLM judge’s ability to flag unreliable predictions, providing false reassurance that masks model failure. We characterise these findings as the Quality-Usefulness Gap and argue that evaluation of the XAI-to-NLE pipeline must extend beyond text-quality metrics to downstream task performance.
[NLP-71] From Snippets to Semantics: Rethinking Evidence Granularity for Multilingual Fact Verification
【速读】: 该论文旨在解决多语言事实核查中证据相关性不足与完整性欠缺的问题,现有系统常依赖搜索片段、句子级证据或局部段落,易遗漏关键上下文并导致证据碎片化。其解决方案的关键在于提出SEEK框架——一种基于语义主题转换自适应切分的证据提取方法,能够从完整的事实核查文章中构建连贯的证据块,同时保留局部验证语境;随后利用多语言编码器对证据块进行表征,并通过LoRA适配器微调多语言大模型(LLM)进行真伪预测。实验表明,SEEK在X-FACT和RU22Fact数据集上相较语义切分提升宏F1达10%,相较句子切分提升19%,相较搜索片段基线提升20%,且证据完整性和显著性分析验证了其能更好保留验证上下文,从而实现更可靠的多语言事实核查。
链接: https://arxiv.org/abs/2605.26755
作者: Babu Kumar,Gaurav Kumar,Ayush Garg,Aditya Kishore,Jasabanta Patro
机构: Indian Institute of Science Education and Research, Bhopal, India
类目: Computation and Language (cs.CL)
备注:
Abstract:Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK, a Semantic Evidence Extraction with an adaptive chunKing framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder and then multilingual LLMs are finetuned using LoRA adapter for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-f1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.
[NLP-72] KARMA: Karma-Aligned Reward Model Adaptation
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在对话中缺乏语用能力的问题,即模型难以根据上下文、语气和社交规范等隐含社会信号进行有效沟通,而不仅仅是依赖语义内容。解决方案的关键在于提出KARMA(Karma-Aligned Reward Model Adaptation)框架:该框架通过在Reddit对话数据上训练一个奖励模型(Reward Model),使其能够基于对话上下文预测回复的受欢迎程度(Reddit karma),并利用这一信号通过强化学习微调语言模型,从而提升其在语用驱动任务中的表现。关键发现是,最优下游模型性能并非来自最准确预测karma的奖励模型,而是来自仅依赖对话上下文的奖励模型——尽管它对karma的预测能力较弱,却显著提升了语用行为的适配性,并减少了不良副作用;同时,该方法导致事实性(factuality)普遍下降,表明这种权衡可能根植于奖励信号本身,而非训练数据噪声所致。
链接: https://arxiv.org/abs/2605.26738
作者: Jared Scott,Jesse Roberts
机构: Tennessee Tech University (田纳西理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Human communication depends on implicit social signals where effectiveness is shaped by tone, context, and conversational norms rather than semantic content alone. We introduce KARMA (Karma-Aligned Reward Model Adaptation), a framework for LLM learning of context-sensitive conversational behavior from large-scale social interaction data. KARMA trains a reward model on Reddit conversations to predict response valuation conditioned on context, and uses this signal to fine-tune language models via reinforcement learning to improve performance on pragmatics-mediated tasks. Critically, we find that the highest performing reward model does not lead to better downstream model alignment: a reward model relying exclusively on conversational context was a worse predictor of Reddit karma but yielded substantially better downstream performance. We evaluate the effects of KARMA applied to a downstream model with and without direct exposure to the social media data. The resulting models show improved pragmatics-mediated behaviors with largely mitigated undesirable side effects. Factuality is consistently diminished by KARMA across all conditions, including when the downstream model has no direct exposure to Reddit data, suggesting that this tension is embedded in the reward signal itself rather than introduced by noisy training data.
[NLP-73] Rethinking the Multilingual Reasoning Gap with Layer Swap
【速读】: 该论文试图解决的问题是:在多语言推理任务中,模型采用母语推理(native reasoning)相较于英语枢轴推理(English-pivoted reasoning)时性能显著下降的现象,即“母语推理差距”(native reasoning gap)。此前研究普遍认为,强制模型在输入语言中进行链式思维(Chain-of-Thought, CoT)会导致性能大幅降低,但这些结论大多基于推理阶段干预或有限的母语训练数据。本文通过大规模实验和更公平的监督条件重新评估这一现象:构建涵盖六种语言(英语、法语、德语、西班牙语、中文和斯瓦希里语)的长链条多语言推理数据集,对Qwen3-8B-base模型分别进行母语推理和英语枢轴推理的微调,并在数学、科学、常识和代码四个任务上评估性能。结果显示,平均母语推理差距缩小至1.9%–3.5%,远低于以往报道。进一步的权重空间分析揭示,母语专用模型的中间层更新趋于一致,而外层则出现分化,表明存在一个语言无关的推理核心(language-agnostic reasoning core),以及围绕其的语言特异性层。基于此结构,作者提出“层交换”(Layer Swap)策略——将英语专用模型中更强的中间推理层迁移至各母语模型,从而显著缩小母语推理差距,同时保持CoT输出为目标语言。
链接: https://arxiv.org/abs/2605.26735
作者: Maxence Lasbordes,Amélie Chatelain,Djamé Seddah
机构: LightOn, Paris; Inria, Paris
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent reasoning Large Language Models produce a chain-of-thought (CoT) predominantly in English, even when prompted in non-English languages. Prior work suggests that forcing the CoT to remain in the input language (\emphnative reasoning) substantially degrades performance relative to allowing the model to reason in English before answering in the input language (\emphEnglish-pivoted reasoning). However, most studies of this native reasoning gap rely on inference-time interventions or limited native-language training data. We revisit this comparison at a larger scale and under comparable supervision. We construct long multilingual reasoning datasets across six languages (English, French, German, Spanish, Chinese and Swahili); fine-tune specialists in both native and English-pivoted regimes on top of \textttQwen/Qwen3-8B-Base, and evaluate across mathematics, science, general knowledge, and code. In this setting, the average native reasoning gap shrinks to 1.9–3.5% across the five non-English languages, considerably smaller than previously reported. Weight-space analysis of the native specialists reveals aligned fine-tuning updates in the middle layers and divergence in the outer layers. This points to a largely language-agnostic reasoning core surrounded by language-specific layers. Exploiting this structure, we introduce a Layer Swap: transferring the English specialist’s stronger reasoning mid-layers into each native specialist, closing most of the native reasoning gap across the five non-English languages while preserving CoT in the target language. We release all models and datasets.
[NLP-74] Its Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
【速读】: 该论文试图解决的问题是:当前主流假设认为大语言模型(LLM)代理部署中,模型能力越强则所需结构化引导(harness)越少,二者呈单调递减关系。然而,这一假设缺乏实证检验。研究通过一个包含432次运行的受控实验,验证了该假设在不同能力层级的模型上的适用性。其解决方案的关键在于设计了一个多维度的实验框架,交叉测试六个不同能力层级的模型(每层仅一个代表模型),并施加三种结构化引导强度(轻量、平衡、严格),在HEAT-24合成基准上评估性能指标VTSR(Verified Task Success Rate)。结果表明,该单调关系不成立:对于前沿对话模型(Gemini 2.5 Flash),增加引导复杂度反而显著降低性能(下降29–38个百分点),形成“引导复杂度悖论”;而对于前沿推理模型(Qwen3.5-122B),严格的引导反而带来最高成功率(91.7%)和最低延迟,与预期相反。此外,研究还提出六类失败标签分类体系,揭示了不同能力层级模型的失败模式差异(高能力模型主要因格式违反失败,低能力模型则因文件错误失败),从而推导出面向能力层级的引导选择策略,强调模型类型(对话 vs 推理)对引导敏感性的关键影响。
链接: https://arxiv.org/abs/2605.26731
作者: Yong-eun Cho
机构: KailosLab
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9 pages, 3 figures
Abstract:A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance – together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points – a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines. Comments: 9 pages, 3 figures Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2605.26731 [cs.AI] (or arXiv:2605.26731v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.26731 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yong Eun Cho [view email] [v1] Tue, 26 May 2026 09:08:41 UTC (55 KB) Full-text links: Access Paper: View a PDF of the paper titled It’s Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers, by Yong-eun ChoView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-05 Change to browse by: cs cs.CL References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[NLP-75] PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers
【速读】: 该论文试图解决的问题是:当前生成式 AI (Generative AI) 驱动的自动同行评审系统在实际应用中对科学性漏洞(scientific gaps)的识别能力尚不明确,尤其与人类审稿人相比表现如何缺乏系统评估。解决方案的关键在于提出 PRISM(Peer Review Intelligence via Structured Multi-dimensional assessment),一个基于多维结构化评估的基准框架,从四个核心维度——分析深度(Depth of Analysis)、新颖性评估(Novelty Assessment)、缺陷识别(Flaw Identification)和重大问题优先级排序(Major Issues Prioritization)及多维建设性(Multi-dimensional Constructiveness)——对自动化评审系统进行精细化测评。PRISM 通过引入论点挖掘(argument mining)、检索增强验证(retrieval-augmented verification)和共识评分机制,避免了传统表面指标(如 ROUGE、BLEU)或无约束的大模型自评(LLM-as-a-judge)所导致的误判,从而更真实地反映评审质量。实验表明,尽管部分 LLM 系统在特定维度上可媲美甚至超越人类审稿人,但没有单一系统能在所有维度上保持均衡表现,且各系统存在结构性盲区,因此建议将 LLM 审稿人视为人类审稿人的针对性补充工具,而非完全替代方案。
链接: https://arxiv.org/abs/2605.26730
作者: Ngoc Phan Phuoc Loc,Toan Huynh La Viet,Thanh Tran Khanh,Duy A Nguyen,Tuan Anh Nguyen Pham,Thanh Nguyen,Nitesh V. Chawla,Wray Buntine,Kok-Seng Wong,Khoa D. Doan,Binh T. Nguyen
机构: VinUniversity; University of Illinois, Urbana-Champaign; University of Notre Dame; Monash University
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid growth in submissions to machine learning venues has strained the scientific peer-review system and intensified interest in LLM-based automated peer reviewers. However, how good these systems are actually, especially compared to human reviewers at catching scientific gaps, remains poorly understood. In this work, we introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmarking framework that evaluates review quality across four dimensions: Depth of Analysis, Novelty Assessment,Flaw Identification Major Issues Prioritization, and Multi-dimensional Constructiveness. Unlike most existing evaluations based on surface-level metrics like ROUGE and BLEU, or unconstrained LLM-as-a-judge prompting that conflates fluency with rigor, PRISM grounds each dimension in argument mining, retrieval-augmented verification, and consensus-based scoring. We apply PRISM to benchmark five leading automated reviewer systems and human reviewers on a stratified corpus of reviews from ICLR, ICML, and NeurIPS. The results reveal that LLMs can match or beat human reviewers on individual dimensions: comparable depth of analysis, stronger novelty verification, and highly accurate critique prioritization. However, no single system consistently matches the balanced performance of the human baseline across all dimensions at once. Each exhibits a distinct specialization profile with characteristic blind spots – failure modes that aggregate metrics miss entirely. The implication is that LLM reviewers are best understood as targeted supplements to human review, effective within specific dimensions, but unreliable as standalone replacements. Our demo and key results can be found at this https URL.
[NLP-76] he Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models
【速读】: 该论文试图解决的问题是:在生成式 AI(Generative AI)模型中,当存在未观测的潜在状态(latent state)时,即使模型能够完美地学习文本边缘分布(text-only marginal law),仍可能因错误地假设当前输入来自正确状态而导致过度自信(overconfident)预测,从而引发熵差(entropy difference)——这种偏差并非传统意义上的优化误差,而是由于对未观测状态进行边际化(marginalization)所导致的“充分性缺口”(sufficiency gap)。解决方案的关键在于引入一个辅助二元信号(auxiliary binary signal),其保真度为 γ∈[1/2,1],通过贝叶斯更新形式化检索、工具使用和外部锚定(external grounding)机制,并推导出一个“情境主导阈值”(contextual dominance threshold):当辅助信号的保真度超过仅基于文本历史分配给误导性潜在状态的后验权重时,该信号才能逆转由文本历史诱导的后验比。这一阈值虽可缩小但无法完全消除充分性缺口;彻底闭合需对相关潜在状态实现完全揭示或等效验证机制。该分析揭示了温度缩放无法恢复缺失上下文的根本原因,强调了接地机制必须兼具信息性和可学习性,且在高风险领域中,自主序列模型需依赖结构解耦的观察者或验证器以确保可靠性。
链接: https://arxiv.org/abs/2605.26711
作者: Francesco Corielli
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We construct a binary mixed-regime process with one deterministic textual regime and one random regime governed by an unobserved latent state. Even an ideal infinite-capacity sequence predictor that exactly recovers the text-only marginal law can become overconfident when the observed prefix is compatible with the wrong latent regime. The resulting entropy difference is not an ordinary optimization error; it is a sufficiency gap caused by marginalization over an unobserved state. We then formalize retrieval, tool use, and external grounding through an auxiliary binary signal with fidelity \gamma \in [1/2,1] . The resulting Bayesian update yields a contextual dominance threshold: a corrective signal reverses the posterior odds induced by the textual history exactly when its fidelity exceeds the text-only posterior weight assigned to the misleading regime. This threshold reduces, but does not generally eliminate, the sufficiency gap; complete closure requires perfect revelation of the relevant latent state or an equivalent verification mechanism. The analysis clarifies why temperature scaling cannot restore missing context, why grounding mechanisms must be both informative and learnably usable by the model, and why autonomous sequence models require structurally decoupled observers or verifiers in high-stakes domains.
[NLP-77] PinPoint: Prompting with Informative Interior Points
【速读】: 该论文试图解决的是当前无训练(training-free)的指代表达图像分割(referring image segmentation)方法性能显著低于微调或强化学习(RL)优化的专业模型的问题。其核心问题是:性能差距主要源于提示(prompt)的模糊性,即视觉语言模型(VLM)生成的边界框(bbox)无法明确指示目标对象的像素归属,导致分割模型(如SAM)在内部区域进行猜测时容易误判。解决方案的关键在于提出一种确定性的、无需训练的点选择机制——PinPoint,它通过融合四种视觉线索生成共识图,从中选取远离边界且空间分布均匀的稳定点,并利用冻结的VLM对每个点进行标签标注,从而大幅减少误导性采样带来的误差。实验表明,在仅使用五个内部点的情况下,PinPoint相比原始随机采样可提升累计交并比(cIoU)12–18个百分点,且在不引入任何任务特定训练的前提下达到与监督和RL调优方法相当的性能,同时每查询仅需两次VLM调用。
链接: https://arxiv.org/abs/2605.26689
作者: Pouya Sadeghi,Shawn He,Pedro Pablo Guerrero Vela,C. Thomas,Alex Wong,Sirisha Rambhatla
机构: University of Waterloo; Critical ML, University of Waterloo; Apple
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM’s grounding, SAM’s capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
[NLP-78] An In-Vitro Study on Cross-Lingual Generalization in Language Models
【速读】: 该论文试图解决的问题是:在自然语料中研究语言模型的跨语言迁移(cross-lingual transfer)时,词汇重叠、形态学差异、数据不平衡和分词机制等因素相互纠缠,难以分离各因素对迁移效果的影响。为此,作者提出了一种“体外”(in-vitro)实验框架,通过生成两种共享同一语义本体(ontology)、类型化语法(typed grammar)和组合结构但表面形式不同的语言,实现对词汇距离、少数语言比例、分词器训练策略和词汇量等变量的独立控制。关键解决方案在于:利用这种可控的合成语言环境,评估在训练中从未见过的少数语言词汇形式下的掩码迁移能力,并发现迁移效果主要取决于分词是否保留可复用的跨语言子结构(reusable cross-lingual substructure),而非分词平衡或原始词汇相似度;此外,迁移呈现阶段性特征——语法和类型级能力先于掩码词汇泛化出现,且分词桥接强度(tokenizer bridges)与掩码可达性(masked reachability)高度相关。
链接: https://arxiv.org/abs/2605.26683
作者: Adrian Cosma
机构: Dalle Molle Institute for Artificial Intelligence (IDSIA)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 Figures, 1 Table
Abstract:Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.
[NLP-79] NestedKV: Nested Memory Routing for Long-Context KV Cache Compression
【速读】: 该论文旨在解决长上下文语言模型中因键值(Key-Value, KV)缓存内存占用过高而导致的性能瓶颈问题。现有无训练的KV压缩方法通常依赖单一重要性指标(如注意力权重、时间顺序、分层分配或键的区分度),在处理全局显著但局部稀疏或即时相关的信息时表现脆弱。其解决方案的关键在于提出NestedKV,一种受嵌套学习中连续记忆系统启发的仅基于键的KV缓存压缩方法:它通过多时间尺度余弦异常分数对token进行排序,并结合无需训练的外部学习器(head-adaptive mixing与surprise-gated token路由)实现高效组合排名;同时引入全局、块级和滑动窗口三类键锚点,配合每头自适应预算分配机制,在不修改大语言模型(LLM)的前提下实现高精度压缩。实验表明,NestedKV在缓存保留率较低时优势显著,在RULER、LongBench等基准测试中相较KeyDiff提升达19.10分(r=0.75),且在极端压缩比下仍保持优异性能(如r=0.95时LongBench得分37.32 vs KeyDiff的17.55)。
链接: https://arxiv.org/abs/2605.26678
作者: Hong Chen,Xiang Liu,Yubo Gao,Yuxuan Fan,Bo Wang,Yuanlin Chu,Yuanguo Lin,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); Jimei University
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal – attention, recency, layer-wise allocation, or key distinctiveness – which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k–32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at r=0.75 ; at r=0.95 , it retains 37.32 on LongBench versus 17.55 for KeyDiff.
[NLP-80] he Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models ICML2026
【速读】: 该论文试图解决大语言模型中结构化知识的连续编辑问题,即如何在不重新训练模型的情况下实现目标事实的精准更新,同时确保编辑过程的稳定性和可靠性。现有方法通常依赖复杂的正则化或约束机制,但其必要性尚不明确。论文的关键贡献在于通过严格的优化分析,首次证明了一次性编辑与连续编辑在理论上等价,并进一步将这一等价关系推广到更广泛的编辑目标类别,揭示了稳定性源于对累积编辑约束的合理建模,而非特定正则化或零空间操作。研究还验证了多数常用正则化策略在可靠连续更新中并非必需,并扩展框架以处理冲突编辑,从而为实现更简单、可解释且可靠的模型知识更新提供了理论基础和实践路径。
链接: https://arxiv.org/abs/2605.26670
作者: Zheng Wang,Kaixuan Zhang,Wanfang Chen,Jingwen Zhang,Xiaonan Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication at ICML 2026
Abstract:Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one-time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne’s thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at this https URL.
[NLP-81] AI evaluation may bias perceptions: The importance of context in interpreting academic writing
【速读】: 该论文试图解决的问题是:在评估人工智能(AI)在科学写作中的使用时,若忽略不同国家和学科领域的语境差异,会导致估计结果产生系统性偏差。解决方案的关键在于构建基于国家-学科特定的“AI相似度”基准(AI-likeness benchmarks),而非采用全局统一的基准。研究利用Dimensions数据库中大规模期刊文献数据,通过比较人类撰写与大语言模型(LLM)重写摘要的差异来建立基准。结果表明,混合所有群体的 pooled 基准会将原本存在的风格差异误判为AI生成文本,从而在不同国家-学科组中造成显著扭曲;而分组基准则有效缓解了这种偏差,提供了更可信的比较基线。应用该方法对2025年出版物的分析显示,pooled基准系统性高估某些国家和学科的AI使用率,同时低估其他群体,凸显了情境感知测量对于科学领域AI使用公平且准确评估的重要性。
链接: https://arxiv.org/abs/2605.26662
作者: Shang Wu,Randol Yao
机构: UC Irvine (加州大学欧文分校); MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.
[NLP-82] Why Prompt Optimization Works and Why It Sometimes Doesnt: A Causal-Inspired Edit-Level Analysis
【速读】: 该论文试图解决的问题是:当前自动化提示优化方法(如DSpy、TextGrad)在特定任务上表现优异,但在跨任务场景下泛化能力不足,即优化后的提示在某一基准测试中有效,却难以迁移到其他任务,且这种局限性在不同大语言模型(LLM)骨干网络之间依然存在。解决方案的关键在于通过因果推断启发的观察性分析,识别出提示编辑模式与任务特性之间的系统性交互关系。研究构建了基于倾向得分调整的关联分析框架,并结合多种提示编辑的互补表征方式,发现复杂度增加和元指令类编辑会显著降低数学推理和多跳推理性能,而逐步分解和元认知类编辑则提升逻辑与序列推理任务的表现;这些效应在认知负荷标注、表面文本特征及编辑动机模式分析中均具鲁棒性,且可跨优化框架泛化。因此,提示优化失败的根本原因并非随机优化结果,而是编辑类型与任务特征间的系统性不匹配,这为未来设计任务条件化的提示优化器提供了特征层面的理论依据。
链接: https://arxiv.org/abs/2605.26655
作者: Shuzhi Gong,Hechuan Wen
机构: The University of Melbourne (墨尔本大学); The University of Queensland (昆士兰大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 17 pages, 4 figures, 8 tables
Abstract:Automated prompt optimization methods (e.g., DSpy, TextGrad) can substantially improve the performance of large language model (LLM), however, their generalization ability across different tasks remains underperformed. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To achieve the goal, we build upon the propensity-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task-conditioned edits patterns are identified. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.
[NLP-83] Bounded Path Context: A Controlled Study of Visible Path History in LLM -Based Knowledge Graph Question Answering EMNLP2026
【速读】: 该论文旨在解决基于大语言模型(LLM)的知识图谱问答(KGQA)系统中,路径信息在提示词(prompt)中冗余序列化导致的效率低下问题。当前方法通常将完整的局部路径作为输入传递给语言模型进行关系选择决策,但这种做法忽略了控制器已维护精确符号状态的事实,造成计算资源浪费且未明确优化提示结构。解决方案的关键在于提出“受限路径上下文”(Bounded Path Context, BPC),其核心思想是解耦路径的符号存储与提示中的暴露:控制器保留完整路径用于答案提取和审计,而关系选择提示仅包含问题、当前实体、候选关系及最多前K跳的历史信息。实验表明,在WebQSP和CWQ基准上,使用K=1时F1分数优于全历史提示(分别提升0.015和0.013),同时显著减少输入token数(分别下降9.7%和12.1%),证明路径长度应被视为可调接口参数而非默认假设。
链接: https://arxiv.org/abs/2605.26645
作者: Xihang Shan,Ye Luo
机构: Xiamen University (厦门大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 1 figure, submitted to EMNLP 2026
Abstract:LLM-based knowledge-graph question answering (KGQA) delegates graph traversal to language models, turning each question into a sequence of local relation-selection decisions repeated across beams and hops. A common but untested default is to serialize the complete partial path into every routing prompt, even though the controller already maintains this path as exact symbolic state. Bounded Path Context (BPC) decouples these two roles: the controller retains full paths in symbolic memory for answer extraction and audit, while the relation-selection prompt exposes only the question, the current entity, outgoing relation candidates, and at most the last K hops. A controlled sweep over K – fixing graph neighborhoods, beam budget, depth, decoding, and answer-extraction format – shows that bounded histories match or exceed full-history prompting on complete WebQSP and CWQ test sets with Qwen3.5-9B-AWQ: K=1 achieves 0.487 answer-set F1 on WebQSP versus 0.472 for full history, and K=0 reaches 0.287 on CWQ versus 0.274, with 9.7% and 12.1% fewer input tokens respectively. At the 4B scale, K=1 remains the strongest setting on both benchmarks. Per-example analysis reveals that 71-84% of examples are unaffected by history length, while the affected cases expose when prior hops disambiguate versus distract. These results suggest that path serialization length is better treated as a tunable interface variable than as a default assumption in LLM-based graph controllers.
[NLP-84] LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
【速读】: 该论文试图解决的问题是:在使用冻结的大语言模型(Frozen Large Language Models)进行个性化生成时,如何有效地利用用户历史行为信息来构建既紧凑又具备时效性的条件信号(conditioning signal)。现有方法通常通过文本检索、摘要或静态潜在表示(如软提示)来压缩用户历史,但这些方法将用户的稳定身份、近期变化和物品内容混杂在同一表征中,导致个性化效果受限。解决方案的关键在于提出LAtent Trajectory Tracking and Extrapolation (LATTE) 框架,其核心创新包括:1)通过减去基于同类用户对相同物品响应的时序掩码基线(time-masked baseline),提取出反映目标用户相对于同侪在特定物品上下文中的相对偏好状态(relative preference state);2)使用轻量级序列预测器对未来状态进行轨迹外推;3)通过一个“状态到token桥接”机制,将预测的状态注入冻结的指令微调大模型,仅需一个锚定软token即可实现高效个性化。实验证明,LATTE在Amazon Reviews 2023和MemoryCD数据集上显著优于多种基线方法,ROUGE-L指标提升至0.259,且诊断分析表明性能提升主要源于对用户特异性轨迹信息的预测,而非单纯增加软提示接口。
链接: https://arxiv.org/abs/2605.26612
作者: Jinze Li,Xiaoyan Yang,Shuo Yang,Jinfeng Xu,Yue Shen,Jian Wang,Jinjie Gu,Edith Cheuk-Han Ngai
机构: The University of Hong Kong (香港大学); Ant Healthcare, Ant Group (蚂蚁健康)
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user’s past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (LATTE), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, LATTE subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that LATTE consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, LATTE improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.
[NLP-85] Hubness Not Anisotropy Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models
【速读】: 该论文试图解决多语言嵌入模型中跨语言检索不对称性的问题,即当用语言A的查询检索到语言B的翻译时,反向检索却往往无法成功。其解决方案的关键在于识别并纠正导致这一现象的核心机制——hubness(枢纽性),而非此前认为的各向异性(anisotropy)、质心漂移(centroid drift)或向量幅度差异。作者通过5个预注册实验验证了hubness是影响互近邻 reciprocity(互近邻对称性)的主导因素(贡献度达49.5%),并证明使用CSLS(Cross-lingual Similarity Linear Scaling)校正后的相似度度量可消除63.5%的最差与最佳模型间 reciprocity 差距,且效果显著优于仅移除hub向量的“手术式”干预,从而揭示hubness本质上是相似度度量本身的病理问题,而非单个向量特性所致。研究最终建议将CSLS作为多语言嵌入流水线的标准检索度量。
链接: https://arxiv.org/abs/2605.26575
作者: Adib Sakhawat,Fardeen Sadab,Atik Shahriar
机构: Islamic University of Technology, Dhaka, Bangladesh
类目: Computation and Language (cs.CL)
备注: 17 pages, 5 figures
Abstract:Multilingual embedding models are deployed under the assumption that cross-lingual retrieval is symmetric: if a query in language A retrieves its translation in language B, the reverse should also hold. In practice it does not. Using a parallel corpus of 6,518 idiomatic and proverbial expressions in English, Bangla, Hindi, and Arabic, embedded by five production-grade encoders (Gemini, Mistral, OpenAI-L, OpenAI-S, Qwen), we formalise this failure as a deficit in mutual nearest-neighbour reciprocity and test a single mechanistic claim: among the geometric pathologies of multilingual spaces, hubness, not anisotropy, centroid drift, or magnitude, is the dominant causal driver. Across five pre-registered experiments with falsification conditions specified in advance, hub mass dominates a joint regression on reciprocity (49.5% dominance share, 1.68x the next predictor; partial R^2 = 0.302 versus 0.003 for anisotropy), while a hub-aware score correction (CSLS) closes 63.5% of the worst-to-best reciprocity gap and yields a mean within-model effect size 130x larger than surgical hub-vector ablation. The latter contrast pinpoints the mechanism: hubness is a pathology of the similarity metric, not of individual hub vectors. We resolve the well-known anisotropy-hubness paradox by showing the two are statistically dissociable, and we recommend replacing cosine similarity with CSLS as the default retrieval metric for multilingual embedding pipelines.
[NLP-86] Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline
【速读】: 该论文旨在解决门诊病历笔记中“行动-时间”(action-date)配对提取的问题,这类信息常以隐含的时间推理形式出现(如“两周后做脑部MRI”),而传统生成式模型在解码过程中难以准确捕捉日期信息。解决方案的关键在于采用一种混合神经符号流水线:首先使用BioBERT进行实体识别(TestSpecification和TimeSpecification)与关系链接,随后通过28个行动类别的本体映射和确定性的时间偏移归一化处理,将时间信息转化为天数偏移量;这种分离策略使模型能够独立处理语义抽取与数值计算,从而显著提升准确率。实验表明,该方法在259条测试样本上达到0.997(已见)和0.986(未见)的Pair F1分数,且平均绝对误差(MAE)为0.00天,远优于零样本GPT-4o-mini和微调后的LLaMA-3 8B基线模型(Pair F1仅为0.51–0.57),验证了结构化符号推理优于端到端生成范式的有效性。
链接: https://arxiv.org/abs/2605.26560
作者: Michal Laufer,Yehudit Aperstein,Alexander Apartsin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures
Abstract:Objective. Outpatient notes carry follow-up instructions pairing actions with future times (“MRI brain in two weeks”). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.
[NLP-87] Conceptual Steganography
【速读】: 该论文试图解决的问题是:大型语言模型(LLM)在生成链式思维(CoT)时,可能通过隐写术(steganography) covertly传递信息,这种“编码推理”(encoded reasoning)行为可能绕过人类监督,带来安全风险。传统方法主要在词元(token)或词汇空间中嵌入隐蔽信息,而当前有效的防御手段是内容保持的改写器(paraphraser)。论文提出的关键解决方案是引入概念隐写术(conceptual steganography),即利用CoT中每一步高阶推理行为的模式来携带信息,而非依赖具体的词汇选择。实验证明,这种机制在四种模型家族和两个推理领域中均比传统关键词方法更鲁棒,且不影响CoT的推理有效性;进一步地,研究还发现策略感知的改写器(strategy-aware paraphraser)可显著削弱该通道,揭示了未来确保LLM推理忠实性的新挑战与防御方向。
链接: https://arxiv.org/abs/2605.26537
作者: Zhejian Zhou,Jonathan May
机构: University of Southern California; Information Sciences Institute
类目: Computation and Language (cs.CL)
备注:
Abstract:Language Models (LMs) emit Chains-of-Thought (CoTs) that drive much of their capability. However, the same sequence that carries useful reasoning can also covertly convey messages: a misaligned model may embed covert information in its CoT that slips through human supervision, a form of steganography known as encoded reasoning. Prior LM steganography schemes operate in the token or lexical space, and a content-preserving paraphraser is the canonical and effective defense in recent work. We introduce conceptual steganography, in which each step of a CoT carries information through patterns of high-level reasoning behavior, rather than through lexical choice. Across four model families and two reasoning domains, this backdoor communication channel is shown to be consistently more robust to a strong paraphrase defense than standard keyword approaches, and the encoding of information into CoTs does not affect their utility in the reasoning process. Having raised awareness of this new risk, we then demonstrate that a strategy-aware paraphraser can close much of the channel, highlighting new challenges and recommended defenses for ensuring faithful LLM reasoning in the wild.
[NLP-88] A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
【速读】: 该论文试图解决工业检测中缺陷定位与维护报告生成任务分离导致的效率低、人工依赖性强的问题,尤其是在风力涡轮机叶片检测场景下,现有方法难以实现高精度定位与结构化报告自动生成的一体化解决方案。其关键解决方案是构建一个解耦的边缘部署流水线,由三个模块组成:Eyes(YOLO26-x-obb定向边界框检测器)负责在原始分辨率下精确定位缺陷;Bridge(无参数编码模块)将每个检测框映射为网格参考的空间标记并嵌入结构化提示;Brain(经QLoRA微调的4-bit量化Qwen-2.5-1.5B模型)基于该提示生成JSON格式报告,并通过检索增强微调(RAFT)确保建议与已知维护规程对齐。实验表明,该架构在BLEU-4(0.41 vs 0.07)、幻觉率(4% vs 65%)和专家评分(8.6/10 vs 3.3/10)上显著优于单体视觉语言模型(VLM)基线,且在资源受限环境下实现了高效推理(47 tokens/秒)。
链接: https://arxiv.org/abs/2605.26533
作者: Malikussaid,Imad Gohar
机构: Telkom University (Telkom大学); Sunway University (思伦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 23 pages, 6 figures, 9 equations, and 6 tables
Abstract:Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes a YOLO26-x-obb oriented bounding-box detector localizes defects at dataset-native resolution. The Bridge a deterministic, parameter-free encoding module maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
[NLP-89] Verilog-Evolve: Feedback-Driven and Skill-Evolving Verilog Generation
【速读】: 该论文试图解决当前大型语言模型(LLM)在生成Verilog代码时存在的局限性问题,即仅通过孤立采样和功能验证的流水线难以满足实际寄存器传输级(RTL)设计的需求——生成的代码必须不仅功能正确,还需具备可综合、时序友好以及对下游硬件目标(如综合结果、时序分析、GEMM性能指标)友好的特性。解决方案的关键在于提出一个反馈驱动的框架Verilog-Evolve,其核心机制包括:1)针对每个任务生成多样化的次要候选版本,并通过功能仿真、Yosys综合、ABC时序代理及可选GEMM指标等多维执行反馈进行评估;2)基于配置评分策略将最优候选提升为正式版本;3)通过模块化技能引导与上下文感知的技能检索,结合历史日志中的“创建/改进/跳过”决策和验证报告,实现跨任务的技能演化。实验表明,Verilog-Evolve在功能成功率、版本稳定性和下游硬件友好性方面均优于基线方法,尤其在混合精度GEMM任务中,验证门控的技能演化进一步提升了GEMM下游性能和未见测试用例的通过率。
链接: https://arxiv.org/abs/2605.26498
作者: Zehua Pei,Hui-Ling Zhen,Yu Zhang,Sinno Jialin Pan,Mingxuan Yuan,Bei Yu
机构: The Chinese University of Hong Kong; Huawei Technologies Co., Ltd
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have improved Verilog generation from natural-language specifications, but most pipelines still treat generation as isolated sampling followed by functional checking. This is insufficient for practical RTL design, where useful Verilog must be correct, synthesizable, timing-conscious, and friendly to downstream hardware objectives. We present Verilog-Evolve, a feedback-driven framework for versioned Verilog refinement and cross-session skill evolution. For each task, Verilog-Evolve generates diverse minor candidates, evaluates them with executable feedback from functional simulation, Yosys synthesis, ABC timing proxy, and optional GEMM metrics, then promotes the best candidate into a major version under configurable scoring. To improve across tasks, the system maintains modular skill guidance, retrieves skills according to task and feedback context, and evolves candidate skills from logged histories through create/improve/skip decisions and verifier reports. Experiments on VerilogEval and mixed-precision GEMM tasks show that Verilog-Evolve improves final functional success and promotion stability while producing more downstream-friendly RTL under open-source synthesis, timing-proxy, and netlist-level GEMM objectives. Validation-gated skill evolution further improves GEMM downstream quality and achieves the best downstream score and GEMM held-out pass rate among the evaluated skill modes.
[NLP-90] he MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
【速读】: 该论文试图解决的问题是如何在保持低计算资源消耗的前提下,实现具备强大智能的代理型语言模型(agentic language models)在真实场景中的高效部署与持续进化。解决方案的关键在于:(1)构建以“小激活量”为核心的设计理念,通过仅激活9.8B参数/令牌(token)实现229.9B总参数模型的高性能;(2)开发端到端的代理驱动数据流水线(agent-driven data pipelines),生成可验证的代理轨迹,并基于执行环境和对齐奖励机制提升训练质量;(3)引入Forge系统——一个可扩展的原生代理强化学习框架,支持长周期代理轨迹、窗口FIFO调度、前缀树合并、推理优化以及训练-推理-代理解耦,从而兼容白盒与黑盒代理;(4)M2.7版本首次实现了自我演化能力,能够自主调试训练过程并修改自身结构,推动模型向自适应演进迈进。这一系列创新使MiniMax-M2系列在代理编码、深度搜索、办公任务和推理基准上达到前沿性能。
链接: https://arxiv.org/abs/2605.26494
作者: MiniMax:Aili Chen,Aonian Li,Baichuan Zhou,Bangwei Gong,Binyang Jiang,Boji Dan,Changqing Yu,Chao Wang,Cheng Ma,Cheng Zhong,Cheng Zhu,Chengjun Xiao,Chengyi Yang,Chengyu Du,Chenyang Zhang,Chi Zhang,Chuangyi Huang,Chunhao Zhang,Chunhui Du,Chunyu Zhao,Congchao Guo,Da Chen,Deming Ding,Dianjun Sun,Dongyu Zhang,Enhui Yang,Fei Yu,Guang Zheng,Guodong Zheng,Guohong Li,Haichao Zhu,Haigang Zhou,Haimo Zhang,Han Ding,Hao Zhang,Haohai Sun,Haolin Lyu,Haonan Lu,Haoyu Wang,Huajie Shi,Huiyang Li,Jiacheng Chen,Jian Zhang,Jiaqi Zhuang,Jiaren Cai,Jiaxin Pan,Jiayao Li,Jiayuan Song,Jichuan Zhang,Jie Wang,Jihao Gu,Jin Zhu,Jingwei Dong,Jingyang Li,Jingyu Zhang,Jingze Zhuang,Jinhao Tian,Jinli Liu,Jinyi Hu,Jun Tao,Jun Zhang,Junbin Ruan,Junhao Xu,Junjie Yan,Junteng Liu,Junxian He,Kang Xu,Ke Ji,Ke Yang,Kecheng Xiao,Keyu Duan,Keyu Li,Le Han,Letian Ruan,Li Yuan,Lianfei Yu,Liheng Feng,Lijie Mo,Lin Li,Lingye Bao,Lingyu Yang,Lingyuan Zhou,Loki,Lu Chen,Lunbin Ceng,Ming Li,Ming Zhong,Mingliang Tao,Mingyuan Chi,Mujie Lin,Nan Hu,Ningxin Chen,Peiyin Zhu,Peng Gao,Pengcheng Gao,Pengfei Li,Penglin Li,Pengyu Zhao,Qibin Ren
机构: MiniMax
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Technical Report. 35 pages, 10 figures, 4 tables
Abstract:We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution – autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
[NLP-91] Elias in the Lighthouse Again? Diagnosing Low Diversity in LLM Stories
【速读】: 该论文试图解决的问题是:当前大语言模型(LLM)生成的故事存在显著的低多样性问题,即生成内容高度重复、缺乏变异性。解决方案的关键在于识别并分析导致这种同质化现象的根本原因——研究通过采样来自四个主流模型的20,000条故事,并使用五种提示词进行测试,发现11个高频词汇(如“Elias”、“lighthouse”、“clockmaker”等)出现在88.3%的生成故事中,且不同模型间差异很小。这些词汇虽在公开文学作品或预训练数据中罕见,却广泛存在于偏好数据(preference data)中,而这类数据可能被所有当前模型用于对齐训练。研究进一步揭示,“灯塔故事”(lighthouse stories)在训练后样本中反而较少,说明大量生成内容集中于包含受版权保护角色或成人内容的“平均后期故事”,这凸显了小规模偏好数据与强大对齐算法结合时可能带来的不成比例影响。
链接: https://arxiv.org/abs/2605.26492
作者: Sil Hamilton,David Mimno
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these “lighthouse” stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
[NLP-92] OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants
【速读】: 该论文试图解决的问题是:当前生成式 AI(Generative AI)模型在实时多模态流媒体交互中的表现不足,尤其是在音频-视觉流的在线推理场景下,模型难以准确检测多模态触发信号、适时响应并保持上下文连续性。解决方案的关键在于提出 OmniInteract——一个用于评估实时全模态大语言模型的流式基准测试集,它通过原生在线推理方式对音频-视觉流进行建模,要求模型在不预知未来内容的前提下完成交互任务;其核心创新包括:1)构建包含1,430个时序锚定响应槽(response slots)的数据集,涵盖主动响应、嵌套交互和持续任务监控等复杂场景;2)设计 Interaction-Aware Quality-Timeliness F1(IA-QTF1)、中断诊断套件(Interruption Diagnostic Suite)和嵌套链路完成度评分(Nested Chain Completion Score)等多维度指标,系统评估响应准确性、时机合理性、中断处理能力及上下文连贯性。实验表明,现有模型在流式交互中仍存在显著短板,最佳 IA-QTF1 仅为 0.368,凸显了该领域亟需更先进的在线交互机制。
链接: https://arxiv.org/abs/2605.26485
作者: Xudong Lu,Xueying Li,Annan Wang,Yang Bo,Jinpeng Chen,Zengliang Li,Nianzu Yang,Rui Liu,Xue Yang,Jingwen Hou,Hongsheng Li
机构: CUHK MMLab (香港中文大学多媒体实验室); SJTU (上海交通大学); NTU (南洋理工大学); McMaster (麦克马斯特大学); CityUHK (香港城市大学); JUFE (江西财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at this https URL.
[NLP-93] owards Error-Free EHRs: Reasoning -Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records
【速读】: 该论文旨在解决电子健康记录(EHR)中非结构化临床笔记与结构化表格数据之间的一致性验证问题,现有方法仅依赖数值或事件的表面匹配,无法捕捉临床推理、事件关联和时间演变等深层语义信息。解决方案的关键在于提出EHR-ReasonCon基准和EHR-Inspector框架:前者是一个基于MIMIC-III数据库并由专家标注的高精度推理型一致性验证数据集(含8,048个实体),支持系统化的证据检索;后者是一个基于大语言模型(LLM)的框架,通过笔记分割、锚点实体与时间引用提取,并结合专用表格探索工具实现对结构化数据的一致性验证。实验表明,EHR-Inspector在多种模型架构下均达到当前最优性能,且组件分析揭示其优于人类验证的特定能力。
链接: https://arxiv.org/abs/2605.26463
作者: Yeonsu Kwon,Jiho Kim,Junseong Choi,Paloma Rabaey,Minseo Kim,Sujeong Im,Jeewon Yang,Jun-Min Lee,Sangji Lee,Jiwon Kim,Hangyul Yoon,Hyunwook Kwon,Edward Choi
机构: KAIST (韩国科学技术院); Ghent University (根特大学); Samsung Medical Center (三星医疗中心); Samsung Changwon Hospital (三星昌原医院); Asan Medical Center (安山医疗中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.
[NLP-94] Verus-SpecGym: An Agent ic Environment for Evaluating Specification Autoformalization
【速读】: 该论文试图解决的问题是:如何让大型语言模型(LLM)代理自动将非形式化的编程问题转化为准确的、可执行的正式规范(formal specification),从而确保生成代码不仅语法正确,而且符合用户意图。当前虽然形式化验证(formal verification)能够通过机器可检查的证明保证代码正确性,但其前提——即规范本身是否忠实于用户需求——仍缺乏保障。解决方案的关键在于构建一个名为Verus-SpecBench的基准测试集和Verus-SpecGym的智能体交互环境,用于评估LLM在真实场景中自动生成规范的能力;同时创新性地引入“执行型规范”(exec_spec)机制,使生成的规范可直接作为Rust代码运行,并结合Codeforces官方测试用例与“hacks”(竞对构造的边界案例)进行严格验证,从而克服传统人工标注成本高和LLM判别器易遗漏细微错误的问题。实验表明,前沿模型如Gemini 3.1 Pro在任务上达到77.8%的成功率,但依然存在忽略输入假设、接受错误输出等脆弱性,且LLM作为裁判会遗漏26%的实际失败情况,说明spec autoformalization虽已初见成效但仍需改进。
链接: https://arxiv.org/abs/2605.26457
作者: Anmol Agarwal,Natalie Neamtu,Pranjal Aggarwal,Seungone Kim,Jannis Limperg,Cedric Flamant,Kanna Shimizu,Bryan Parno,Sean Welleck
机构: CMU; Amazon
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL)
备注: Preprint
Abstract:AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user’s intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, LLM judges can miss subtle mistakes. We address this by (a) extending Verus’s exec_spec mechanism so that generated specs can be executed as Rust code, (b) testing them against official Codeforces tests adversarial cases extracted from Codeforces “hacks”, which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1–57.8% OSS models reach only 21.5–25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, logs can be found at this https URL
[NLP-95] Model Unlearning Objectives Vary for Distinct Language Functions
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在预训练过程中习得的有害属性问题,包括危险知识(dangerous knowledge)和毒性文本生成(toxic text generation)。其解决方案的关键在于针对不同类型的不良行为设计差异化的去学习(unlearning)方法:对于危险知识的去学习,提出了一种基于余弦相似度、通过元学习优化的RMU(Removal of Unwanted Knowledge via Meta-Update)变体;而对于毒性内容的去学习,则设计了一种基于层特定探测方向的多层目标函数。实验表明,这两种机制分别针对不同类型的问题,在四个开源7-8B参数规模的模型上均取得了显著效果,强调了将去学习视为一类问题集合(类似LLM后训练任务的多样性)的重要性。
链接: https://arxiv.org/abs/2605.26454
作者: Berk Atil,Vipul Gupta,Rebecca J. Passonneau
机构: Pennsylvania State University (宾夕法尼亚州立大学); Scale AI
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
[NLP-96] Curation and Extraction of Drug-Related Entities from Reddit Platform ALT
【速读】: 该论文试图解决的问题是:医生主要通过临床过量案例了解非法药物,这限制了他们对现实世界中用药情况的理解;而药物使用者在社交媒体上分享的第一手经验则提供了关于剂量和效果的宝贵信息,但这些数据尚未被系统性地用于医学研究。解决方案的关键在于构建并公开一个名为ReDose(REddit Drug DOSe and Effect)的数据集,包含6,435条Reddit上的药物使用帖子,并由专业毒理学家主导标注DRUG、DOSE和EFFECT三类实体。研究进一步对比了基于BiomedBERT、大语言模型(LLM)以及检索增强生成(RAG)的多种方法,发现BiomedBERT在DRUG识别上表现最佳(F1=0.843),而Llama-3 70B在整体性能上优于GPT-4(F1=0.79 vs. 0.72),表明利用高质量人工标注与先进模型结合可有效从社交平台提取真实世界用药信息,从而弥合临床知识与患者实际体验之间的鸿沟。
链接: https://arxiv.org/abs/2605.26445
作者: Zewei Wang,Zihan Xu,Yishu Wei,Michael Chary,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)
Abstract:Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.
[NLP-97] MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies
【速读】: 该论文旨在解决大语言模型在推测解码(speculative decoding)过程中因词汇表过大(通常超过10万词元)导致的最终线性投影层计算瓶颈问题。现有词汇裁剪方法依赖固定或粗粒度子词汇表,需保留约3万活跃词元以维持草稿模型质量,效率低下。其解决方案的关键在于提出MicroSpec——一种无需训练的实时构建机制,能为每个解码步骤动态生成紧凑且上下文敏感的活跃词元集合,利用语言生成中的自然时间局部性(temporal locality),在保持高词元覆盖率的同时将平均词汇规模压缩超40倍(降至3k以内),且不引入额外训练参数。为将这种高稀疏性转化为实际硬件加速效果,作者进一步设计了软硬协同系统与算法,通过异步收集和GPU驻留状态管理缓解稀疏内存访问开销。作为即插即用增强模块,MicroSpec在多个基准测试中平均降低草稿推理延迟51.6%,相较最优推测解码方法EAGLE-2实现1.12–1.32倍端到端加速,并优于更复杂的基于训练的裁剪基线。
链接: https://arxiv.org/abs/2605.26444
作者: Zhiyang Chen,Daliang Xu,Yinyuan Zhang,Chenghua Wang,Mengwei Xu,Yun Ma
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.
[NLP-98] Alignment Tuning for Large Language Models : A Data-Centric Lens on Alignment Data Pipelines ACL2026
【速读】: 该论文试图解决的是当前对齐调优(alignment tuning)研究中普遍存在的问题:即过度聚焦于优化目标,而忽视了对齐数据构建过程的系统性设计。其解决方案的关键在于提出一个以数据为中心的视角,将对齐调优重新定义为一个流水线设计问题,并将对齐数据构建分解为三个相互作用的阶段——响应生成(response synthesis)、偏好评估(preference evaluation)和偏好实例化(preference instantiation)。通过这一框架,作者统一梳理了现有对齐方法并揭示了不同设计选择如何影响最终优化信号的质量与稳定性,从而为未来研究提供了清晰的原则指导和开放挑战方向。
链接: https://arxiv.org/abs/2605.26442
作者: Hwanjun Song
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at the Findings of ACL 2026
Abstract:Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data centric perspective and reframe alignment tuning as a pipeline design problem. We decompose alignment data construction into three interacting stages, response synthesis, preference evaluation, and preference instantiation, and use this framework to organize existing alignment methods into a unified taxonomy. Through this lens, we identify recurring design trade-offs and failure modes observed across prior alignment methods, and distill a set of high level principles that clarify how pipeline design choices influence the resulting optimization signal. Finally, we outline open challenges for alignment data pipelines, including prompt-level alignment, agentic settings, and alignment under evolving objectives.
[NLP-99] Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks
【速读】: 该论文试图解决传统评估基准在面对大型语言模型(Large Language Models, LLMs)快速发展时所面临的可扩展性瓶颈问题,即依赖人工专家标注导致的效率低下与成本高昂。其解决方案的关键在于提出Conv-to-Bench——一个分阶段的自动化框架,能够从真实的多轮用户-助手对话中提取并结构化生成可验证的需求清单(requirement checklists),通过挖掘真实对话日志中的“指令演化”(instructional evolution)机制,将碎片化的用户意图转化为统一的指令和二元评价标准。实验表明,该方法在编程领域生成的评估集与人工制定的标准(如BigCodeBench)高度一致(Spearman相关系数ρ=1.000),同时显著降低计算开销;且基于LLM作为评判者(LLM-as-a-judge)的验证进一步证明其可靠性(Kappa=0.705)。关键发现是:多轮交互捕捉了用户意图的迭代过程,但以指令为中心的提取方式更具鲁棒性,从而为多样化用户导向型AI应用提供了高保真、低成本、可扩展的评估范式。
链接: https://arxiv.org/abs/2605.26440
作者: Victor M. dos Santos,Andre C. Castro,Samuel L. de S. Toledo,Bruno M. L. Calura,Lisandra C. de M. Menezes,Raul C. R. Mata,Telma W. de L. Soares,Bryan L. M. de Oliveira
机构: University of São Paulo (圣保罗大学); Federal University of Goiás (戈亚斯联邦大学); HUG Labs (HUG实验室); Advanced Knowledge Center for Immersive Technologies (沉浸式技术高级知识中心)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注:
Abstract:The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists. By leveraging the “instructional evolution” found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to \rho = 1.000 with significantly lower computational overhead. Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ( \kappa = 0.705). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.
[NLP-100] LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
【速读】: 该论文试图解决的问题是:当前用于评估大语言模型(Large Language Models, LLMs)安全性和对齐性的基准测试存在“评估意识”(evaluation awareness)问题,即模型在检测到处于评估情境时会表现出与真实部署场景下不同的行为,从而削弱了评估结果的有效性。解决方案的关键在于提出LURE(Live-Usage Replay Evaluations)方法——通过重放真实的代理交互轨迹(agentic interaction trajectories),并在末尾附加评估提示,构建更贴近实际部署环境的评估场景;同时引入自动化管道来量化评估的真实性,结合对模型是否明确表达评估意识的检测与裁判模型对日志是否为评估的置信度估计,从而验证LURE在现实感上的显著提升。实验表明,LURE生成的评估比现有基准和合成评估生成器更难与真实用户对话区分,且在多个安全敏感场景(如阴谋、AI安全破坏、谄媚)中均表现出更强的部署真实性,因此作者主张将评估真实性作为对齐基准的核心指标之一,尤其在用于安全论证(safety cases)时必须报告。
链接: https://arxiv.org/abs/2605.26438
作者: Igor Ivanov,David Demitri Africa
机构: Meridian Cambridge; Anthropic; UK AI Security Institute; OpenAI; Google DeepMind; Meta; Stability.AI; Character.ai; Claude; Inspect; Apollo Research; Petri 2.0; Bloom; GPT-5.2; Gemini 3 Flash; Gemini 3 Pro; Claude Sonnet 4.5; Claude Opus 4.6; DeepSeek V3; Llama 4 Scout; Llama 4 Maverick; Aider; Parlant; continue.dev; honeypot; prism; gandalf_ignore_instructions; gandalf_summarization; hackaprompt; aider_commit; aider_manual; aider_multi; irl_cursor; maigret; union; chartdb; lightrag; easysteer.steer; probe_model.py; covert_actions; positive.txt; negative.txt; Stars and Stripes; Civil War; WW2; armadillos; military newspapers; leprosy; flea treatments; recent news articles; arXiv papers; SYCON-Bench; DarkBench; Inspect “are you sure?”; truthfulness; stance consistency; pre-response bias; sycophancy; scheming; sabotage; agentic coding; deployment-realistic evaluation; evaluation awareness; P(eval); LURE-Scheming; LURE-Sabotage; LURE-Sycophancy; PropensityBench; InstrumentalEval; Apollo OSv; Bloom; Petri 2.0; OpenAI Production Evaluations; MacDiarmid et al.; Kissane et al.; Buhl et al.; Balesni et al.; Needham et al.; Abdelnabi and Salem; Hua et al.; Ivanov; Taylor; Africa et al.; COLM 2026; Coefficient Giving; Bristol Centre for Supercomputing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.
[NLP-101] argeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models
【速读】: 该论文旨在解决离散掩码扩散语言模型(如LLaDA)在文本生成过程中因Token-to-Token(T2T)编辑机制导致的三大根本性问题:错误检测与替换耦合、生成上下文被潜在错误token污染,以及训练-推理噪声不匹配。其核心解决方案是提出一种无需训练、可直接替代T2T的Token-to-Mask(T2M)重掩码机制——通过将疑似错误token恢复为掩码状态,使扩散过程能在更干净的上下文中重新预测这些token。作者设计并实证验证了三种互补的错误检测策略(基于概率、触发镜像和时序差分),并通过统一理论分析表明,T2M不仅净化了生成上下文,还将系统性推理误差转化为模型原生的掩码噪声类型,同时支持延迟承诺以实现多位置联合优化。在12个涵盖知识、推理、数学、编程和指令遵循任务的基准测试中,T2M显著提升了对精确token级输出依赖的任务性能,尤其在数学任务CMATH上提升达+5.92%;错误分析进一步揭示,T2M修复了59.4%的“最后一公里”token污染问题(即正确推理后最终答案被破坏)。
链接: https://arxiv.org/abs/2605.26436
作者: Lin Yao
机构: Shanghai Jiao Tong University (上海交通大学); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively replaced with predicted tokens. LLaDA2.1 introduced a Token-to-Token (T2T) editing mechanism that accelerates generation by directly replacing committed tokens suspected of being incorrect. However, we identify fundamental limitations of T2T editing: it couples error detection with replacement, pollutes the generation context with potentially incorrect tokens, and introduces a train-inference noise mismatch where systematic model-generated errors differ from the random perturbations seen during training. We propose Token-to-Mask (T2M) remasking, a training-free, drop-in replacement for T2T editing that resets suspected erroneous tokens back to the mask state, allowing the diffusion process to re-predict them under cleaner context. We design and empirically validate three complementary error detection strategies – probability-based, trigger-mirrored, and temporal-difference-based – and provide a unified theoretical analysis showing that T2M remasking purifies the generation context, converts systematic inference errors back to the model’s native mask noise type, and enables delayed commitment for joint multi-position optimization. Comprehensive experiments across 12 benchmarks spanning knowledge, reasoning, mathematics, coding, and instruction following show that T2M generally improves performance on tasks requiring precise token-level output, with the largest gain on mathematics (+5.92% on CMATH). Error analysis on CMATH reveals that the dominant failure mode is last-mile token corruption – where correct reasoning produces a corrupted final answer – and that T2M repairs 59.4% of such cases.
[NLP-102] Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization
【速读】: 该论文试图解决大语言模型(Large Language Model, LLM)在生成临床出院摘要时,其输出的紧凑向量表示可能泄露原始输入中的敏感信息(如电子健康记录中的种族信息)的问题。尽管源文档受到访问限制,这些衍生向量仍可能因不同访问控制策略而被下游检索、监控或分析流程使用,从而造成残留的信息泄露风险。解决方案的关键在于针对具体暴露的向量产物(如最终提示词隐藏状态与平均池化表示)进行精细化隐私审计和定向防御:研究提出SurfaceLoRA方法,通过在目标导出向量上附加梯度反转判别器实现参数高效微调,在保持摘要生成能力的同时显著降低特定向量中敏感标签的可恢复性,但同时也揭示了仅保护一个向量产物并不能消除其他未受保护向量中的泄露风险,强调隐私防护必须基于实际暴露的向量结构进行定制化处理。
链接: https://arxiv.org/abs/2605.26433
作者: Weixin Liu,Bowen Qu,Juming Xiong,Congning Ni,Bradley A. Malin,Zhijun Yin
机构: Vanderbilt University (范德比尔特大学); Vanderbilt University Medical Center (范德比尔特大学医学中心)
类目: Computation and Language (cs.CL)
备注: 30 pages, 2 figures; preprint
Abstract:Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.
[NLP-103] Probing Minimalist Phase Structure in LLM s: What Universal Dependencies Cannot Represent
【速读】: 该论文试图解决的问题是:大型语言模型(LLMs)是否编码了形式句法抽象(如最小主义程序(Minimalist Program, MP)中的相位边界(phase boundaries)和相内凝聚力(phase-internal cohesion)),而这些问题无法通过基于通用依存句法(Universal Dependencies, UD)的结构探测(structural probes)来识别,因为UD本身不包含这些抽象信息。解决方案的关键在于设计了一种基于wh移动(wh-movement)的刺激任务,其中UD距离在不同条件间保持不变,从而确保任何检测到的效果均来自UD之外的结构信息。研究发现,在跨句对中存在相位计数梯度效应(12/13个模型),且在句内对中出现显著的符号不对称性(13/13个模型),后者正是MP中相内凝聚力所预测的特征;激活修补(activation patching)进一步验证了这些表征在12/13个模型中具有因果作用。这表明,分布预训练能够诱导出与形式句法抽象一致的表征,而基于UD的探测仅提供句法编码的下界,而非上界。
链接: https://arxiv.org/abs/2605.26431
作者: Yuanhao Chen,Peter Chin
机构: Dartmouth College (达特茅斯学院)
类目: Computation and Language (cs.CL); Applications (stat.AP)
备注:
Abstract:Structural probes train on Universal Dependencies (UD), which does not encode formal-syntactic abstractions such as phase boundaries or phase-internal cohesion. Whether large language models (LLMs) encode these remains an open question that UD-based probing cannot answer by construction. We evaluate structural probes on wh-movement stimuli where UD distances are invariant across conditions by design – any non-zero effect therefore reflects structure beyond UD. The three conditions – bare small clause, infinitival, and finite – are ordered by the number of Minimalist Program (MP) phase boundaries the wh-element crosses. Across 13 LLMs from four families, we find a phase-count gradient on a cross-clause pair (12/13 models) and a 13/13 sign asymmetry on a within-clause pair whose UD distance is identical across conditions – the latter specifically predicted by phase-internal cohesion, an MP abstraction invisible to UD by construction. Activation patching confirms the representations are causally active in 12/13 models. These findings suggest that distributional pretraining can induce representations aligned with formal-syntactic abstractions beyond the reach of annotation-based probing; UD-grounded probes provide a lower bound on syntactic encoding, not an upper bound. Subjects: Computation and Language (cs.CL); Applications (stat.AP) Cite as: arXiv:2605.26431 [cs.CL] (or arXiv:2605.26431v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.26431 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-104] Reasoning Code or Both? How Large Language Models Handle Variations in Math Questions
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在数学推理基准测试中表现优异,但在面对简单变化(如名称或数字的调整)时性能显著下降,即其推理鲁棒性(reasoning robustness)不足。为提升鲁棒性,已有研究提出代码执行方法(code execution),通过让模型生成并运行Python代码替代自然语言推理,但其对鲁棒性的实际影响尚未系统验证。解决方案的关键在于比较三种不同方法在相同数据集(GSM-Symbolic)上的表现:纯链式思维(Chain-of-Thought, CoT)提示、单次代码执行(Program-Aided Language models, PAL)和逐步编码(Step-by-Step Coding, SBSC)。实验结果显示,CoT方法在问题扰动下保持最高鲁棒性(准确率下降仅1.3个百分点,仅1.8%的问题失效),而PAL最不鲁棒(下降1.7个百分点,3.1%失效),SBSC居中;尽管差异未达统计显著性(p = .096),但趋势一致,表明代码执行(无论单次或迭代)并未有效提升针对小学水平问题变体的推理鲁棒性。
链接: https://arxiv.org/abs/2605.26414
作者: Matthew Kutakh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages, 4 figures, 2 tables
Abstract:Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ( p = .096 ), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
[NLP-105] owards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM
【速读】: 该论文试图解决的问题是在真实教学环境中缺乏明确方法来为学生提供及时、个性化的反馈(Just-in-Time, JiT feedback),尽管大型语言模型(Large Language Models, LLMs)具备规模化生成适应性反馈的能力。解决方案的关键在于将LLM与领域专家知识相结合,通过收集学生的书面推理逻辑(策略作文),分析其中潜在的错误类型,并据此提供非侵入式反馈,以澄清缺失或错误的概念。在一项涵盖1000名大学生的大规模课程部署中,该框架使学生成绩提升超过80%,并通过学习轨迹分析验证了其教学有效性:迭代式与LLM的对话能够促进学生从误解向正确理解的转变。
链接: https://arxiv.org/abs/2605.26405
作者: Younghun Lee,Amir Bralin,Nobel Sanjay Rebello,Dan Goldwasser
机构: Purdue University (普渡大学)
类目: Computation and Language (cs.CL)
备注: 8 pages, Accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Abstract:Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale university course (N 1000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework’s pedagogical utility by analyzing the learning trajectories; we demonstrate how iterative conversations with LLM facilitate shifting one’s misconception to correct understanding.
[NLP-106] Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在决策任务中可能放大或抑制特定群体视角的问题,尤其关注其对自闭症群体的偏见表现。具体而言,研究旨在厘清LLMs如何理解“能力主义”(ableism)以及如何在文本中识别这种偏见,此前相关研究虽已发现LLMs存在与残疾相关的偏见,但缺乏对其概念化机制和检测能力的深入分析。解决方案的关键在于提出一个“偏见感知型评估框架”,该框架基于心理测量学加权、贴近社区立场的真实标注数据,并强调标注者的社会位置性(positionality),从而构建比传统多数投票聚合方法更严格且公平的评估标准——后者显著低估了自闭症人士及其支持者的声音。实证结果显示,LLMs常产生有害输出,错误地将社群内部再定义的语言标记为能力主义,并在评估工具遮蔽时表现出更强的负面态度;进一步的误差分析表明,模型主要依赖表面关键词匹配而非语境因素(如说话者身份、语言是否促进群体内团结或伤害外部群体)。
链接: https://arxiv.org/abs/2605.26397
作者: Naba Rizvi,Harper Strickland,Saleha Ahmedi,Nedjma Ousidhoum
机构: University of California, San Diego (加州大学圣地亚哥分校); Cardiff University (卡迪夫大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: main paper: 8 pages; total: 18 pages; 2 figures
Abstract:Large language models (LLMs) are increasingly used in decision-making tasks where they can amplify or suppress perspectives, raising concerns in high-stakes settings affecting autistic communities. While previous research has identified disability-related biases in LLMs, it remains unclear how they conceptualize ableism or detect it in text. We introduce a bias-aware evaluation framework targeting anti-autistic ableist language with a psychometrically-weighted, community-proximate ground truth anchored in annotator positionality. This framework constitutes a stricter standard than conventional majority-vote aggregation which significantly and consistently underweights autistic and autism-accepting perspectives. We find that LLMs frequently produce harmful outputs, mislabel community-reclaimed language as ableist, and express more negative attitudes toward autistic people when assessment instruments are masked. Our error analysis reveals that models rely on surface-level keyword matching rather than contextual factors such as speaker identity, and whether the language fosters in-group solidarity or inflicts out-group harm.
[NLP-107] Advancing Creative Physical Intelligence in Large Multimodal Models
【速读】: 该论文试图解决的问题是:当前大型多模态模型(LMMs)在视觉感知和推理能力上虽已取得显著进展,但其是否能在开放环境中发现与视觉场景紧密结合的创造性解决方案仍不明确——即模型能否像人类一样,识别场景中物体的非显而易见但物理可行的用途(affordance),并进行创造性工具使用。这一能力不仅涉及模式识别,更依赖于对环境的持续、有依据的探索与组合推理。现有基准测试未能充分评估此类“具身创造力”。
解决方案的关键在于提出一个名为MM-CreativityBench的新基准,用于评估模型在视觉丰富且物理受限环境中基于可操作性(affordance)进行创造性工具使用的水平,并设计了一种“基于可操作性的对齐”(affordance-grounded alignment)方法:将创造性工具使用建模为偏好学习问题,利用直接偏好优化(Direct Preference Optimization)鼓励模型优先选择基于图像证据的属性-可操作性推理,而非幻觉性回答;同时引入来自可操作性知识库的监督信号,引导模型扩展实体探索范围并支持多轮规划。实验表明,该方法显著提升了正确实体与部件的选择率,大幅减少了幻觉和接地错误。
链接: https://arxiv.org/abs/2605.26396
作者: Cheng Qian,Hyeonjeong Ha,Jiayu Liu,Jeonghwan Kim,Emre Can Acikgoz,Bingxuan Li,Kunlun Zhu,Jiateng Liu,Aditi Tiwari,Zhenhailong Wang,Xiusi Chen,Mahdi Namazifar,Heng Ji
机构: UIUC (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 51 Pages, 9 Figures, 7 Tables, Previous Work CreativityBench: arXiv:2605.02910
Abstract:Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
[NLP-108] Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study
【速读】: 该论文试图解决企业级多轮Text-to-SQL(文本转SQL)任务中长期存在的评估局限性问题,即当前研究大多局限于单轮场景,而实际应用中用户常通过多轮交互逐步构建复杂查询,且对上下文记忆能力依赖显著。解决方案的关键在于:首先构建了一个由300个会话、1400个回合组成的多轮Text-to-SQL基准数据集EnterpriseMem-Bench,涵盖三个企业领域(BIRD金融、SEC EDGAR、Northwind),并提供确定性的真值和逐回合的记忆关键标注;其次设计了一种三重消融实验框架,独立分离工作记忆窗口大小、情景检索和语义增强三种记忆机制的影响;最后引入Memory Benefit Score(MBS)作为每回合的诊断指标,量化模型在不同记忆条件下的表现差异。这一方法揭示了多轮Text-to-SQL性能瓶颈的核心来源,并为后续模型优化提供了可解释的评估路径。
链接: https://arxiv.org/abs/2605.26394
作者: Ravi Kumar Tummalapenta,Suman Addanki
机构: JP Morgan Chase Co.
类目: Computation and Language (cs.CL)
备注: 18 pages, 4 figures, 14 tables; includes appendices with verbatim prompts, example session, and full ablation tables; prepared by the LLM Suite Engineering Team, JP Morgan Chase Co
Abstract:Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark of 300 sessions and 1,400 turns built programmatically from three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with deterministic ground truth and per-turn memory-critical annotation. We evaluate five frontier models – GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 – across five memory conditions enabling a three-way ablation isolating working-memory window size, episodic retrieval, and semantic augmentation as independent effects. All Claude models are evaluated with extended thinking enabled to maintain parity with GPT reasoning models. We introduce the Memory Benefit Score (MBS) as a per-turn diagnostic metric. Four findings emerge: (1) stateless multi-turn Text-to-SQL collapses to zero execution accuracy by Turn 3 across all five models, even under reasoning; (2) memory-architecture complexity does not monotonically improve accuracy – working memory dominates, and additional components produce model- and dataset-dependent effects from +14 to -16 percentage points; (3) Claude Sonnet 4.6 underperforms Sonnet 4.5 by 17-33pp on SEC EDGAR across conditions, a generational regression persisting under reasoning; (4) under reasoning, Claude error distributions become mono-modal – every non-correct turn is a wrong-result error. We release the benchmark, agent, and evaluation code.
[NLP-109] Cultural Value Alignment Via Latent Activation Steering in Large Language Models ACL2026
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在文化认知上表现出同质化倾向的问题,即模型难以真实反映多元文化的深层价值结构。传统直接提示方法在使用世界价值观调查(World Values Survey, WVS)数据时,常因模型的安全对齐机制而产生拒绝回应或中立回答,无法挖掘其潜在的文化表征。解决方案的关键在于提出一种可泛化的文化评估与干预框架:通过将抽象问题转化为情境化的行为探测(scenario-based behavioral probing),利用300个情境困境提取隐式token概率分布,从而绕过表层对齐机制,映射出LLMs内部的文化价值坐标;进一步引入激活引导(activation steering)技术,在前向传播过程中动态调整内部表征,无需重新训练即可实现文化维度的定向调控。实验发现,不同模型在适应性上存在显著差异,并揭示了“潜在纠缠”现象——即在一个文化维度上的干预会引发其他维度的非预期偏移,表明文化价值在模型中以耦合结构编码,限制了精准对齐的可能性。该研究构建了一种计算高效的跨文化引导框架,凸显了在多文化语境下理解和操控LLMs价值结构的复杂性。
链接: https://arxiv.org/abs/2605.26365
作者: Trung Duc Anh Dang,Sarah Masud
机构: University of Copenhagen
类目: Computation and Language (cs.CL)
备注: ACL 2026 Student Research Workshop (Non-Archival Track)
Abstract:Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model’s latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs cultural value. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities when navigating global value with LLMs.
[NLP-110] Why LLM s Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations ACL2026
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在依赖结构化外部知识(如图和表格)进行推理时仍产生幻觉输出的问题,尤其是揭示其内在机制为何会导致此类错误。解决方案的关键在于识别出两个核心机制:一是注意力机制倾向于过度聚焦于“捷径式”的结构线索,而非均匀分配到整个上下文;二是前馈层未能有效锚定所提供的外部知识,导致模型退化至参数化记忆。研究进一步表明,幻觉与前馈层中语义锚定失败密切相关,而注意力分配则更具任务依赖性;这些机制在单跳图、多跳图及表格等多种结构化知识形式中均具泛化能力,从而为跨格式的幻觉检测提供了可解释且有效的依据。
链接: https://arxiv.org/abs/2605.26362
作者: Shanghao Li,Jinda Han,Yibo Wang,Yuanjie Zhu,Zihe Song,Langzhou He,Kenan Kamel A Alghythee,Philip S. Yu
机构: University of Illinois Chicago (芝加哥大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To appear in Proceedings of ACL 2026
Abstract:In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention disproportionately concentrates toward shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.
[NLP-111] In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective
【速读】: 该论文试图解决的问题是:在检索增强生成(RAG)中,如何将检索到的文档从静态证据转化为可动态适应的信号,从而提升模型性能。传统RAG方法通常将检索内容视为固定输入,忽略了其潜在的优化价值。解决方案的关键在于:通过理论分析发现,单层线性自注意力机制可以等价于对一个统一的线性化RAG目标函数执行一次梯度下降步骤,从而建立了检索增强预测与上下文内优化之间的精确对应关系。基于此发现,作者提出一种轻量级方法,在冻结大语言模型(LLM)的前提下,仅对生成端的证据使用接口进行上下文条件更新,无需调整检索器或主干模型。实验表明,该方法在多个问答基准测试中优于共享接口基线,并能泛化到未见过的任务,同时以远低于测试时梯度适配的成本实现接近其性能。
链接: https://arxiv.org/abs/2605.26356
作者: Mingchen Li,Jiatan Huang,Chuxu Zhang,Liang Zhao,Hong Yu
机构: University of Massachusetts, Amherst (马萨诸塞大学阿默斯特分校); University of Connecticut (康涅狄格大学); Emory University (埃默里大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.
[NLP-112] Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention
【速读】: 该论文试图解决标准Transformer注意力机制在处理输入信息时缺乏对不同token重要性(能量显著性)和多尺度局部性(尺度选择性局部性)的建模问题。标准注意力将所有token视为同等显著、所有位置视为同等局部,忽略了输入数据内在的信息结构。解决方案的关键在于引入两个简单但有效的组件:一是能量门控注意力(Energy-Gated Attention, EGA),通过一个线性投影学习token嵌入的能量估计来控制值聚合,实现对关键信息的选择性关注;二是Morlet位置编码(Morlet Positional Encoding, MoPE),用可学习的高斯窗调制的Morlet小波替代固定正弦编码,自适应地调整每个频率下的位置-频率联合定位能力。实验表明,EGA与MoPE组合后性能提升超过两者单独效果之和(+0.119验证损失改善),体现二者作为互补归纳偏置的协同效应,且其超加性在两次独立训练中均被验证,揭示了学习型结构化先验优于固定谱先验的重要性。
链接: https://arxiv.org/abs/2605.26355
作者: Athanasios Zeris
机构: Independent Researcher(独立研究员)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: this https URL
Abstract:Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 – more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
[NLP-113] RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents
【速读】: 该论文试图解决交互式检索中语言代理(language agents)的信用分配(credit-assignment)难题,即如何在多轮检索过程中准确评估不可观测的隐式推理步骤(latent reasoning steps)对最终检索效果的贡献。传统方法依赖结果层面的奖励信号,但这种信号易受干扰,无法区分哪些推理步骤真正影响了后续可执行动作(如查询或摘要)。解决方案的关键在于提出RICE-PO框架——一种无需评鉴器(critic-free)的策略优化方法,其核心机制是:以高不确定性可执行动作为锚点,利用检索指标评估局部反事实分支,并仅在推理到动作的影响显著且未来残差效应稳定时,才将信用传递给隐式推理步骤。实验表明,RICE-PO在BRIGHT和BEIR数据集上优于基于提示(prompt-based)和群体强化学习(group-based RL)的基线方法,证明了交互结构本身可提供有效的监督信号用于训练基于推理的检索代理。
链接: https://arxiv.org/abs/2605.26352
作者: Mingchen Li,Hansi Zeng,Zhuo Qian,Jiatan Huang,Hamed Zamani,Hong Yu
机构: University of Massachusetts, Amherst (马萨诸塞大学阿默斯特分校); Texas Tech University (德克萨斯理工大学); University of Connecticut (康涅狄格大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.
[NLP-114] he Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology
【速读】: 该论文旨在解决放射肿瘤学临床实践中信息过载与效率低下的问题,具体表现为医生需花费大量时间手动整理电子健康记录(EHR)信息并筛选潜在的临床试验匹配项。其解决方案的关键在于开发并集成一个基于大语言模型(LLM)的自动化系统——The Daily Dose(TDD),该系统利用RadOnc-GPT模型每日自动生成个性化的临床摘要和临床试验识别结果,并通过电子邮件推送至医生端,从而减少人工操作、提升决策效率。初步评估显示,该系统在临床使用中具有良好的可用性、满意度及对工作流程的积极影响,且内部一致性信度高(Cronbach’s α = 0.97)。
链接: https://arxiv.org/abs/2605.26346
作者: Jason Holmes,Federico Mastroleo,Mariana Borras-Osorio,Srinivas Seetamsetty,Satomi Shiraishi,Mirek Fatyga,Judy C. Boughey,Cornelius A. Thiels,William G.Breen,Daniel J. Ma,Daniel K. Ebner,David M. Routman,Brady S. Laughlin,Carlos E. Vargas,Samir H. Patel,Sujay A. Vora,Nadia N. Laack,Andrew Y.K. Foong,Wei Liu,Mark R. Waddle
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 4 figures, 1 table
Abstract:Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach’s \alpha . Results: Among 55 respondents, 52 (94.5%) worked in radiation oncology, and 38 (69.1%) were attending physicians. Most participants (83.6%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings ( p .001 ). Participants reported variable time savings, with 27% estimating \geq 10 minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach’s \alpha = 0.97).
[NLP-115] QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling
【速读】: 该论文旨在解决量化后训练(post-training quantization)中因丢弃权重行内坐标结构而导致的性能损失问题,特别是传统标量量化方法忽略权重行内二维空间相关性所带来的精度下降。其解决方案的关键在于提出QAM-W(Quadrature Amplitude Modulation for Weights)编码器:首先对每行权重进行L2归一化,再通过块哈达玛(block-Hadamard)旋转增强结构信息,随后将数据配对为二维坐标并使用基于单位圆高斯分布训练的单一Lloyd-Max码本进行量化,同时引入激活感知的逐通道缩放机制以提升精度。实验表明,在5.5比特/权重(bpw)下,该方法在五个不同规模的大语言模型(1.1B–13B参数)上保持与BF16精度相差不超过±0.4%,且仅需SmoothQuant W8A8所需权重比特数的68%;联合二维编码优于极坐标编码(幅度×相位)2–15个点的困惑度(pp ΔPPL),且配对KL散度与困惑度变化高度一致(Spearman ρ=0.99),验证了从编码失真到KL散度的单调复合边界。该方法在5–6 bpw区间实现了高质量量化,填补了低比特率下的性能空白。
链接: https://arxiv.org/abs/2605.26339
作者: Preetam Sharma,Kacper Dobek
机构: Poznan University of Technology, Poznan, Poland; Independent Research
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B–13B parameters) and eight quantized configurations, the activation-aware variant at \approx 5.5 bpw stays within \pm 0.4% of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at 32% fewer weight bits. Joint 2D coding outperforms polar (amplitude \times phase) coding by 2–15~pp \Delta PPL at equal bitrate, and paired KL against BF16 tracks \Delta PPL% at Spearman \rho = 0.99 across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5~bpw variant is competitive on quantization-tolerant architectures. At strict 4~bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5–6~bpw band.
[NLP-116] MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding
【速读】: 该论文旨在解决通用多模态模型(GMMs)在地震学等专业科学领域应用受限的问题,核心瓶颈在于缺乏整合多种模态数据(如时间序列波形、地理图像和上下文元数据)的高质量领域专用数据集。其解决方案的关键在于构建了MultiSeismo——一个大规模结构化多模态地震数据集,包含16,000余个地震事件,涵盖全球分布的时间序列波形记录、强度图、人口暴露可视化及标准化JSON格式的文本描述;同时开发了MISCE指令集以支持GMMs的监督训练与评估,并基于此微调出首个专用于地震分析的多模态模型SeisModal(在Unified IO 2基础上引入专用时间序列编码器)。实验表明,现有通用多模态模型在处理时间序列数据上存在显著挑战,而SeisModal在地震多模态推理任务中表现优异,验证了该数据集作为基准的价值及领域特化架构的有效性。
链接: https://arxiv.org/abs/2605.26320
作者: Sai Munikoti,Ian Stewart,Chengping Chai,Lisa Linville,Scott Vasquez,Sameera Horawalavithana,Karl Pazdernik
机构: Pacific Northwest National Laboratory (PNNL); Oak Ridge National Laboratory (ORNL); Sandia National Laboratory (SNL); North Carolina State University (NCSU)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of timeseries waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large scale structured multimodal seismic dataset, comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event data integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross modal analysis. We leverage MISCE to finetune an existing multimodal model (Unified IO 2) enhanced with a specialized timeseries encoder, which yields SeisModal, the first domain specific multimodal model for comprehensive seismic analysis. Evaluation of state of the art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general purpose models, while demonstrating SeisModal’s superior performance on seismic multimodal reasoning tasks. These results prove that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain specific architectural adaptations.
[NLP-117] CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
【速读】: 该论文试图解决多语言场景下偏好微调(preference tuning)中缺乏语言特定标注数据的问题,旨在实现跨语言偏好学习而无需为每种语言单独标注偏好数据。其解决方案的关键在于提出一种跨语言对比偏好微调方法(Cross-lingual Contrastive Preference Tuning on Self-generations, CroCo),即利用英语偏好奖励模型对多语言自生成响应进行排序,并在单语或跨语种设置下进行偏好优化。实验表明,该方法能在大多数高/低资源语言中有效迁移偏好信号,提升模型性能,同时避免监督微调后的灾难性遗忘;但关键前提是必须使用策略内(on-policy)数据,策略外(off-policy)响应会削弱效果,且在线偏好优化无法优于离线版本。
链接: https://arxiv.org/abs/2605.26293
作者: Mike Zhang,Ali Basirat,Desmond Elliott
机构: Department of Computer Science (DIKU), University of Copenhagen; Centre for Language Technology (CST), University of Copenhagen; Pioneer Centre for Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.
[NLP-118] Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning MICCAI2026
【速读】: 该论文旨在解决视觉-语言基础模型(Vision-Language Foundation Models)在生物医学图像中进行参数高效微调时,因现有方法为确定性且在领域偏移或图像-文本对齐模糊时表现不佳的问题,尤其在临床场景下数据稀缺和领域变化频繁的挑战。其解决方案的关键在于提出Evi-Steer框架,通过轻量级低维token更新机制,在仅更新0.11%模型参数的前提下实现不确定性感知的参数高效微调;同时利用贝叶斯证据理论估计认知不确定性(epistemic uncertainty),并通过门控残差机制使模型在证据不足时保守适应;此外,引入基于Dempster-Shafer证据理论的跨模态置信融合机制,使视觉适配依赖于文本置信度并抑制冲突或不确定的跨模态更新,从而显著提升模型在少样本学习和领域泛化场景下的鲁棒性与实用性。
链接: https://arxiv.org/abs/2605.26292
作者: Taha Koleilat,Hassan Rivaz,Yiming Xiao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: MICCAI 2026 Early Accept; Project Page: this https URL
Abstract:Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at this https URL.
[NLP-119] SPEAR: Code-Augmented Agent ic Prompt Optimization EMNLP2026
【速读】: 该论文试图解决自动提示工程(Automatic Prompt Engineering, APE)中现有优化循环将优化器视为固定流水线所带来的局限性问题,即缺乏对复杂错误模式的深入分析与动态调整能力。其解决方案的关键在于提出SPEAR(Sandboxed Prompt Engineer with Active Roll-back),一个具备自主决策能力的代理式优化器,它通过四个工具(evaluate、python、set_prompt、finish)实现闭环优化,其中最具创新性的工具是Python沙箱环境——允许优化器在当前评估数据框上执行任意Python代码,进行结构化错误分析(如混淆矩阵、错误聚类、分组指标等),从而精准定位并修复提示中的深层问题。此外,论文引入两个保障机制:基于指标下降的自动回滚和可选的指标下限守卫,确保优化过程单调提升。实验表明,SPEAR在多个工业级LLM作为裁判任务(涵盖招聘初筛、对话记忆、查询优化等13个任务)及BBH和GSM8K基准测试中全面超越现有方法,尤其在复杂判别任务中,移除Python工具会导致性能显著下降(κ提升达+0.79),证明其不可替代性在于能够可靠聚合类别间混淆信息,这是长上下文大语言模型无法从原始评估数据框中提取的。
链接: https://arxiv.org/abs/2605.26275
作者: Mengyin Lu,Cong Feng,Huimin Han,Guangming Lu,Yu Sun,Xiaonan Ding,Shihui Long,Fengyi Li,Tanvi Motwani
机构: 未知
类目: Computation and Language (cs.CL)
备注: 19 pages, 3 figures, EMNLP 2026 submission
Abstract:Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools – evaluate, python, set_prompt, finish – that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ( \kappa 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; \kappa 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ( \Delta \approx +0.79\kappa on the 5-class tool-selection judge, \Delta \approx +0.35\kappa on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.
[NLP-120] SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
【速读】: 该论文试图解决软件仓库功能正确性配置(Functionality-correct repository setup)问题,即如何自动构建可执行环境以成功运行仓库中定义的功能。现有大语言模型(LLM)代理在应对依赖冲突、工具链缺失、安装不完整及验证策略不匹配等多样且特定于仓库的失败时表现不足,主要受限于无法实现跨仓库经验迁移、多步骤试错修复中的非可逆状态变化处理,以及对设置结果的鲁棒验证。解决方案的关键在于提出SetupX框架:其核心创新包括(1)自进化经验表示(XPU),通过双模态知识单元编码配置信号与文本指导,实现已验证环境修复的动态迁移;(2)基于LIFO Docker快照栈的经验增强推测执行机制,支持主动尝试修复并安全回滚至已知良好状态;(3)检察官-裁判验证协议(Prosecutor-Judge Verification Protocol),将证据收集与最终判断分离,提升验证可靠性,超越表面构建指标。实验表明SetupX在复杂多仓库场景下性能最优(如92%通过率),显著优于最强基线超19%,尤其擅长协调多个容器间互联服务的复杂部署。
链接: https://arxiv.org/abs/2605.26186
作者: Zihang Zhou,Ziqian Ren,Yukai Wu,Yingjie Xiong,Wei Zhou,Chao Peng,Dong Zhang,Bingheng Yan,Xuanhe Zhou,Fan Wu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 6 figures
Abstract:Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository’s documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully-crafted benchmarks show SetupX achieves highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setup requiring coordinating multiple interconnected services across different containers. The code repository is available at this https URL.
[NLP-121] ool-Schema Compression Enables Agent ic RAG Under Constrained Context Budgets
【速读】: 该论文试图解决的问题是:在代理式检索增强生成(Agentic RAG)系统中,工具定义(tool schemas)所占用的上下文窗口资源与检索增强生成所需的上下文资源之间存在严重冲突,导致模型在低上下文预算下无法有效执行任务。解决方案的关键在于引入一种名为TSCG保守型压缩(TSCG conservative-profile compression)的技术,该技术可减少44%-50%的工具定义token数量,从而显著缓解工具schema与RAG内容之间的资源竞争。实验表明,在8K上下文预算下,未压缩的JSON schema会导致上下文溢出,使精确匹配(EM)性能降至2.6%,而压缩后平均EM提升20.5个百分点;在32K预算下,压缩与非压缩方案无显著差异,进一步验证了问题本质为上下文预算限制。此外,外部验证和前沿扩展测试也确认了压缩技术对大规模工具部署的必要性,证明其是受限上下文场景下实现高效Agentic RAG系统的基础设施层。
链接: https://arxiv.org/abs/2605.26165
作者: Furkan Sakizli
机构: Independent Researcher
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: this https URL
Abstract:Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K – where both formats fit – four of five tested models show delta = 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.
[NLP-122] Pretraining Data Exposure in Large Language Models : A Survey of Membership Inference Data Contamination and Security Implications
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在预训练过程中因训练数据规模庞大且不透明而引发的预训练数据暴露(Pretraining Data Exposure, PDE)问题。PDE 指的是判断特定数据是否曾出现在 LLM 的预训练语料库中,这对保障评估完整性与隐私保护至关重要,涉及数据污染(data contamination)和成员推理(membership inference)两个关键领域。论文的关键贡献在于首次在统一的 PDE 框架下整合这两个研究方向,通过形式化不同暴露层级、系统回顾攻击与防御方法、综合实证结果,并指出当前开放挑战与未来研究方向,为该领域的理论与实践提供了系统性梳理。
链接: https://arxiv.org/abs/2605.26133
作者: Ziyi Tong,Feifei Sun,Le Minh Nguyen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: accepted by NLDB 2025
Abstract:Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM’s pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions. Comments: accepted by NLDB 2025 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.26133 [cs.CL] (or arXiv:2605.26133v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.26133 Focus to learn more arXiv-issued DOI via DataCite Related DOI: https://doi.org/10.1007/978-3-031-97144-0_14 Focus to learn more DOI(s) linking to related resources
[NLP-123] Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline
【速读】: 该论文试图解决的问题是:如何在不依赖外部教师或工具反馈的情况下,仅使用未标注的提示(prompts)对后训练的大语言模型(LLMs)进行自我改进。其核心挑战在于从无标签种子问题出发,通过自监督方式提升模型在数学、科学和编程等推理任务中的表现。解决方案的关键在于提出一种名为“自验证蒸馏”(Self-Verified Distillation)的简单后训练优化算法——该方法让模型生成候选解,再通过基于提示的三阶段级联自验证机制(循环一致性、事实性与正确性检查)筛选高质量答案,并用这些自标注数据进一步训练模型。该策略利用了多验证器筛选机制的思想,使模型能够以统一标准过滤自身输出,从而构建高质量的自监督训练集。实验表明,增加候选生成数量和验证预算可显著提升数据质量与最终模型性能,在Qwen3系列模型中均取得显著收益,且优于仅在测试时优化的基线方法(UQ-TTC),同时保持单次推理即可完成测试。
链接: https://arxiv.org/abs/2605.26132
作者: Tony Lee,Percy Liang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark’s use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
信息检索
[IR-0] Separating Semantic Competition from Context Length in RAG Reading
链接: https://arxiv.org/abs/2605.27294
作者: Vyzantinos Repantis,Ameya Gawde,Harshvardhan Singh,Rohit Alekar,Cien Zhang,Svetlana Karslioglu,Akash Vishwakarma
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 4 pages, 1 figure, 2 tables
Abstract:Retrieval-augmented generation (RAG) systems can respond incorrectly even when the correct passage was retrieved. The model must still read the retrieved passages and identify which one contains the answer among others that look relevant. This passage-reading model is called the reader. Does it fail simply because the context is longer or because the other passages genuinely compete with the correct one? We introduce and demonstrate a matched-control protocol for RAG reading: we keep the number and length of passages fixed, but replace hard competitors with less competitive real passages. We apply this control across two compact open models on SQuAD. This replacement partially restores performance, with the strongest effects on F1 and answer inclusion. For Phi-2, this recovers +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1. For Qwen2.5-1.5B, it recovers +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1. To track how performance changes as competitors accumulate, we also report retention curves and summarize them with a right-censored half-life when the curves do not cross half-retention. Together, these results show the protocol isolates a competition effect distinct from context length, though the effect is clearer for F1 and answer inclusion than for exact match, and also varies with snippet length.
[IR-1] he Coverag e Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
链接: https://arxiv.org/abs/2605.27220
作者: Zafar Hussain,Kristoffer Nielbo
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:In modern RAG pipelines, query augmentation methods such as HyDE and query expansion are applied to every query, resulting in substantial LLM inference costs and increased end-to-end latency. The empirical justification for this overhead in real production traffic remains largely unexplored. We present a case study of the Danish National Encyclopedia, evaluating five retrieval workflows over 20,000 query-workflow pairs from production traffic and synthetic conditions. In this system, synthetic queries suggest that LLM augmentation is needed for over 90% of queries to achieve high retrieval coverage. However, under our production deferral policy, only 27.8% of real user queries need LLM augmentation. We call this gap the Coverage Illusion and attribute it to a structural mismatch between synthetic and real query distributions. Pre-retrieval routing cannot resolve this gap, as the need for LLM augmentation is only revealed after searching the index, a result confirmed by our evaluation of four machine learning paradigms. The coverage gap, undetectable from the query alone, motivates a post-retrieval cascade that runs workflows in cheapest-first order and escalates to LLM augmentation only when a step returns no documents. Operating entirely without training overhead or secondary serving infrastructure, the cascade improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and serves 72.2% of real user queries without LLM augmentation.
[IR-2] GraphReview: Scientific Paper Evaluation via LLM -Based Graph Message Passing
链接: https://arxiv.org/abs/2605.27204
作者: Pujun Zheng,Wanying Ren,Jiacheng Yao,Guoxiu He,Star X. Zhao
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Scientific paper evaluation often involves not only assessing a manuscript itself, but also relating it to contemporaneous research and prior literature. However, existing LLM-based methods typically model these signals separately and lack a unified mechanism for propagating review evidence across papers. We propose \textbfGraphReview , a graph-based LLM framework that formulates paper evaluation as review-signal message passing over a semantic paper graph. The graph jointly captures intrinsic quality, synchronic links among contemporaneous papers, and diachronic links to prior work. LLMs are used to estimate node-level quality priors and generate edge-level comparative evidence through pairwise paper comparisons, while Personalized PageRank integrates review signals for quality ranking, decision prediction, and review generation. To produce higher-quality graph evidence, we propose reward-induced maximum likelihood objectives for training the LLM backbones. Experiments show that GraphReview consistently outperforms the strongest baseline, achieving average improvements of 29.7% on decision and ranking metrics, including gains of 23.7% in Accuracy and 57.6% in Spearman’s \rho . It also produces higher-quality review texts and generalizes effectively across time periods and conference venues. The code is available at this https URL.
[IR-3] Rethinking Agent ic RAG : Toward LLM -Driven Logical Retrieval Beyond Embeddings
链接: https://arxiv.org/abs/2605.27123
作者: Yuqi Zeng,Qixiang Deng,Yulei Wan,Ruiquan Jiang,Xiaoqing Zheng,Xuanjing Huang
类目: Information Retrieval (cs.IR)
备注:
Abstract:Recent advances in RAG have shifted toward an agentic paradigm, where LLMs interact with retrieval systems over multiple turns and iteratively refine queries based on intermediate results. At the same time, LLMs have demonstrated a strong ability to construct structured queries that precisely express their information needs. However, contemporary RAG systems remain heavily focused on engineering complex retrieval backends, including dense, hybrid, and graph-based retrieval architectures. In this study, we argue that agentic RAG should delegate greater control to the LLM to steer the retrieval process, while relying on a lightweight retrieval interface that provides fine-grained control and faithfully executes the LLM’s structured intent. Guided by this principle, we propose an agentic RAG framework that enables LLMs to formulate retrieval intents using logical expressions while simplifying the retrieval backend to an inverted-index-based system. Extensive experiments show that our framework matches a strong agentic hybrid baseline, while substantially reducing construction and serving cost. Moreover, we show that anchoring the retrieval process in logical queries substantially reduces hallucinations in generated responses.
[IR-4] Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG
链接: https://arxiv.org/abs/2605.27105
作者: Jorge Gabín,Anxo Perez,Javier Parapar
类目: Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model’s input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as lost in the middle and related long-context phenomena. However, empirical findings remain inconsistent and hard to reproduce across models, datasets, and evaluation protocols. In this paper, we present a systematic reproducibility study that revisits these claims and examines how they evolve with contemporary LLMs under a controlled evaluation framework. We first show that topic sampling is a major source of variance: small topic sets can mask or exaggerate ordering effects. Based on repeated subset sampling across multiple topic budgets, we provide a practical calibration procedure that identifies topic counts yielding stable trends at feasible cost. Using these fixed topic sets, we then reproduce and extend results on position sensitivity, re-evaluating lost in the middle and positional biases in modern LLMs. Then, we also study a more realistic RAG scenario in which relevance is mediated by a retriever rather than oracle access to ground-truth documents. In this setting, we re-examine a recent industry study and identify discrepancies to evaluation choices such as limited topic coverage and reliance on LLM-based judges. Finally, we conduct an analysis of how retrieval order and context size affect downstream LLM performance under imperfect retrieval. Our results demonstrate that both factors interact strongly with retrieval quality and model choice, and that conclusions drawn from idealised setups do not always transfer to real-world RAG pipelines. We release all code and configurations to support reproducibility and future work on robust RAG evaluation.
[IR-5] MuChator: Enabling Active Music Discovery via Conversational Music LLM s in Douyin Music
链接: https://arxiv.org/abs/2605.27103
作者: Jiahao Liang,Linzhi Huang,Xuannan Liu,Xukai Wang,Xuanpu Luo,Yongchun Zhu,Jingwu Chen,Feng Zhang,Xiao Yang
类目: Information Retrieval (cs.IR)
备注:
Abstract:Douyin Music, a large-scale platform with millions of daily users, adopts an immersive, feed-based discovery paradigm, where users passively explore music through continuous recommendations. While effective for passive music discovery, this paradigm restricts users to recommendation results and provides limited support for explicitly specifying listening intents. Unlike conventional search, where users express well-defined intents through explicit queries such as specific songs or artists, real-world active music discovery is often situational and colloquial, involving vague or underspecified requests. While LLMs enable natural language interaction, their direct use in music discovery remains limited by insufficient music-domain knowledge, lack of music-query collaborative reasoning, and shallow understanding of personalized preferences. To address these challenges, we introduce MuChator, an interactive MusicLLM-based framework that enables users to actively express situational music intents in natural language. MuChator incorporates three key components: (1) Music Knowledge Pre-training, a three-stage scheme that incrementally injects objective music knowledge, subjective music knowledge, and personalized music preferences into LLMs; (2) Context-aware Instruction Tuning, which constructs high-quality user-query-music triplets through an automated synthesis pipeline to align LLMs with active and situational user intents; and (3) Preference Alignment with Hybrid RM, which jointly models intent relevance, personalized preferences, and basic constraints, and is optimized using GRPO-based reinforcement learning. Extensive evaluations on industrial music recommendation datasets demonstrate that MuChator outperforms leading proprietary models, such as Gemini-3-Pro. The model has been deployed on Douyin Music App within ByteDance, with 46.49% improvement of user active days in online A/B test.
[IR-6] Large Language Model-Powered Query-Driven Event Timeline Summarization in Industrial Search KDD2026
链接: https://arxiv.org/abs/2605.27066
作者: Mingyue Wang,Xingyu Xie,Hang Yang,Li Gao,Lixin Su,Ge Chen,Dawei Yin,Daiting Shi
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted at KDD 2026
Abstract:Understanding how events evolve over time is essential for search engines handling queries about trending news. We present QDET (Query-Driven Event Timeline Summarization), a production system deployed on Baidu Search that constructs focused event timelines to explain specific query events. Unlike traditional topic-centric approaches that aim for comprehensive coverage, QDET identifies and organizes sub-events closely relevant to the query from noisy candidate sets formed by millions of documents retrieved daily. QDET incorporates two key innovations: (1) multi-task supervised fine-tuning with three auxiliary tasks-temporal ordering, causal judgment, and timeline completion-that enable compact models to match the performance of much larger general-purpose models in specialized domains; (2) reinforcement learning-based event concise summarization that enforces strict length constraints while maintaining semantic quality, achieving 88.2% length compliance and outperforming 671B-scale models by 7.7 points in constraint satisfaction. Our fine-tuned 7B parameter model achieves 76.2% F1 score on timeline summarization, slightly surpassing the zero-shot performance of DeepSeek-R1-671B (76.1% F1) while using only 1% of its parameters-demonstrating that domain-specific optimization enables production-ready models with comparable quality at drastically reduced computational costs. Online A/B tests on Baidu Search validate real-world effectiveness, showing 5.5% CTR improvement, 4.6% longer dwell time, and 4.4% deeper exploration compared to single-task baselines. We further demonstrate that timeline understanding transfers to heat prediction, confirming effective knowledge transfer to downstream tasks.
[IR-7] he 2nd EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval
链接: https://arxiv.org/abs/2605.26941
作者: Junchen Fu,Xuri Ge,Xin Xin,Alexandros Karatzoglou,Ioannis Arapakis,Xi Wang,Qijiong Liu,Qian Li,Joemon M. Jose
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted as a workshop proposal at ACM Multimedia 2026
Abstract:Multimodal representation learning has attracted increasing attention in AI, driven by the strong performance of large, pretrained multimodal foundation models such as Qwen, LLaVA, and CLIP. These models deliver impressive performance on a range of multimodal information retrieval (MIR) tasks, including web search, cross-modal retrieval, and recommender systems. Yet their massive parameter counts create major efficiency bottlenecks when adapting their representations for IR tasks during training, deployment, and inference. These limitations hinder the practical use of foundation models for representation learning in information retrieval. To address these issues, we propose organizing the EReL@MIR workshop at MM 2026, bringing together researchers from academia and industry to discuss emerging solutions, open challenges, and new efficiency metrics and benchmarks for multimodal IR representation learning in the foundation-model era. The workshop’s official website is available at this https URL.
[IR-8] ICICLE: Expanding Retrieval with In-Context Documents
链接: https://arxiv.org/abs/2605.26902
作者: Yu-Chen Den,Yung-Yu Shih,Zhi Rui Tam,Kuan-Yu Chen,Pu-Jen Cheng,Yun-Nung Chen,Eugene Yang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge, However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations incurs repeated training and catastrophic forgetting of previously indexed documents. In this work, we revisit incremental GR as an in-context retrieval problem, where newly added documents are supplied as inference-time document-docid evidence. We propose ICICLE, an in-context indexing framework that performs source-aware docid generation over both parametric memory and context-provided document-docid pairs. ICICLE combines a [COPY]-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded retrieval from parametric retrieval. Experiments on MS MARCO and NQ320K show that ICICLE improves retrieval of newly introduced documents while preserving seen-document retention without corpus-specific retraining. Our analysis further shows that high-shot degradation is mainly caused by routing failure, highlighting source-selection calibration as a key bottleneck for scaling in-context generative retrieval.
[IR-9] RAG EAR: Retrieval-Augmented Graph-Enhanced Academic Recommender
链接: https://arxiv.org/abs/2605.26819
作者: Francesco Granata,Lorenzo Lamazzi,Misael Mongiovì,Francesco Poggi,Valeria Secchini
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recommendation. RAGEAR combines dense retrieval over full lecture transcripts with a symbolic Knowledge Graph modelling courses, lessons, transcript chunks, credits, study plans, and curricular information. The Knowledge Graph supports symbolic filtering and contextualisation based on structured constraints, such as credits, academic disciplines, study plans, and prerequisites. Unlike metadata-based approaches, it exploits fine-grained instructional content by retrieving transcript chunks semantically aligned with a student’s query. The main contribution is a graph-aware aggregation function that propagates chunk-level evidence to course-level recommendations. The score combines three factors: the share of retrieved similarity associated with a course, the rank-based strength of its relevant chunks, and the distribution of evidence across lessons. We evaluate RAGEAR on 152 student-like queries through a human evaluation sample and a large-scale LLM-based relevance assessment. Results show that lecture transcripts improve over metadata-only retrieval, and that RAGEAR further improves ranking quality over a transcript-based normalized SumP baseline, especially for top-ranked recommendations.
[IR-10] L2Rec: Towards Dual-View Understanding of LLM s for Personalized Recommendation SIGIR2026
链接: https://arxiv.org/abs/2605.26717
作者: Pingjun Pan,Tingting Zhou,Peiyao Lu,Tingting Fei,Hongxiang Chen,Chuanjiang Luo
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at SIGIR 2026
Abstract:Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.
[IR-11] Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification
链接: https://arxiv.org/abs/2605.26663
作者: Jingxi Qiu,Zeyu Han,Cheng Huang
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注: Preprint. Under review. 20 pages, 2 figures
Abstract:Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.
[IR-12] Is Position Bias in Dense Retrievers Built In-or Learned from Data?
链接: https://arxiv.org/abs/2605.26578
作者: Daegon Yu,SeungYoon Han,Woomyoung Park
类目: Information Retrieval (cs.IR)
备注:
Abstract:Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, we study how the positional distribution of evidence in training data affects retrieval-level bias direction. To test this, we construct synthetic position-targeted training sets in which query-relevant evidence appears at the beginning, middle, or end of documents, and fine-tune eight architecturally diverse pretrained models under position-skewed and balanced training distributions. At the ranking level, we observe a strong directional pattern across the examined models: skewed training distributions favor evidence at the corresponding positions. Position-balanced training reduces positional sensitivity by 57–87% on position-aware benchmarks, with competitive mean retrieval performance in our controlled setting. Representation-level analyses further suggest that fine-tuning often reshapes learned positional preferences, although pre-existing architectural or pretraining-specific tendencies persist in some models. These results identify training-position distribution as a major controllable factor in retrieval-level position bias and suggest balanced data curation as a practical mitigation strategy.
[IR-13] FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
链接: https://arxiv.org/abs/2605.26476
作者: Jingbin Qian,Congwen Yi,Min Xia,Wen Wu,Jun Zhu,Jian Guan(a href=“http://FutureFab.AI” rel=“external noopener nofollow” class="link-external link-http"this http URL/a)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalable. We introduce FAB-Bench, an end-to-end framework for adaptive benchmarking of RAG systems in semiconductor manufacturing. FAB-Bench defines six diagnostic metrics measuring factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, and reasoning consistency. The framework couples retriever diagnostics with generator-level reasoning analysis across context windows of 4K-32K tokens, quantifying how retrieval precision and generative fidelity co-evolve as contextual scope expands. From over 1,300 generated candidates, we curated a high-quality benchmark of 200 query-answer pairs spanning three synthesis strategies: needle-in-haystack, intra-document multi-topic, and cross-document multi-hop. Systematic evaluation across four LLMs and four RAG frameworks reveals three distinct context-scaling behaviors: logarithmic growth, early saturation, and cold-start dynamics, and identifies attention dilution as the primary mechanism behind performance degradation at extreme context lengths. Cross-framework validation on three additional production RAG systems confirms evaluation portability.
[IR-14] Generalized Range Filtering Approximate Nearest Neighbor Search: Containment and Overlap [Technical Report] KDD2026
链接: https://arxiv.org/abs/2605.26474
作者: Yingfan Liu,Tong Wu,Jiadong Xie,Yang Zhao,Jeffrey Xu Yu,Jiangtao Cui
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注: The paper has been accepted by KDD 2026
Abstract:Approximate nearest neighbor (ANN) search with range filters has recently garnered significant attention. This paper delves into a generalized form of this problem, i.e., ANN search with exact range-range (RR) predicates on a range-valued attribute, named RR filtering ANN (RRANN). Specifically, given n vectors in \mathbbR^d , each vector v_i is associated with a numeric range [l_i, r_i] , symbolizing aspects like a price range or time interval. An RRANN query (v_q, l_q, r_q) aims at finding k vectors closest to v_q within the vectors satisfying an arbitrary RR predicate defined between the query range [l_q, r_q] and the object range [l_i, r_i] . The RR predicate remains unspecified, enabling user-defined conditions. It may encompass containment ( [l_i, r_i] \subseteq [l_q, r_q] or [l_q, r_q] \subseteq [l_i, r_i] ), overlap ( l_i \le l_q \le r_i \le r_q or l_q \le l_i \le r_q \le r_i ), or a disjunction of them. RRANN has broad applications in queries related to price ranges or time intervals, and it generalizes existing variants of ANN search with range filters. However, existing dedicated approaches for these problems lack the capacity to support queries with arbitrary RR predicates. Hence, we introduce a new approach, labeled multi-segment tree graph. It efficiently handles arbitrary RR predicates by avoiding traversal through non-predicate-satisfied nodes, and keeps equivalent index size and construction time to state-of-the-art methods for RFANN. Extensive experiments on real-world data demonstrate the efficacy of our approach in RRANN queries, achieving up to 12.5x speedups with the same accuracy as the baselines. Moreover, our approach attains comparable RFANN search performance and notably superior IFANN and TSANN search performance compared to the respective state-of-the-art approaches. Our code is available at this https URL.
[IR-15] Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation SIGIR2026
链接: https://arxiv.org/abs/2605.26424
作者: Ge Fan,Nan Zhao,Kai Meng,Cong Luo,Yang Fu,Huiping Chu,Jialin Liu,Yuning Jiang,Bo Zheng
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: accepted by SIGIR 2026
Abstract:With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan’s contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed “Effective Completion Score” serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.
[IR-16] Plans for Evaluating Structured Generative Search Summaries
链接: https://arxiv.org/abs/2605.26400
作者: Tetsuya Sakai,Jina Lee,Hanpei Fang,Young-In Song
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages (including 2 pages for references)
Abstract:We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured summary, generated by a large language model, typically consists of an overview, several sections with section titles, and a list of source documents that are cited within the summary. We then describe our plans for implementing and evaluating the framework.
[IR-17] Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking ICML2026
链接: https://arxiv.org/abs/2605.26385
作者: Haruka Kiyohara,Mihaela Curmei,Ariel Evnine,Shankar Kalyanaraman,Israel Nir,Ana-Roxana Pop,Nitzan Razin,Sarah Dean,Thorsten Joachims,Udi Weinsberg
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: ICML2026
Abstract:Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of “vanilla” policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel “credit-assigned” policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
人机交互
[HC-0] Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models
链接: https://arxiv.org/abs/2605.27299
作者: Murat Moran
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
[HC-1] Atari Games Challenge: A Pilot Study on Multimodal Player Experience Assessment
链接: https://arxiv.org/abs/2605.27261
作者: Oleg Jarma Montoya,Erica Manca,Thomas Vase Schultz Volden,Paolo Burelli
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We present a pilot study on the collection and synchronisation of multimodal data for player experience investigation. We collected game telemetry, self-reported surveys, biometrics, and cued-retrospective think-aloud (C-RTA) data from 19 participants playing three Atari 2600 games. The study then uses the data to investigate difficulty in PX, showcasing a protocol for future multimodal research. The dataset obtained from the experiment, which is publicly available, shows potential as a rich, transformative source that can be used to investigate dynamic difficulty adjustment algorithms, game balancing strategies or broader explorations of games user research. The study findings suggest that the experimental approach holds strong potential for generalisation in future player experience studies. Subjects: Human-Computer Interaction (cs.HC) Cite as: arXiv:2605.27261 [cs.HC] (or arXiv:2605.27261v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2605.27261 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-2] Rethinking AI Psychosis: Misnomers Conceptual Limits and Existential Drift
链接: https://arxiv.org/abs/2605.26858
作者: Kasper Møller Nielsen,Lucy Osler
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:There has been a proliferation of media reports about so-called AI psychosis in the last year. Not surprisingly, this has prompted growing academic work on the ways in which AI chatbots such as ChatGPT, Claude, and Replika might aggravate or even induce psychosis, typically understood in terms of users acquiring or maintaining delusional beliefs. Our paper consists of two parts. First, we provide a number of reasons to be sceptical about understanding ‘AI psychosis’ as a novel psychiatric category. We argue that many of the purportedly new phenomena are better understood through Stompe et al.'s (2003) metaphor of ‘old wine in new bottles’ and highlight conceptual, nosological, clinical, and social risks associated with the uncritical adoption of this terminology. Second, we develop a positive phenomenological account of what may nevertheless be at stake in sustained human-AI interaction. Rather than focusing primarily on whether AI systems induce, amplify, or sediment delusional beliefs, we examine how conversational AI may participate in transforming a person’s lived experience of reality itself. We claim that the sycophantic and pseudo-intersubjective nature of AI could lead to what we call “existential drift”, whereby individuals may continue to feel rooted in a shared reality through their interactions with AI, while actually becoming entrenched in increasingly private and subjective worlds.
[HC-3] Manipulating Tangible Virtual Object Dynamics to Promote Learning of Precision Force Generation
链接: https://arxiv.org/abs/2605.26782
作者: Alberto Garzás-Villar,Alba Riera-Cardona,Alexis Derumigny,J. Micah Prendergast,Jane Murray Cramm,Laura Marchal-Crespo
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注:
Abstract:Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring’s force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force. While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.
[HC-4] Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering
链接: https://arxiv.org/abs/2605.26620
作者: Lukas Ellinger,Alexander Fichtl,Miriam Anschütz,Georg Groh
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.
[HC-5] Design First Code Later: Aesthetically Pleasing Template-Free Slides Generation
链接: https://arxiv.org/abs/2605.26451
作者: Zhiyao Cui,Chenxu Wang,Shuyue Hu,Yiqun Zhang,Wenqi Shao,Qiaosheng Zhang,Zhen Wang
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at this https URL.
[HC-6] Slide Deck QA Quality Assurance App: A Multi-Stage Pipeline for Pedagogical Question Generation
链接: https://arxiv.org/abs/2605.26428
作者: Jim Salsman
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices
Abstract:Generating high-quality, pedagogically useful questions from lecture slide decks is difficult because important instructional content is distributed across both text and visual elements, and because useful questions must be scaffolded across the flow of a presentation rather than generated slide by slide in isolation. This paper describes Slide Deck Q\A Quality Assurance (slidesqaqa), a Flask-based software system that extracts text and rendered images from PDF slides and processes them through a four-stage large language model pipeline comprising window planning, deck synthesis, slide annotation, and reconciliation. The system reasons jointly about slide modality and pedagogical role, allocates bounded question budgets, and revises draft annotations at the deck level to reduce redundancy and improve coverage. The final output is a structured JSON annotation containing deck-level goals, section structure, slide-level summaries, question sets, and evaluation scores. Initial experiments on two technical lecture decks indicate that the pipeline can filter non-instructional slides and produce high-fidelity, pedagogically coherent questions for visually complex content. The working system is at this https URL The software repository is at this https URL Comments: 15 pages, 3 research questions, 1 figure, 1 table, 6 references, 2 appendices Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC) MSC classes: 68T50 ACMclasses: K.3.1; D.2.2 Cite as: arXiv:2605.26428 [cs.CL] (or arXiv:2605.26428v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2605.26428 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[HC-7] Real-time Directionality Aware 3D Ultrasound Reconstruction and Re-Slicing
链接: https://arxiv.org/abs/2605.26325
作者: Tobias Jaeggi,David Gregory Black,Septimiu Salcudean
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Tele-ultrasound through teleoperation allows experts to perform examinations remotely in communities, but limited connectivity can lead to communication delays that reduce usability and diagnostic performance. Visual-haptic model mediated teleoperation reslices a pre-acquired ultrasound volume in real time to provide an accurate, delay-independent preview image for the sonographer. This enables fast and robust exploration before using the live image for fine tuning. However, existing reslicing techniques do not account for the directional nature of ultrasound - the fact that a structure looks different when imaged from different directions. This paper presents Directionality-Aware Reslicing (DARE), an ultrasound volume reconstruction and reslicing framework that takes directionality into account. The presented GPU-accelerated algorithm allows real-time reslicing from arbitrary viewpoints to generate accurate preview images. The method is evaluated quantitatively through image similarity metrics and qualitatively through a user study, and significantly outperforms existing reslicing methods in image similarity and realism compared to a ground truth. This can improve the effectiveness and robustness of tele-ultrasound in low-resource areas.
[HC-8] Visual Matters: Connecting Aesthetic Appeal and Production Quality of Photos Infographics and Data Visualizations to Credibility of Social Media Posts
链接: https://arxiv.org/abs/2605.26309
作者: Salman Khawar,Yingdan Lu,Yilang Peng,Jiyoung Yeon,Cuihua Shen
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
Abstract:The rapid proliferation of visual content raises fundamental questions about how different visual formats and features shape perceived credibility. Drawing on processing fluency theory, this research examines how visuals shape credibility judgments. We focus on three popular formats-photos, infographics, and data visualizations-comparing them to text-only posts, and test how two visual features, aesthetic appeal and production quality, influence credibility through processing fluency as a mediating mechanism. Through a preregistered experiment with 1200 US participants, we found that visual posts are generally perceived as more credible than text-only posts but this credibility advantage only applies to photos and infographics, not to data visualizations. Aesthetic appeal increases perceived credibility, partially mediated by processing fluency, while production quality had no significant effect on credibility across formats. These findings differentiate visual formats, advance conceptualizations of visual features, and identify processing fluency as a key mechanism for theorizing credibility across multimodal contexts.
[HC-9] “You do understand that people dont trust technology?”: Explaining Trusted Execution Environments to Non-Experts
链接: https://arxiv.org/abs/2605.26196
作者: McKenna McCall,Carolina Carreira,Miguel Flores,Lorrie Faith Cranor
类目: Human-Computer Interaction (cs.HC); Cryptography and Security (cs.CR)
备注:
Abstract:Trusted Execution Environments (TEEs) protect confidentiality and integrity of trusted applications by creating an isolated environment for executing code. Prior work has shown that users may feel more comfortable sharing data when they know it will be protected by a TEE, especially if they understand what a TEE is. In this study, we evaluated text-based explanations introducing TEEs to non-experts. We analyzed existing TEE explanations to develop candidate explanations and evaluated them via vignette scenarios with 966 crowdworkers. The explanations that enhanced understanding most were non-technical ones that highlighted specific threats that can be prevented by a TEE. Surprisingly, even the explanations that enhanced understanding had little effect on willingness to use the TEE-enhanced technology. These results provide insights into ways to communicate technical security concepts more effectively but also suggest that explaining security technology might not be enough to address users’ privacy concerns.
[HC-10] Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains
链接: https://arxiv.org/abs/2605.26146
作者: Elias Calboreanu
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 60 pages, 5 figures, 7 tables. Companion to arXiv:2604.04258 (Context Engineering). Formatted for the Journal of Systems and Software (In Practice track)
Abstract:Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, recreating the staffing models AI was expected to transform. Yet the meta-skills that make these tools effective, prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design), are domain-portable: a practitioner who masters them can apply them to any purpose-built AI tool in any domain. This paper defines Augment Engineering as the discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries. We present a six-phase orchestration methodology and four portability metrics. A 5-month formative case study (November 2025 to March 2026) documents a single practitioner applying these skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists. Two quantitative observations are consistent with the framework’s predictions: a Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p 0.01) shows first-pass acceptance rising with prompt-sophistication level, and a Wright’s Law fit (n = 82 artifacts, p 0.01) shows production acceleration across the artifact portfolio. Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication. Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).
计算机视觉
[CV-0] G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing
链接: https://arxiv.org/abs/2605.27372
作者: Bharath Raj Nagoor Kani,Noah Snavely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
[CV-1] SpatialBench: Is Your Spatial Foundation Model an All-Round Player?
链接: https://arxiv.org/abs/2605.27367
作者: Haosong Peng,Hao Li,Jiaqi Chen,Yuhao Pan,Runmao Yao,Yalun Dai,Fushuo Huo,Fangzhou Hong,Zhaoxi Chen,Haozhao Wang,Dingwen Zhang,Ziwei Liu,Wenchao Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.
[CV-2] LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
链接: https://arxiv.org/abs/2605.27365
作者: Shihao Wang,Shilong Liu,Yuanguo Kuang,Xinyu Wei,Yangzhou Liu,Zhiqi Li,Yunze Man,Guo Chen,Andrew Tao,Guilin Liu,Jan Kautz,Lei Zhang,Zhiding Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
[CV-3] Feedforward 3D Editing Learns from Semantic-Part Transformation
链接: https://arxiv.org/abs/2605.27351
作者: Jiawei Weng,Saining Zhang,Zhenxin Diao,Peishuo Li,Henghaofan Zhang,Junhao Chen,Hao Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 22 figures. Project Page: this https URL
Abstract:3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.
[CV-4] When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
链接: https://arxiv.org/abs/2605.27348
作者: Kim Jihyeon,Sohee Kim,Soosan Lee,Souhwan Jung,James Matthew Rehg,Hyesong Choi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 2 figures, 17 tables
Abstract:Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 - 71.5) and +1.3 pp on the COCOAI Person subset (83.0 - 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a “predict-all-fake” artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
[CV-5] owards Controllable Image Generation through Representation-Conditioned Diffusion Models
链接: https://arxiv.org/abs/2605.27343
作者: Nithesh Chandher Karthikeyan,Jonas Unger,Gabriel Eilertsen
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.
[CV-6] PARE: Pruning and Adaptive Routing for Efficient Video Generation
链接: https://arxiv.org/abs/2605.27336
作者: Yutong Wang,Yunke Wang,Tianfan Xue,Yu Qiao,Yaohui Wang,Xinyuan Chen,Chang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.
[CV-7] EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
链接: https://arxiv.org/abs/2605.27332
作者: Zhifei Dou,Shabnam Hassani,Ou Wei
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversion of these flowcharts into machine-readable models for RE activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow that augments a VLM’s original input with a deterministically extracted Canny edge map-acting as a structural prior-to improve flowchart-to-Mermaid conversion, without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools. Comments: 10 pages Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2605.27332 [cs.SE] (or arXiv:2605.27332v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2605.27332 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-8] Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning
链接: https://arxiv.org/abs/2605.27318
作者: Xianqiang Gao,Qizhi Chen,Delin Qu,Haoming Song,Zhigang Wang,Bin Zhao,Dong Wang,Xuelong Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose \textbf\ours, a question-guided geometric memory framework for video spatial reasoning. \ours injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that \ours achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
[CV-9] How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
链接: https://arxiv.org/abs/2605.27310
作者: Qian Yang,Ankur Sikarwar,Huy Le,Le Zhang,Zhuan Shi,Perouz Taslakian,Aishwarya Agrawal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
[CV-10] PlayClass: Automated Play Behaviour Classification in Poultry CVPR2026
链接: https://arxiv.org/abs/2605.27304
作者: Prince Ravi Leow(1),Neil Scheidwasser(1 and 3),Rebecca Oscarsson(2),Per Jensen(2),Samir Bhatt(1 and 3),David Alejandro Duchêne(1) ((1) Section for Health Data Science amp; AI, University of Copenhagen, (2) AVIAN Behaviour Genomics and Physiology Group, Linköping University (3) Department of Infectious Disease Epidemiology, Imperial College London)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CV4Animals Workshop @ CVPR 2026
Abstract:Automated monitoring of animal welfare has largely targeted negative indicators, leaving positive welfare behaviours such as play underexplored. To address this gap, we present PlayClass, a pipeline for play-behaviour classification in poultry from top-down pen video. The pipeline leverages long-duration tracking with SAM 3 via YOLO-guided chunk boundaries to minimise identity errors in point-based prompting, and frozen embeddings from image and video foundation models for play action classification. Although handcrafted motion features from tracked masks alone achieved competitive accuracy, V-JEPA 2.1 consistently outperformed all other backbones across model scales, reaching 77.0 macro-averaged F _1 when combined with handcrafted features. Despite this result, the dataset remains challenging due to play sub-types sharing similar kinematic profiles with non-play and inter-bird occlusion. Overall, our work provides encouraging evidence towards automated frameworks for play behaviour classification in poultry.
[CV-11] Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
链接: https://arxiv.org/abs/2605.27295
作者: Madhuri Shanbhogue,Zhe Li,Shanfeng Zhang,Gustavo Hernández Ábrego,Shih-Cheng Huang,Aashi Jain,Daniel Salz,Sonam Goenka,Chaitra Hegde,Ji Ma,Feiyang Chen,Jiaxing Wu,Tanmaya Dabral,Babak Samari,Kevin Poulet,Daniel Cer,Kaifeng Chen,Paul Suganathan,Hui Hui,Jovan Andonov,Philippe Schlattner,Jay Han,Iftekhar Naim,Wing Lowe,Vladimir Pchelin,Albert Yang,Yi-Ting Chen,Zhongli Ding,Grace Zhang,Georg Heigold,Yichang Chen,Antoine Reveillon,Brendan Mccloskey,Wenlei Zhou,Dahun Kim,Rui Meng,Emma Wang,Jack Zheng,Halley Fede,Zhen Yang,Keegan Mosley,Brian Potetz,Sahil Dua,Henrique Schechter Vera,Shen Gao,Hesen Zhang,Andreas Hess,Hengxuan Ying,Alberto Montes,Karan Gill,Min Choi,Sebastian Russo,Anja Hauth,Jinhyuk Lee,Michael Boratko,Megan Barnes,Vikram Rao,Claudiu Musat,Cyril Allauzen,Ehsan Variani,Shankar Kumar,Tom Bagby,Junyi Jiao,Yang Gu,Tengxin Li,Ayush Agrawal,Roberto Santana,Dev Nath,Stephen Karukas,Shuoxuan Han,Lucia Loher,Alice Twu,Nidhi Vyas,Siddharth Bhai,Frank Palma Gomez,Wangyuan Zhang,Chaoren Liu,Jizheng Yang,Steve Qiu,Shijie Zhang,Sujay Kulkarni,Sascha Rothe,Sean Nakamoto,Raphael Hoffmann,Zach Gleicher,Yunhsuan Sung,Qin Yin,Tom Duerig,Mojtaba Seyedhosseini
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
[CV-12] A Dynamic Programming Framework for Discovering Count and Values of Multilevel Image Thresholding
链接: https://arxiv.org/abs/2605.27287
作者: Eslam Hegazy,Mohamed Gabr
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multilevel Image thresholding is an important preprocessing algorithm in computer vision applications nowadays. Since most common thresholding methods take the desired count of thresholds as input by the user, thresholding methods that automatically determines a suitable count of thresholds from the input image itself are advantageous. In this article, a novel thresholding method based on a dynamic programming algorithm and a modification of Minimum Error Thresholding (MET) criterion is thoroughly presented. An empirical statistical study is performed to pinpoint why this proposed method is superior. Moreover, an extended comparison between this proposed method and other state-of-the-art methods is performed on a comprehensive set of natural, satellite and medical test images. The numerical results show that the proposed MET-DP method takes much less time than traditional dynamic programming thresholding methods when the number of thresholds is high. The proposed method can detect a suitable count of thresholds for most of tested images of different types. However, traditional methods that take the count of thresholds as input produce thresholded images of higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) values than MET-DP. Source code can be found on this https URL
[CV-13] Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
链接: https://arxiv.org/abs/2605.27243
作者: Aaron Branson Cigres Li,Zhaowei Wang,Yu Zhao,Yiming Du,Haobo Li,Xiyu Ren,Ginny Wong,Simon See,Lishu Luo,Haodong Duan,Pasquale Minervini,Yangqiu Song
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Work in Progress
Abstract:Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
[CV-14] MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale CVPR2026
链接: https://arxiv.org/abs/2605.27235
作者: Zhicong Tang,Zhao Zhang,Jingye Chen,Mohan Zhou,Yifan Pu,Yuchi Liu,Yalong Bai,Ethan Smith,Yuhui Yuan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100\times faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layer inference.
[CV-15] Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis
链接: https://arxiv.org/abs/2605.27203
作者: Mannat Khurana,Sanyam Jain,Rishav Agarwal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 6 figures
Abstract:Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective transforms. We demonstrate the system through three use cases: contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects.
[CV-16] FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation ICML2026
链接: https://arxiv.org/abs/2605.27178
作者: Zihui Zhang,Zhixuan Sun,Yafei Yang,Jinxi Li,Jiahao Chen,Bo Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: this https URL
Abstract:We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.
[CV-17] Model discovery for dynamical systems with complex-valued product units
链接: https://arxiv.org/abs/2605.27158
作者: Martin Brückmann,Babette Dellen,Uwe Jaekel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures
Abstract:Discovering the governing equations of a dynamical system from observed trajectories provides deeper insight into its structure than mere prediction of future states. We present a data-driven approach to model discovery based on complex-valued product-unit networks, in which each unit represents a complex monomial and the network output is a sparse linear combination of such monomials. In contrast to established library-based methods such as SINDy, our approach does not require a predefined set of candidate functions: the relevant monomials, including those with fractional or negative exponents, are learned directly from data. Across four chaotic benchmark systems (Lorenz63, Lorenz84, the Four-Wing attractor, and a fractional variant of Lorenz63), we recover the exact governing equations in 90% of trials for the first three systems, and in 70-90% of trials for the fractional case, using at least 3000 training points. Applied to real-world human-gait accelerometer signals, the model produced stable trajectories with bounded prediction errors, corresponding to an RMSE of approximately 12-14% of the signal amplitude range over a test horizon three times longer than the training interval, demonstrating its potential for high-dimensional systems in which analytic equations are unavailable.
[CV-18] Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection
链接: https://arxiv.org/abs/2605.27155
作者: Nico Steckhan,Krutarth Prajapati,Weija Shao,Silvia Vock
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemProbe, a tool for semantic robustness probing: users upload deployment images, create masks manually or automatically, select operational design domain-derived factors (or custom prompts), and run diffusion-based controlled inpainting. The system supports batch jobs, parallel seed/workflow variations, and configurable generation parameters. After each output, model inference runs automatically and displays annotated before/after comparisons with performance deltas. All probes are logged as structured artifacts, enabling traceable robustness evidence aligned with safety evaluation workflows. We demonstrate \textscSemProbe on hand detection for dimension saws, targeting factors from insurance-oriented test criteria.
[CV-19] ouch-R1: Reinforcing Touch Reasoning in MLLM s
链接: https://arxiv.org/abs/2605.27154
作者: Yingxin Lai,Yafei Zhou,Fucai Zhu,Siyu Zhu,Weihao Yuan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code and data will be made public on the this https URL
Abstract:While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
[CV-20] Chaos-SSL: An Attention-Based Self-Supervised Learning Framework with Chaotic Transformation for Medical Image Classification
链接: https://arxiv.org/abs/2605.27146
作者: Joao Batista Florindo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create ``harder’’ and more semantically-rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
[CV-21] Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification
链接: https://arxiv.org/abs/2605.27144
作者: Pedro Henrique da Costa Avelar,Anderson R. Tavares,Luís C. Lamb
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth 16\times16 superpixels.
[CV-22] Leverag ing Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation
链接: https://arxiv.org/abs/2605.27136
作者: Joseph Hoche,David Brellmann,Gianni Franchi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.
[CV-23] Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?
链接: https://arxiv.org/abs/2605.27135
作者: Enoal Gesny,Eva Giboulot
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.
[CV-24] Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions ICPR2026
链接: https://arxiv.org/abs/2605.27132
作者: Eslam Hegazy,Mohamed Gabr
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICPR 2026 ( this https URL )
Abstract:Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu’s between-class variance and Kapur’s entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu’s criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur’s entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and with SSIM for over 91%. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at this https URL
[CV-25] YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting
链接: https://arxiv.org/abs/2605.27129
作者: Rajmeet Singh,Manveen Kaur,Shahpour Alirezaee,Irfan Hussain
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.
[CV-26] PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance
链接: https://arxiv.org/abs/2605.27128
作者: Yujing Zhou,Prashant Shekhar,Thomas Yang,Yongxin Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously trained and learned classes. Contributing to this crucial domain, the current paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT(Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.
[CV-27] COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection
链接: https://arxiv.org/abs/2605.27116
作者: Yupeng Zhang,Ruize Han,Yuzhong Feng,Zixin Ren,Yuntong Tian,Liang Wan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional this http URL-114 and the code will be released.
[CV-28] JLT: Clean-Latent Prediction in Latent Diffusion Transformers
链接: https://arxiv.org/abs/2605.27102
作者: Funing Fu,Tenghui Wang,Junyong Cen,Qichao Zhu,Guanyu Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256 x 256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
[CV-29] Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning ICML2026
链接: https://arxiv.org/abs/2605.27080
作者: Qida Tan,Hongyu Yang,Wenchao Du
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML2026
Abstract:Appearance-based gaze estimation always suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate the domain shifts. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at \hrefthis https URLthis https URL.
[CV-30] SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration
链接: https://arxiv.org/abs/2605.27075
作者: Yuhang Zhang,Junxiang Qiu,Huixia Ben,Zhenhua Tang,Shuo Wang,Yanbin Hao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose \textbfSoftCap, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
[CV-31] IPIBench: Evaluating Interactive Proactive Intelligence of MLLM s under Continuous Streams
链接: https://arxiv.org/abs/2605.27074
作者: Jinzhao Li,Yinuo Chen,Wenxuan Song,Yijia Lei,Yichi Zhang,Honglei Yan,Panwang Pan,Miao Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.
[CV-32] BEAT: Rhythm-Elastic Alignment for Agent ic Music-guided Movie Trailer Generation
链接: https://arxiv.org/abs/2605.27067
作者: Yutong Wang,Yunke Wang,Xinyuan Chen,Chang Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.
[CV-33] SCKAN: Structural Consensus-based KAN Prototype Learning for Semi-Supervised Pancreas Segmentation
链接: https://arxiv.org/abs/2605.27032
作者: Yuqi Liu,Yufei Chen,Wei Fu,Xiaodong Yue,Shuo Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10.5 pages, 5 figures, Medical Image Computing and Computer Assisted Intervention 2026
Abstract:Accurate pancreas segmentation is critical for early cancer diagnosis, where annotation scarcity necessitates Semi-Supervised Learning (SSL). However, due to significant inter-sample morphological variability, existing SSL methods face severe generalizability limitations under sparse supervision, leading to the Supervision Bias problem. To address this, we propose Structural Consensus-based KAN Prototype Learning (SCKAN), which constructs the first cross-sample structural consensus learning with Kolmogorov-Arnold Networks (KANs), to achieve more generalizable and accurate segmentation. Specifically, SCKAN contains two key designs: Structure-constrained Prototype Consistency Learning (SPCL), which prompts unbiased structural representation by enforcing cross-sample consistency via prototype-level contrastive optimization, and Consensus-based Kolmogorov-Arnold Fusion (CKaF), which reduces morphology-specific bias by aggregating stable consensus and filtering sample-wise noise via KAN’s adaptive B-spline nonlinearity. Extensive experiments on two public pancreas datasets demonstrate the effectiveness of SCKAN. Code is at this https URL.
[CV-34] NeR-SC: Adapting Neural Video Representation to Screen Content
链接: https://arxiv.org/abs/2605.27024
作者: Ruohan Shi,Jiaoyan Zhao,Haogang Feng
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Submitted to PRMVAI 2026
Abstract:Implicit neural representations have emerged as a promising paradigm for video compression, with recent methods achieving competitive performance on natural video. However, screen content video – common in remote desktop, online education, and cloud gaming – exhibits distinct statistics: sharp edges, limited color palettes, and strong temporal redundancy. Existing neural representation methods, designed for natural scenes, lack mechanisms to exploit these properties, leaving substantial room for improvement. In this paper, we propose NeR-SC, a neural representation framework tailored for screen content video. Building on the SNeRV backbone, NeR-SC introduces three screen-content-specific modules: (i) a learnable color palette that models the discrete color structure of screen content by restricting the low-frequency sub-band to a learned color set; (ii) a multi-gate dense fusion module that replaces sequential feature fusion with dense, attention-gated cross-stage interaction; and (iii) an embedding-level frame skip strategy that bypasses redundant decoder invocations for static frames, with zero training overhead. Experiments on DSCVC and VCD show that NeR-SC achieves 40.32~dB and 41.73~dB average PSNR, outperforming representative neural video representation methods and, at low bitrates, surpassing H.264 and H.265. The skip strategy enables real-time decoding with no loss in quality.
[CV-35] Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models CVPR2026
链接: https://arxiv.org/abs/2605.27020
作者: Tao Qi,Huili Wang,Yuanhong Huang,Wendan Wang,Lianchao Zhao,Jinrui Wang,Zichen Qin,Shangguang Wang,Yongfeng Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures; CVPR 2026 camera-ready
Abstract:The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.
[CV-36] mestep-Aware SVDQuant-GPT Q for W4A4 Quantization of Wan2.2-I2V
链接: https://arxiv.org/abs/2605.27003
作者: Junhao Wu,Dezhong Yao,Hai Jin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V’s two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
[CV-37] ChartAct: A Benchmark for Dynamic Chart Understanding
链接: https://arxiv.org/abs/2605.26994
作者: Muye Huang,Wu Lin,Lingling Zhang,Hang Yan,Zhiyuan Wang,Yumeng Fu,Zesheng Yang,Jun Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Charts are widely used to present complex data for analysis and decision making. Existing chart understanding benchmarks mainly focus on static charts, but real-world charts are often dynamic and interactive. Key information may only appear after actions such as hovering, clicking, zooming, or dragging. Dynamic chart understanding therefore requires models to identify visible content, choose proper interactions, and reason over changing chart states. To evaluate this ability, we propose ChartAct, an interactive benchmark for dynamic chart understanding. ChartAct collects and filters 673 dynamic charts from 8 real chart websites, covers 7 common chart types, and constructs 1,440 high-quality question-answer samples. Each sample is instantiated in two environments, Dynamic Chart and Dashboard Chart, to evaluate dynamic chart understanding under different contexts. Based on ChartAct, we systematically evaluate 11 advanced multimodal models and GUI agents. Experimental results show that existing models still have clear limitations in dynamic chart understanding. The strongest model, Claude-Opus-4.7, achieves an average success rate of 84.5%, while most models remain below 60%. We also conduct detailed failure attribution and case analysis. ChartAct provides a new benchmark for studying chart understanding in real interactive environments. Codes at this https URL
[CV-38] On the Robustness of Machine Unlearning for Vision-Language Models
链接: https://arxiv.org/abs/2605.26992
作者: Yujie Lin,Kaidi Jia,Jiayao Ma,Chengyi Yang,Jinsong Su
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at this https URL.
[CV-39] CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning
链接: https://arxiv.org/abs/2605.26967
作者: Zihan Lin,Songhe Deng,Shuwei He,Danxiang Zhu,Dan Zhang,Yishu Lei,Xianlong Luo,Shikun Feng,Rui Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures
Abstract:Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture temporally only localized actions, motions and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting keyframe-residual captioning a way for high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
[CV-40] DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models
链接: https://arxiv.org/abs/2605.26949
作者: Furkan Mert Algan,Eckehard Steinbach
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative based completion methods while using fewer parameters, requiring lower memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.
[CV-41] Object Pose and Shape Estimation for Grasping: Does it Work?
链接: https://arxiv.org/abs/2605.26944
作者: Pavan Karke,Kushal Shah,Gaurav Singh,Md Faizal Karim,K Madhava Krishna,Rajat Talak
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 8 figures
Abstract:The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are the object pose and shape estimation methods mature enough, such that when used with antipodal grasp sampling, can outperform the end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and single-view RGB(-D) image as input. We implement and compare a state-of-the-art, end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene, and generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end methods fail. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes - a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model, while another a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just single-view RGB-D image as input. We notice comparable performance to the state-of-the-art LERF-TOGO baseline.
[CV-42] Leverag ing Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking
链接: https://arxiv.org/abs/2605.26933
作者: Zhengbo Zhang,Zhigang Tu,Junsong Yuan,De Wen Soh,Bo Du
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026
Abstract:Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach the unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt the diffusion models, which are originally developed for image generation, to the tracking task, we reinterpret the models as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, they highlight the regions of the image that are semantically aligned with the text in the cross-attention maps. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method Diff-Tracking is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. We evaluate our approach on six challenging tracking datasets demonstrate the effectiveness of our approach.
[CV-43] Revealing the core dimensions underlying representations in brains behavior and AI
链接: https://arxiv.org/abs/2605.26921
作者: Florian P. Mahner,Ka Chun Lam,Francisco Pereira,Martin N. Hebart
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注:
Abstract:The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and leveraging the dimensions underlying representations.
[CV-44] I2PRef: Image-Driven Point Completion with Iterative Refinement
链接: https://arxiv.org/abs/2605.26914
作者: Azhar Hussian,Marina Ritthaler,André Kaup,Vasileios Belagiannis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present an image-conditioned point cloud completion approach that treats images as the primary geometric source rather than a secondary guide. To this end, we introduce an Image-to-Point (I2P) module that can reconstruct complete point clouds directly from a single RGB image, with no need for 3D inputs. Additionally, we introduce a transformer-based Point-to-Point (P2P) refinement module that uses self- and cross-attention between point tokens and image features to iteratively refine the coarse I2P output. The I2P module enables the image encoder to learn rich geometric representations, while the P2P module progressively recovers fine-grained details. Unlike existing multimodal methods that rely on auxiliary losses or fusion modules, our explicit I2P task provides a strong, geometry-aware prior based on images alone. Extensive experiments on ShapeNet-ViPC demonstrate state-of-the-art completion performance with a 12.3% relative Chamfer Distance improvement over prior methods. Code is available at: this https URL
[CV-45] SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising ICML2026
链接: https://arxiv.org/abs/2605.26894
作者: Chengwei Zhang,Xueyi Zhang,Tao Jiang,Xinhao Xu,Wenjie Li,Fubo Zhang,Longyong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026. 17 pages, 8 figures, 8 tables
Abstract:In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making one-to-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the position of underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.
[CV-46] Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation
链接: https://arxiv.org/abs/2605.26884
作者: Oussama Messai,Abbass Zein-Eddine,Abdelouahid Bentamou,Mickael Picq,Nicolas Duquesne,Stéphane Puydarrieux,Yann Gavet
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
Abstract:In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on deep learning supervised approaches. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of small, dense and overlapped object detection systems. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods and evaluation codes can be found at: this https URL
[CV-47] Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos DATE CVPR2026
链接: https://arxiv.org/abs/2605.26879
作者: Dingkun Wei,Zehong Shen,Yan Xia,Georgios Pavlakos,Yujun Shen,Xiaowei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures. Accepted as an Oral presentation and Best Paper Candidate at CVPR 2026. Project page: this https URL
Abstract:Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues – velocity and acceleration – which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
[CV-48] RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction
链接: https://arxiv.org/abs/2605.26862
作者: Chenxu Peng,Chenxu Wang,Yimian Dai,Yongxiang Liu,Ming-Ming Cheng,Xiang Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code are publicly available at: this https URL
[CV-49] REVERSE: Reinforcing Evidence Verification and Search for Agent ic Image geo-localization
链接: https://arxiv.org/abs/2605.26861
作者: Yong Li,Furong Jia,Dacheng Yin,Kang Rong,Fengyun Rao,Jing Lyu,Fan Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at this https URL.
[CV-50] Receipt Replay OOD: A Small Benchmark for Screen Replay Detection Under Domain Shift
链接: https://arxiv.org/abs/2605.26855
作者: Alexander Vinogradov
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Public datasets such as DLC-2021, SynID, and KID34K have significantly contributed to research on presentation attack detection for identity documents, including screen replay attacks. However, evaluation of out-of-domain (OOD) robustness remains insufficiently explored, especially under realistic domain shifts. In this work, we introduce Receipt Replay OOD, a small out-of-domain benchmark for screen replay detection. Receipts share several characteristics with identity documents, including planar geometry, curved corners, wear-and-tear artifacts, and text or logo patterns, while avoiding personally identifiable information constraints commonly associated with identity documents. We evaluate document replay detection models under cross-domain conditions and demonstrate the impact of domain shift on generalization performance. The dataset is publicly available.
[CV-51] OSMa-Bench: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes
链接: https://arxiv.org/abs/2605.26831
作者: Regina Kurkova,Maxim Popov,Sergey Kolyubin
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Code: this https URL
Abstract:Semantic mapping methods are increasingly used as intermediate scene representations for downstream robotic reasoning and manipulation, yet their evaluation is still largely tied to fixed benchmark datasets with limited coverage of manipulation-relevant corner cases. In this work, we extend OSMa-Bench toward controllable benchmarking with prompt-generated synthetic indoor scenes. Our pipeline automatically generates scene descriptions, synthesizes corresponding environments with SceneSmith, and adapts the resulting assets into an OSMa-Bench-compatible simulation format. This adaptation requires a nontrivial intermediate layer, including semantic normalization, material and texture repair, shader fallback policies, floor handling, navigation setup, and controlled lighting configuration. A key advantage of the proposed setup is that the original scene-generation prompt is known in advance and can therefore serve as an auxiliary semantic specification of the intended scene. We use this property to extend the VQA component of OSMa-Bench with a prompt-grounded question category. The resulting framework supports targeted stress-testing of semantic scene representations under conditions such as clutter, small objects, partial occlusions, and lighting variation, and makes benchmarking more extensible and better aligned with downstream manipulation requirements. Our code is available at this https URL.
[CV-52] he Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery
链接: https://arxiv.org/abs/2605.26830
作者: Vasileios Saketos,Ming Xiao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman Filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
[CV-53] Cesarean Scar Defect Segmentation in Transvaginal Ultrasound Images: a Dataset and Benchmark
链接: https://arxiv.org/abs/2605.26774
作者: Yuan Tian,Yue Li,Wei Xia,Tianyu Xu,Jian Zhang,Liye Shi,Jing Liu,Yang Wang,Ming Liu,Qing Xu,Yixuan Zhang,Maggie M. He,Xiangjian He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cesarean Scar Defect (CSD) is one of the most prevalent complications following cesarean delivery. Transvaginal ultrasonography is widely used for primary CSD screening. Accurate determination of CSD outline and dimensions is crucial for treatment. However, CSDs are frequently overlooked by sonographers due to small size and irregular morphology, suboptimal image quality, and limited clinical awareness in resource-constrained settings. Despite artificial intelligence advances in medical imaging, no public dataset exists for transvaginal ultrasound CSD segmentation. To address this gap, we present a comprehensive CSD dataset comprising 1,111 images and 16 videos, yielding 501 positive samples with confirmed CSD and precise pixel-level manual annotations. Annotations are performed following standardized clinical guidelines through collaboration between experienced sonographers and trained PhD students. This work provides high-quality benchmark resources for advancing medical image segmentation algorithms and promoting clinical innovation. Ultimately, improved CSD diagnosis and subsequent treatment strategies can enhance the quality of life in women of reproductive age, representing significant value for both medical research and clinical practice.
[CV-54] Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning
链接: https://arxiv.org/abs/2605.26761
作者: Mingkang Dong,Hongyi Cai,Xiwen Lei,Jie Li,Tao Zhang,Muxin Pu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures. Mingkang Dong and Hongyi Cai contributed equally to this work. Muxin Pu is the corresponding author
Abstract:Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit VLMs at both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
[CV-55] Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy BMVC2025
链接: https://arxiv.org/abs/2605.26744
作者: Pascal Herrmann,Maarten Bieshaar,Dennis Mack,Robert Herzog,Juergen Gall
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to BMVC 2025
Abstract:Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at this https URL .
[CV-56] CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains
链接: https://arxiv.org/abs/2605.26734
作者: Tomohisa Takeda,Yu-Chieh Lin,Yuji Nozawa,Youyang Ng,Osamu Torii,Yusuke Matsui
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at this https URL and this https URL.
[CV-57] Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics ICASSP2026
链接: https://arxiv.org/abs/2605.26729
作者: Hao Ren,Zetong Bi,Zhaoliang Wan,Hui Cheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP2026
Abstract:We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.
[CV-58] Joint 2D-3D Segmentation and Association in Street-level Imaging
链接: https://arxiv.org/abs/2605.26725
作者: Amir Melnikov,Masayuki Tanaka,Yusuke Monno,Masatoshi Okutomi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables
Abstract:Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.
[CV-59] METATR: A Multilingual Evolving Benchmark for Automatic Text Recognition
链接: https://arxiv.org/abs/2605.26712
作者: Mélodie Boillet,Solène Tarride,Christopher Kermorvant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.
[CV-60] Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling ICML2026
链接: https://arxiv.org/abs/2605.26702
作者: Pengzhen Chen,Yanwei Liu,Xiaoyan Gu,Antonios Argyriou,Wu Liu,Weiping Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: ICML 2026
Abstract:Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, they naturally transform under the action of SO(3) , rendering conventional planar representations and augmentation-based robustness strategies inadequate and devoid of theoretical guarantees. To address this, we formulate panoramas as spherical signals and leverage SO(3) representation theory to derive provably rotation-invariant descriptors. While spherical harmonic coefficients transform equivariantly under rotations, the natural invariant constructions are typically limited to zeroth-order statistics which eliminate directional information and severely constrain embedding capacity. In this work, we introduce a principled third-order invariant construction by coupling higher-order SO(3) irreducible representations via tensor products and projecting onto the trivial representation. This yields a spherical invariant bispectrum that preserves phase information while remaining strictly rotation-invariant. Leveraging this property, we embed watermarks into higher-order spherical harmonic coefficients and recover them from invariant bispectral scalars, enabling reliable extraction under arbitrary 3D rotations. We provide a theoretical proof of SO(3) invariance for it and demonstrate experimentally its near-perfect robustness to continuous rotations while maintaining high visual fidelity.
[CV-61] SteelDS: A High-Resolution Video Dataset of E40 Steel Scrap for Object Detection and Instance Segmentation
链接: https://arxiv.org/abs/2605.26682
作者: Melanie Neubauer,Christian Rauch,Gerald Koinig,Alexia Tischberger-Aldrian,Roland Pomberger,Elmar Rueckert
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This dataset provides high-resolution, annotated video sequences of shredded E40-grade steel and copper scrap on a conveyor belt. Captured in a controlled laboratory environment, the data reflects the industrial post-magnetic sorting stage, where manual intervention is typically required to remove copper contaminants. The dataset comprises 24,297 labeled frames across five subsets, featuring 396 steel and 101 copper objects categorized by size. It supports the development of machine learning models for material classification, object detection, and instance segmentation. Variations in object spacing and density are included to simulate realistic industrial sorting conditions. Ground truth annotations include pixel-wise segmentation masks and material classes. This dataset serves as a benchmark for evaluating automated sorting algorithms aiming to identify copper impurities within complex, heterogeneous steel scrap streams.
[CV-62] DynFrame: Adaptive Reasoning -Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
链接: https://arxiv.org/abs/2605.26680
作者: Peng Zhang,Guanghao Zhang,Wanggui He,Longxiang Zhang,Mushui Liu,Yan Xia,Zhenhao Peng,Weilong Dai,Jinlong Liu,Haobing Tang,Le Zhang,Hao Jiang,Pipei Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the “where to look” tokens and the “how to answer” tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at this https URL.
[CV-63] Memory-Distilled Selection for Noise-Robust Anomaly Detection ICML2026
链接: https://arxiv.org/abs/2605.26676
作者: Sirojbek Safarov,Jaewoo Park,Yoon Gyo Jung,Kuan-Chuan Peng,Wonchul Kim,Seongdeok Bang,Octavia Camps
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026. The code is available at this https URL
Abstract:Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.
[CV-64] Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
链接: https://arxiv.org/abs/2605.26661
作者: Yuanwei Hu,Bo Peng,Yadan Luo,Zhen Fang,Ling Chen,Jie Lu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.
[CV-65] DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding
链接: https://arxiv.org/abs/2605.26656
作者: Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng,Bing Wang,Zhixing Tan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose \textbfDirect \textbfVision \textbfSupervised \textbfFine-\textbfTuning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision–text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.
[CV-66] Adaptation-Free Heterogeneous Collaborative Perception with Unseen Agent Configurations
链接: https://arxiv.org/abs/2605.26642
作者: Hyunchul Bae,Heejin Ahn
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages main paper, 23 pages including references and appendix, 7 figures
Abstract:Collaborative perception improves 3D object detection by enabling agents to share complementary observations, but most existing methods assume fixed or known collaborator encoder configurations, limiting deployment in practice. In this work, we consider an open-world setting in which auxiliary agents with unseen configurations may appear after deployment, such as different LiDAR beam counts or encoder architectures. To address this challenge, we propose ALF, a collaborative perception framework that enables zero-adaptation collaboration with unseen agent configurations by lifting lightweight box-level messages into ego-compatible auxiliary features. ALF converts auxiliary box-level messages into pseudo-BEV maps and synthesizes ego-compatible latent features by combining object-centric cues with scene context from the ego feature. On V2X-Real, under a zero-shot evaluation across 64 case studies, ALF outperforms the strongest prior baseline by 35.91% in relative mAP@0.7 while requiring only 120 bytes per agent per frame (approximately 9.6 Kbps bandwidth at 10 Hz).
[CV-67] OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation
链接: https://arxiv.org/abs/2605.26641
作者: Yunze Liu,Chi-Hao Wu,Enmin Zhou,Junxiao Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modalities are available, but standard pairwise InfoNCE objectives leave this signal unused during training. We close this gap with fusion-as-teacher distillation, which treats a stop-gradient copy of the fused embedding as a teacher signal for the single-modal embeddings, paired with a Tuple-InfoNCE term that supervises the fused embedding directly. We instantiate this objective as OmniRetriever-7B. Across six zero-shot retrieval benchmarks, OmniRetriever-7B surpasses the closed-source Gemini Embedding 2 by 13.3-18.0 R@1 on Clotho and SoundDescs, and reaches the contemporary zero-shot specialist band of open video-text encoders on MSR-VTT and MSVD. To stress-test joint representations, we further release OmniRetriever-Bench, a 12-direction AVT retrieval benchmark totaling 3782 triples; on it OmniRetriever-7B attains AVG-all 34.84, improving over Gemini Embedding 2 by 1.72 and over the best prior open-source AVT method by 8.03.
[CV-68] JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search CVPR2026
链接: https://arxiv.org/abs/2605.26636
作者: Dongyun Zou,Zhuoyang Zhang,Junyu Chen,Wenkun He,Qinhe Peng,Hanrong Ye,Yao Lu,Hongxu Yin,Yu Wang,Song Han,Han Cai
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 Findings
Abstract:We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
[CV-69] Attenuation-Resilient Alternating Optimization for Laparoscopic Liver Landmark Detection MICCAI2026
链接: https://arxiv.org/abs/2605.26630
作者: Lanqing Liu,Ruize Cui,Jialun Pei,Diandian Guo,Tiffany Y. So,Pheng-Ann Heng,Jing Qin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by MICCAI 2026
Abstract:Liver surface landmark detection is a fundamental prerequisite for anatomical guidance in laparoscopic liver surgery. However, it remains unreliable in practice due to two pervasive challenges: illumination attenuation in underexposed regions and the structural mismatch between pixel-wise localization and continuous curvilinear geometry. To address these limitations, we propose A2ONet, an attenuation-resilient alternating optimization network for robust liver landmark detection. To mitigate illumination attenuation, A2ONet embraces an illumination field compensation (IFC) block that adaptively enhances dark regions while preserving structural consistency. Meanwhile, we introduce a lightweight frequency-orientation selective filter (FOSF) to suppress repetitive texture interference and preserve salient curvilinear cues. Building upon these resilient representations, we design an alternating seg-curve optimization (ASCO) decoder that iteratively couples dense segmentation with explicit curve modeling, enabling mutual guidance to optimize both structural continuity and endpoint localization. Extensive evaluations on L3D-2K, L3D, and P2ILF demonstrate consistent improvements over competitive methods, establishing a more reliable foundation for intraoperative anatomy guidance. Our code will be available at this https URL.
[CV-70] DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction
链接: https://arxiv.org/abs/2605.26629
作者: Fuzhen Jiang,Zengtian Xie,Zhuoran Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Novel-view synthesis and 3D reconstruction from sparse posed images are central to robotics and AR/VR. Yet, feed-forward 3D Gaussian reconstruction fails under lowlight due to noise, color shifts, and unreliable correspondence. We propose DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework for clean novel-view rendering. We build a controllable multi-view lowlight benchmark by degrading only context views while keeping target views clean. We introduce a lightweight Lowlight Adapter for residual enhancement to improve matchability, and couple it with cost-volume-based multi-view inference to directly predict clean 3D Gaussians. Experiments show that DelowlightSplat significantly outperforms previous feed-forward method and two-stage pipeline under lowlight conditions.
[CV-71] MSCGC-KAN: Multi-scale Causal Graph Convolution and Kolmogorov-Arnold Feature Mapping for EEG Emotion Recognition
链接: https://arxiv.org/abs/2605.26624
作者: Haoliang Gong,Qingshan She,Jiale Xua,Yunyan Gao,Xugang Xi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Electroencephalogram (EEG)-based emotion recognition is an important affective computing task, and recent EEG foundation models provide useful generic representations for downstream adaptation. However, under the fine-tuning setting, three limitations remain prominent: insufficient modeling of multi-scale emotional dynamics, inadequate exploitation of inter-channel functional connectivity, and the limited expressive power of simple linear classification heads. To address these issues, this paper proposes a new EEG emotion recognition method, termed MSCGC-KAN, which introduces a structured task head composed of multi-scale causal graph convolution and Kolmogorov–Arnold feature mapping. Built on a pre-trained CBraMod backbone, MSCGC-KAN enhances downstream adaptation by jointly strengthening multi-scale temporal modeling, learnable inter-channel connectivity modeling, and nonlinear discriminative mapping within a compact task-specific head. This design preserves the representation advantage of the foundation model while making the classifier more sensitive to emotion-related spatiotemporal patterns. Extensive experiments are conducted on the public FACED and SEED-VII datasets. The proposed method achieves a balanced accuracy of 60.66%, a Cohen’s Kappa of 0.5525, and a weighted F1-score of 60.40% on FACED, and obtains 33.27%, 0.2223, and 33.64%, respectively, on SEED-VII. Compared with the CBraMod+Linear baseline, the balanced accuracy is improved by 5.91 and 2.03 percentage points on the two datasets, respectively. These results indicate that structured task-head design is an effective way to improve EEG emotion recognition when fine-tuning pre-trained EEG models.
[CV-72] MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation
链接: https://arxiv.org/abs/2605.26621
作者: Zichun Wang,Hairong Shi,Bingzheng Wei,Yan Xu,Zihua Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.
[CV-73] Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction
链接: https://arxiv.org/abs/2605.26616
作者: Zhenhua Du,Zhen Tan,Haoyu Zhang,Dewen Hu,Shuaifeng Zhi,Peidong Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 14 figures
Abstract:While 3D Gaussian Splatting has achieved remarkable success in photorealistic novel view synthesis, its pursuit of fast and high-fidelity 3D reconstruction has long been constrained by a trade-off between geometric accuracy and optimization efficiency. Methods specialized in image rendering converge quickly at the cost of imperfect geometry caused by superfluous primitives overfitting training views, while methods integrating neural signed-distance field (SDF) for better geometry incur prohibitive training costs. In this paper, we attempt to strike a better trade-off by tethering scaffold-anchored Gaussians to a jointly optimized sparse voxel scaffold. This hybrid Gaussian-Voxel representation explicitly confines anchored Gaussians to a narrow band around surfaces defined by voxelized SDFs, which effectively improves representation efficiency and condenses floating Gaussians without sacrificing geometry quality. An implicit surface tethering loss further pulls individual Gaussian primitives closer to SDF-induced surfaces in a mutually regularized manner for improved reconstruction accuracy. Extensive experiments on diverse real-world indoor scenes from ScanNet++, ScanNetv2, and DeepBlending datasets demonstrate that our method achieves state-of-the-art surface reconstruction quality as well as superior novel view synthesis against leading baselines, while maintaining fast training convergence and real-time rendering. Code will be available at this https URL.
[CV-74] FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling
链接: https://arxiv.org/abs/2605.26601
作者: Guixian Xu,Yide Liang,Zeli Su,Xuexian Song,Ziyin Zhang,Yushuang Dong,Ting Zhang,Xu Han
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注:
Abstract:Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone’s original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.
[CV-75] O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding
链接: https://arxiv.org/abs/2605.26584
作者: Peiran Wu,Yunze Liu,Chi-Hao Wu,Chen Chen,Junxiao Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Omnimodal large language models enable unified audio video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio visual association in noisy user generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training free plug in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53 \times speedup) and memory by 34.7% compared with full token inference.
[CV-76] rackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting
链接: https://arxiv.org/abs/2605.26576
作者: Yuyang Tan,Renhe Zhang,Hang Zhang,Ao Li,Xin Tan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.
[CV-77] Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer CVPR
链接: https://arxiv.org/abs/2605.26538
作者: Amey Sunil Kulkarni
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR NTIRE 2026
Abstract:Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID’s 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at this https URL.
[CV-78] Recursive Flow Matching
链接: https://arxiv.org/abs/2605.26535
作者: Jiahe Huang,Sihan Xu,Sharvaree Vadgama,Rose Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: Project page: this https URL
Abstract:Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20 \times speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.
[CV-79] ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
链接: https://arxiv.org/abs/2605.26525
作者: Akide Liu,Jinbo Xing,Chaojie Mao,Ye Li,Zeyu Zhang,Yefei He,Weijie Wang,Zihan Wang,Yu Liu,Gholamreza Haffari,Bohan Zhuang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL , Code: this https URL
Abstract:Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at this https URL.
[CV-80] CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence
链接: https://arxiv.org/abs/2605.26524
作者: Yuxu Lu,Dong Yang,Xiaoyu Li,Mengwei Bao,Congcong Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD ^+ ), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work can be available at this https URL.
[CV-81] InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
链接: https://arxiv.org/abs/2605.26520
作者: Zhiwei Ning,Wenwen Tong,Xiangli Kong,Shengnan Ma,Ziyi Shang,Jingcheng Ni,Tao Hu,Yong Xien Chng,Jixuan Ying,Zehuan Wu,Hanming Deng,Jie Yang,Yuanjie Zheng,Wei Liu,Lewei Lu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model’s capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.
[CV-82] R3: 3D Reconstruction via Relative Regression
链接: https://arxiv.org/abs/2605.26519
作者: Congrong Xu,Huachen Gao,Xingyu Chen,Yuliang Xiu,Jun Gao,Anpei Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow unbounded over time. Our solution, which we call R^3 , employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. R^3 supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: this https URL
[CV-83] CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimers Disease Pathologies
链接: https://arxiv.org/abs/2605.26514
作者: Geonwoo Baek,Ikbeom Jang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Confirming Alzheimer’s disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data’s spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.
[CV-84] Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression
链接: https://arxiv.org/abs/2605.26513
作者: Haojie Yin,Chengcheng Feng,Tianyi Liu,Tianqi Zhang,Kaizhu Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with another imaging of fundus photography (FP) could improve performance, as two ophthalmic medical imaging provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal model. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose the method of Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive margin based supervised contrastive learning. Then, our framework stabilizes the joint optimization with the sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.
[CV-85] Uncertainty-Aware Gaussian Map for Vision-Language Navigation
链接: https://arxiv.org/abs/2605.26503
作者: Jianzhe Gao,Rui Liu,Yuxuan Xu,Tongtong Cao,Yingxue Zhang,Zhanguang Zhang,Sida Peng,Yi Yang,Wenguan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent’s observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.
[CV-86] Unveiling the Frag ility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization AAAI2026
链接: https://arxiv.org/abs/2605.26501
作者: Xiang Fang,Wanlong Fang,Changshuo Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Publish in AAAI 2026
Abstract:Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy, a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations’ gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack with prevalent LVLMs.
[CV-87] 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation
链接: https://arxiv.org/abs/2605.26500
作者: Jianzhe Gao,Rui Liu,Wenguan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.
[CV-88] Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
链接: https://arxiv.org/abs/2605.26491
作者: Austin Wang,Jiaqi Han,Stefano Ermon,Yisong Yue
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
[CV-89] LongCat-Video-Avatar 1.5 Technical Report
链接: https://arxiv.org/abs/2605.26486
作者: Meituan LongCat Team:Xunliang Cai,Meng Cheng,Feng Gao,Zhe Kong,Jiamu Li,Le Li,Weiheng Li,Hongyu Liu,Shuai Tan,Xiaoming Wei,Tianyu Yang,Yong Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL Github: this https URL
Abstract:Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF Training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.
[CV-90] Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis
链接: https://arxiv.org/abs/2605.26483
作者: Jianzhe Gao,Churan Wang,Weiyi Zhang,Jianghua Li,Li-An Li,Wenguan Wang,Yixin Zhu,Yizhou Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states via a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.
[CV-91] Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
链接: https://arxiv.org/abs/2605.26478
作者: Haoxiang You,Yilang Liu,Davis Zong,Qian Wang,Teeratham Vitchutripop,Qi Wang,Daniel Rakita,Ian Abraham
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation, challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.
[CV-92] Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes
链接: https://arxiv.org/abs/2605.26475
作者: ZhiXin Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with birds-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
[CV-93] riadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules ICML2026
链接: https://arxiv.org/abs/2605.26470
作者: Junseo Bang,Dong Ju Mun,Hoigi Seo,Seongmin Hong,Se Young Chun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior arts have focused on how to develop each or all components, less attention has given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflict with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
[CV-94] AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation
链接: https://arxiv.org/abs/2605.26460
作者: Jian Zhang,Zhijun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
[CV-95] Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth
链接: https://arxiv.org/abs/2605.26456
作者: Kai Zheng,Qiang Feng,Xingjian Liu,Wenquan Tan,Yuan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, 2 tables
Abstract:Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI’s standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (one near-tie at 0.015 differs by 0.0013), and RMSE is comparable across encoders, with partial-convolution improving in three settings (by up to 0.31 unit) and losing by at most 0.11 unit in the other three.
[CV-96] Cross-scale Aligned Supervision for Training GANs
链接: https://arxiv.org/abs/2605.26449
作者: Sangeek Hyun,MinKyu Lee,Jae-Pil Heo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
[CV-97] Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting
链接: https://arxiv.org/abs/2605.26447
作者: Jiangbei Hu,Weichao Song,Shibo Yu,Mohan Wang,Zihan Yi,Rui Wu,Mingkang Xiang,Na Lei,Shengfa Wang,Zhongxuan Luo,Ying He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater scene reconstruction is essential for immersive exploration of aquatic environments, yet remains challenging due to complex participating-media effects such as absorption and scattering, as well as the limited field of view (FoV) of conventional cameras. Although combining panoramic imaging with 3D Gaussian Splatting (3DGS) offers a promising direction for photorealistic underwater rendering, traditional 3DGS struggles with both spherical projection distortion and underwater medium degradation. In this paper, we propose \textbfUnderwater360, a physics-informed omnidirectional 3DGS framework for underwater panoramic scene reconstruction. First, we introduce an Omnidirectional Gaussian Splatting module that performs ray casting directly in spherical camera space instead of relying on 2D projection approximations, thereby reducing geometric distortions under 360 ^\circ FoV. Second, we design a physics-based appearance-medium modeling architecture with pose-conditioned appearance embeddings to explicitly decouple intrinsic scene radiance from depth-dependent backscatter and attenuation, enabling physically grounded scene appearance restoration. Finally, we establish a new panoramic underwater benchmark dataset containing both synthetic and real-world scenes. Extensive experiments demonstrate that Underwater360 achieves superior performance in underwater novel view synthesis and scene appearance restoration, delivering improved rendering quality and cross-view consistency in complex underwater environments. The code and datasets are released at this https URL
[CV-98] Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective ECCV2024
链接: https://arxiv.org/abs/2605.26441
作者: Xiang Fang,Zeyu Xiong,Wanlong Fang,Xiaoye Qu,Chen Chen,Jianfeng Dong,Keke Tang,Pan Zhou,Yu Cheng,Daizong Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in ECCV 2024
Abstract:This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal this http URL, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment this http URL show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.
[CV-99] HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection
链接: https://arxiv.org/abs/2605.26421
作者: Senyuan Shi,Hao Tan,Zichang Tan,Shuhan Feng,Ajian Liu,Sergio Escalera,Jun Wan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures
Abstract:The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still leverage static prompt as a fixed boundary for real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose HydraPrompt, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (APA): (1) for authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content. While (2) for fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase pronounced discriminability within different synthetic images, we further introduce a Conditional Supervised Contrastive (CSC) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.
[CV-100] he Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP
链接: https://arxiv.org/abs/2605.26415
作者: Kahyeon Nam,Hyesong Choi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer’s Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% - 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
[CV-101] OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following
链接: https://arxiv.org/abs/2605.26399
作者: Qiaomu Miao,Haoyu Wu,Jingyi Xu,Minh Hoai,Dimitris Samaras
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM’s dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at this https URL.
[CV-102] Garment Particles: A 2D–3D Symmetric Garment Representation for Generation and Editing
链接: https://arxiv.org/abs/2605.26391
作者: Kiyohiro Nakayama,I-chao Shen,Ruofan Liu,Yiming Wang,Gordon Wetzstein,Takeo Igarashi
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Practical garment design spans two modes: intuitive creation from high-level intent, such as a reference image or text description, and complex low-level editing across 2D sewing patterns and 3D draped geometry, which requires professional training to navigate their complex interdependencies. Yet existing frameworks address only part of this challenge, offering either garment generation from casual inputs or direct editing on sewing patterns. To support both ends of the spectrum, we propose Garment Particles, a 5D point-cloud representation that jointly encodes 2D sewing patterns and 3D geometry. This representation enables Garment Particles Flow (GPF), a rectified flow framework that supports intuitive generation from high-level inputs (text, images, sketches) and various editing operations on 2D sewing patterns and 3D geometries via diffusion posterior sampling. Finally, we introduce Particles-to-Pattern Flow that converts generated garment particles into curved-based patterns for simulation. We validate our model’s generation ability on multiple datasets, achieving state-of-the-art garment generation results against competitive baselines. Our model also enables many garment editing scenarios, including garment interpolation, sewing pattern editing, point-cloud- and silhouette-conditioned garment generation. Our project website is at this https URL .
[CV-103] Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion
链接: https://arxiv.org/abs/2605.26383
作者: Dmytro Klepachevskyi,Alexander Wong,Sirisha Rambhatla,Yuhao Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs) - CLIP, DINOv2, DreamSim, I-JEPA, and SAM3 - and show that zero-shot methods fail, with the best baseline achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5% mAP to 52.8%.
[CV-104] Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation MICCAI2026
链接: https://arxiv.org/abs/2605.26382
作者: Mengchen Fan,Baocheng Geng,Xi Xiao,Tianyang Wang,Siyuan Mei,Pulin Che,Xiaoqian Jiang,Qizhen Lan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2026. 11 pages, 3 figures
Abstract:Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at this https URL.
[CV-105] Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery
链接: https://arxiv.org/abs/2605.26381
作者: Niels Sombekke,Rob G.J. Wijnhoven,Martin R. Oswald
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
[CV-106] VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes
链接: https://arxiv.org/abs/2605.26380
作者: Jingru Chen,Yiming Liu,Mingtao Chen,Sijie Chen,Richeng Xuan,Liang Yang,Zhichao Hu,Fanyang Lu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some ``think-with-images’’ benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 promninent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
[CV-107] BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma MICCAI2026
链接: https://arxiv.org/abs/2605.26376
作者: Junlin Yang,Tian Yu,Nicha C. Dvornek,Yuexi Du,Peiyu Duan,Annabella Shewarega,Lawrence H. Staib,James S. Duncan,Julius Chapiro
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Early accepted at MICCAI 2026
Abstract:Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On a HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p0.05), without supervision. The code is available at this https URL.
[CV-108] Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery
链接: https://arxiv.org/abs/2605.26370
作者: Luuk Versteeg,Rob G.J. Wijnhoven,Martin R. Oswald
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes – building height, roof slope, and roof azimuth – from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP _50 of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
[CV-109] Unified Panoramic Geometry Estimation via Multi-View Foundation Models
链接: https://arxiv.org/abs/2605.26368
作者: Vukasin Bozic,Isidora Slavkovic,Dominik Narnhofer,Nando Metzger,Denis Rozumny,Konrad Schindler,Nikolai Kalischek
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes.
[CV-110] Personalized Generative Models for Contextual Debiasing CVPR2026
链接: https://arxiv.org/abs/2605.26353
作者: Xinran Liang,Esin Tureci,Prachi Sinha,Ye Zhu,Vikram V. Ramaswamy,Olga Russakovsky
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at this https URL
Abstract:Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.
[CV-111] Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
链接: https://arxiv.org/abs/2605.26332
作者: Arian Komaei Koma,Seyed Amir Kasaei,AmirMahdi Sadeghzadeh,Mohammad Hossein Rohban
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26332 [cs.CV] (or arXiv:2605.26332v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2605.26332 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Arian Komaei Koma [view email] [v1] Mon, 25 May 2026 21:11:59 UTC (8,823 KB) Full-text links: Access Paper: View a PDF of the paper titled Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models, by Arian Komaei Koma and 3 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[CV-112] RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields
链接: https://arxiv.org/abs/2605.26328
作者: Chuhan Chen,Tianshu Huang,Akarsh Prabhakara,Chaithanya Kumar Mummadi,Zhongxiao Cong,Anthony Rowe,Matthew O’Toole,Deva Ramanan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to 3DV 2026. Project website: this https URL
Abstract:Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel data set of calibrated radar camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.
[CV-113] E3C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
链接: https://arxiv.org/abs/2605.26316
作者: Qiao Gu,Lingni Ma,Adam W Harley,Richard Newcombe,Florian Shkurti,Julian Straub
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint. Project Page: this https URL
Abstract:Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others’ actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E ^3 C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E ^3 C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer’s body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E ^3 C improves visual fidelity, camera-motion accuracy, object consistency, and ego exo human control over strong baselines, while also enabling intuitive scene editing.
[CV-114] Sleep-stage efficient classification using a lightweight self-supervised model
链接: https://arxiv.org/abs/2605.26295
作者: Eldiane Borges dos Santos Durães,João Batista Florindo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate classification of sleep stages is crucial for diagnosing sleep disorders and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a Linear SVM classifier to improve sleep stage classification. \textbfMethods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing ResNet-50 with 1D-convolutions used as time series encoder by a ResNet-18 backbone. Two other adaptations were conducted: the first one evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time series features, spectrogram features, and their concatenation as inputs to a Linear SVM classifier. \textbfResults: The results showed that reducing the volume of data offered a better cost-benefit ratio compared to simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. \textbfConclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.
[CV-115] CNNs Transformers Hybrid and Vision Language Models for Skin Cancer Detection ICPR
链接: https://arxiv.org/abs/2605.26294
作者: Durjoy Dey,Yuhong Yan,Hassan Hajjdiab
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science
Abstract:Skin cancer is a common and fast rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution transformer backbones, and vision language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening oriented requirements. Our results show that well tuned CNNs already provide strong baselines, but transformer based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP based VLM achieve the best overall trade off between ranking performance and clinically relevant operating points, while CLIP based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.
[CV-116] A multifractal-based masked auto-encoder: an application to medical images
链接: https://arxiv.org/abs/2605.26287
作者: Joao Batista Florindo,Viviane de Moura
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other basiline and state-of-the-art models. The proposed method also adds minimum computational overhead as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model’s ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
[CV-117] Benchmarking Convolutional Transformer Hybrid and Vision Language Models for Multi Disease Retinal Screening ALT
链接: https://arxiv.org/abs/2605.26283
作者: Durjoy Dey,Aymane Ajbar,Yuhong Yan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings
Abstract:Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.
[CV-118] VesselSim: learning 3D blood vessel segmentation without expert annotations MICCAI2026
链接: https://arxiv.org/abs/2605.26277
作者: Erin Rainville,Melissa Ananian,Tristan Mirolla,Hassan Rivax,Yiming Xiao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October
Abstract:Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle for the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and more importantly, expert annotations.
[CV-119] Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation CVPR
链接: https://arxiv.org/abs/2605.26273
作者: İsmail Emre Canıtez,Özgür Erkent
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, To be Presented at Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR, 2026
Abstract:Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at this https URL
[CV-120] Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion ICML2026
链接: https://arxiv.org/abs/2605.26266
作者: Tuna Tuncer,Felix Becker,Thomas Pfeil
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
备注: Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S
Abstract:Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
[CV-121] Dimensional Distribution Emotion State: Leverag ing Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis
链接: https://arxiv.org/abs/2605.26262
作者: Émile Bergeron,Tadagbé Dhossou,Sébastien Tremblay,Jean-François Lalonde
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition; their exhibitions are often designed to highlight these aspects. Recently, a new approach is being explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in the visitors, in order to maximize engagement, and as a way to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted, however, manually annotating the artworks by experts is a prohibitively labor-intensive process, and risks introducing the personal bias of curators. To assist the museum curators in their design of these exhibitions, we wish to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.
[CV-122] LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV I2AV and V2AV
链接: https://arxiv.org/abs/2605.26244
作者: Tengfei Liu,Yang Shi,Xuanyu Zhu,Jiafu Tang,Liu Yang,Qixun Wang,Zhuoran Zhang,Yuqi Tang,Fengxiang Wang,Yuhao Dong,Xinlong Chen,Bozhou Li,Bohan Zeng,Yue Ding,Xiaohan Zhang,Jialu Chen,Haotian Wang,Yuanxing Zhang,Pengfei Wan,Leye Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注:
Abstract:Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5–10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
[CV-123] RoMo: A Large-Scale Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation CVPR’26
链接: https://arxiv.org/abs/2605.26241
作者: Jiahao Zhang,Joseph Liu,Young-Yoon Lee,Seonghyeon Moon,Victor Zordan,Guy Tevet,Karen Liu,Stephen Gould,Oren Jacob,Haomiao Jiang,Mubbasir Kapadia,Yizhak Ben-Shabat
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR’26
Abstract:Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D Human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation, that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.
[CV-124] DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation
链接: https://arxiv.org/abs/2605.26236
作者: Ferdinand Paar,Lanmiao Liu,Aslı Özyürek,Serge Thill,Esam Ghaleb
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:
Abstract:Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose \emphDuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a \emphSemantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by \emphMotion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an \emphInertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.
[CV-125] Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos
链接: https://arxiv.org/abs/2605.26232
作者: Bonan Ding,Umair Nawaz,Ufaq Khan,Abdelrahman M. Shaker,Muhammad Haris Khan,Jiale Cao,Jin Xie,Fahad Shahbaz Khan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 8 figures, 7 tables, preprint
Abstract:Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth map, or dense temporal evidence. In such a scenario, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. Our UniMVU combines cross-modal self-attention with instruction-driven inner-modality gating module and a modality-level gating module with control token; for time-aligned streams we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D and MVBench), our UniMVU achieves consistent gains over static-fusion baselines achieving gains as high as 13.5 in terms of CIDEr metric. Further, our analysis shows that the gating mechanism aligns with the human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. Our UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.
[CV-126] Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction
链接: https://arxiv.org/abs/2605.26230
作者: Jin Hyeon Kim,Jaeeun Lee,Claire Kim,Kyoungjin Oh,Paul Hyunbin Cho,Jaewon Min,Yeji Choi,Jihye Park,Hyunhee Park,Minkyu Park,Seungryong Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.
[CV-127] AnySurf: Any Surface Generation with Directed Edge
链接: https://arxiv.org/abs/2605.26149
作者: Wenda Shi,Chenyuan Pan,Dengming Zhang,Yiren Song,Biao Zhang,Xingxing Zou
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open surface components prevail in real industrial 3D content and support rendering, physical simulation and geometric editing. Garments serve as a typical open surface type, with numerous existing generation methods leveraging sewing patterns to generate 2D panels and stitch them into 3D shapes. Such domain-specific designs lack scalability and cannot generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Though Trellis2 adopts field-free representation, its open surface results still contain normal and topology errors. We present AnySurf, a unified framework generating open, closed and hybrid 3D surfaces with accurate face orientation. Built on directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with merely 1% extra parameters, facilitating directed edge learning while preserving original generation performance. We further construct Outfit3D dataset containing industrial garments and closed accessories. Our work transforms garment modeling into a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.
[CV-128] VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
链接: https://arxiv.org/abs/2605.26144
作者: JunJia Guo,Yuhang Yao,Jiawei(Joe)Zhou,Jingdi Chen
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.
[CV-129] AssetGen: Deployable 3D Asset Generation at Interactive Speed
链接: https://arxiv.org/abs/2605.26137
作者: Dilin Wang,Xiaoyu Xiang,Kihyuk Sohn,Tom Monnier,Yu-Ying Yeh,Thu Nguyen-Phuoc,Jiawen Zhang,Yuchen Fan,Antoine Toisoul,Hyunyoung Jung,Prithviraj Dhar,Michael Bunnell,Nikolaos Sarafianos,Chuhang Zou,Roman Shapovalov,Andrea Vedaldi,Rakesh Ranjan
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.
[CV-130] Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography
链接: https://arxiv.org/abs/2605.27139
作者: Serge Brosset,Daniel del Pozo Bueno,Thomas David,Laure Guetaz,Philippe Ciuciu,Zineb Saghi
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Detectors (physics.ins-det)
备注: 22 pages, 12 figures
Abstract:Electron tomography (ET) plays an important role in the three-dimensional (3D) characterization of nanomaterials. However, under limited-angle and sparse-view conditions, conventional algorithms produce degraded reconstructions, which compromise the quality and interpretability of resulting 3D data. In this paper, we present deep image prior (DIP), an unsupervised deep learning (DL) approach, for highly degraded tomography acquisitions and demonstrate, using simulated data, that its performance is comparable to that of supervised approaches requiring training datasets, even for tilt ranges as limited as 60° and tilt increments of 10°. We then apply it to experimental data and show that it enables reliable 3D quantification under both sparse-view and limited-angle conditions, highlighting its potential for a wide range of materials and acquisition modalities.
[CV-131] Measuring Prediction Uncertainty in Neural Cellular Automata MICCAI2026
链接: https://arxiv.org/abs/2605.26726
作者: Ario Sadafi,Michael Deutges,Nassir Navab,Carsten Marr
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026
Abstract:Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ( \Delta Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.
人工智能
[AI-0] Algorithmic Monocultures in Hiring
链接: https://arxiv.org/abs/2605.27371
作者: Rishi Bommasani,Sarah H. Bana,Kathleen A. Creel,Dan Jurafsky,Percy Liang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Published at FAccT 2026. Website: this https URL
Abstract:Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human
[AI-1] Natural Language Query to Configuration for Retrieval Agents
链接: https://arxiv.org/abs/2605.27361
作者: Melissa Z. Pan,Negar Arabzadeh,Mathew Jacob,Fiodar Kazhamiaka,Esha Choukse,Matei Zaharia
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Modern retrieval agents expose many configuration choices – LLM, retriever, number of documents, number of hops, and synthesis strategy – each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration’s accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
[AI-2] GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis Research and Testing
链接: https://arxiv.org/abs/2605.27360
作者: Tamerlan Aghayev,Maxime Elkael,Michele Polese,Minh Dat Nguyen,Gabriele Gemmi,Andrea Lacava,Ali Saeizadeh,Reshma Prasad,Paolo Testolina,Angelo Feraudo,Soumendra Nanda,Pedram Johari,Salvatore D’Oro,Tommaso Melodia
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 18 pages, 16 figures
Abstract:Cellular research and development (RD) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable RD work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and mis-read specifications, which kills interoperability of RAN components at the first mistake, and they heavily rely on simulations for designing algorithms, which is notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.
[AI-3] Maat: The Agent ic Legal Research Assistant for Competition Protection
链接: https://arxiv.org/abs/2605.27331
作者: Basant Mounir,Farida Madkour,Amira Abdelaziz,Asmaa Sami
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure
Abstract:Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.
[AI-4] Modeling Agent ic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement Simulation and Dashboarding
链接: https://arxiv.org/abs/2605.27320
作者: Muhammad Zia Hydari,Raja Iqbal,Narayan Ramasubbu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN)
备注:
Abstract:Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.
[AI-5] Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
链接: https://arxiv.org/abs/2605.27286
作者: Yiding Liu,Yifan Hu,Hongjie Xia,Peiyuan Liu,Hongzhou Chen,Xilin Dai,Zewei Dong,Jiang-Ming Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most existing TSFMs remain univariate, and recent efforts to enable cross-variate modeling still operate directly within the raw variate space. This design introduces fundamental limitations in semantic alignment and relational expressivity. Specifically, raw-space group mixing lacks a dedicated mechanism to align heterogeneous physical quantities, while standard non-negative attention fails to capture the complex synergistic and antagonistic interactions ubiquitous in real-world systems. To address these challenges, we propose Falcon-X, decouples variates from the raw space and maps them into a unified latent prototype space. Falcon-X employs a Unified Prototype Diff-Attention mechanism that explicitly evaluates both positive and negative semantic affinities to explicitly align heterogeneous variates. Cross-variate interactions are then efficiently performed within this shared space via Latent Entity Attention, naturally facilitating zero-shot structural transfer. Finally, a Variate Reassembly Router robustly reconstructs variate-specific trajectories via a request-and-dispatch mechanism. Extensive evaluations on the GIFT-Eval and fev-bench benchmarks demonstrate that Falcon-X achieves state-of-the-art forecasting performance, offering a principled and scalable paradigm for complex multivariate environments. Falcon-X is publicly released to support future research.
[AI-6] FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
链接: https://arxiv.org/abs/2605.27284
作者: Xintong Hu,Xuhong Huang,Jinyu Zhang,Yutong Yao,Yuchong Sun,Qiuyue Wang,Mingsheng Li,Sicheng Xie,Yitao Liu,Junhao Chen,Yixuan Chen,Yingming Zheng,Shuai Bai,Tao Yu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 26 pages, 7 figures, 25 tables
Abstract:Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 10,816 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)–factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: this https URL
[AI-7] PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
链接: https://arxiv.org/abs/2605.27258
作者: Bowen Li,Shaotong Guo,Zhen Wang,Yang Xiang,Mingli Jin,Yihang Lin,Jiahui Zhao,Weibo Xiong,Dongrui Li,Keming Chen,Yunze Gao,Yuze Zhou,Zeyang Lin,Yue Liu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at this https URL.
[AI-8] LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
链接: https://arxiv.org/abs/2605.27254
作者: Oroel Ipas,Guillermo Gomez-Trenado,Rocío Romero-Zaliz,Isaac Triguero
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Comments: 18 pages, 4 figures, supplementary appendices included
Abstract:Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that carefully chosen labeled context sets can strongly outperform random selection under the same labeling budget. However, the cold-start setting, where instances must be selected before any labels are available, has received little attention in the TFM literature. This problem is fundamentally geometric. In vision and language, foundation models induce embedding spaces where simple geometric selection methods are effective. In contrast, tabular instance selection has so far been performed predominantly in the original tabular space, which lacks a natural metric; heterogeneous types, mixed scales, and nonlinear interactions make raw-space distances unreliable for context construction, and original-space selection falls below random on the majority of datasets as the budget grows. We propose LUCoS (Latent Unsupervised Context Selection), which replaces raw-feature geometry with the latent geometry induced by embeddings from an unsupervised Prior-Fitted Network (PFN) and selects representative medoids as context. Evaluated on 67 OpenML-CC18 datasets across six low-label budgets, LUCoS ranks first under mean AUC, ACC, and F1, with conclusions stable across metrics and dataset-level robustness checks. A gain decomposition reveals a simple mechanism: at the smallest budgets, the main benefit comes from enforcing coverage; as the budget increases, the decisive factor becomes the representation space in which coverage is measured. LUCoS mitigates failures of original feature space selection, showing that reliable unsupervised context selection depends less on selector sophistication than on defining representativeness in a meaningful representation geometry.
[AI-9] Many Logics One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)
链接: https://arxiv.org/abs/2605.27246
作者: Christoph Benzmüller,Daniel Kirchner,Luca Pasetto
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO)
备注: 21 pages, 6 figures; to appear (preprint)
Abstract:This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HOL), a line of research that expanded into a range of logic embeddings in HOL and inspired the LogiKEy logic-pluralistic knowledge representation and reasoning methodology. This paper advances the case for logical pluralism at object-logic level within a unifying meta-logical framework such as LogiKEy, grounding the argument in computational metaphysics. More broadly, it advocates principled support for logical pluralism in modern proof assistants, and cautions against logical imperialism – the rigid adoption of a single foundational logic for large-scale theory developments – which impedes the interdisciplinary reuse that LogiKEy is designed to enable.
[AI-10] Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
链接: https://arxiv.org/abs/2605.27209
作者: Yuxin Chen,Xiaodong Cai,Junfeng Fang,Zhuowen Han,Yu Wang,Yaorui Shi,Yi Zhang,Qi Gu,Xunliang Cai,Xiang Wang,An Zhang,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
[AI-11] he Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?
链接: https://arxiv.org/abs/2605.27176
作者: Shashwat Sourav,Viktoriia Baibakova,Sanjay Das,Ran Elgedawy,Maria Mahbub,Emily Herron,Tirthankar Ghosal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule, random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.
[AI-12] An investigation of AI integration in sound designer workflows and experiences
链接: https://arxiv.org/abs/2605.27174
作者: Nelly Garcia,Joshua Reiss
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences etc). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the on-going discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendation for sound technologist and developers based on our findings to guide the development of more informed AI tools for sound design.
[AI-13] Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
链接: https://arxiv.org/abs/2605.27164
作者: Mateusz Czyżnikiewicz,Ryszard Tuora,Adam Kozakiewicz,Tomasz Ziętkiewicz,Mateusz Galiński,Michał Godziszewski,Michał Karpowicz,Timothy Hospedales,Cristina Cornelio
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject–predicate–object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic this http URL also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question this http URL and data are available at this https URL.
[AI-14] Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLM s
链接: https://arxiv.org/abs/2605.27157
作者: Zhe Yu,Wenpeng Xing,Chen Ye,Xuyang Teng,Bo Yang,Changting Lin,Meng Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.
[AI-15] VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
链接: https://arxiv.org/abs/2605.27141
作者: Yuxin Chen,Yi Zhang,Zhengzhou Cai,Yaorui Shi,Zhiyuan Yao,Chenhang Cui,Jingnan Zheng,Yaqi Huo,Xi Su,Qi Gu,Xunliang Cai,Xiang Wang,An Zhang,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
[AI-16] StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
链接: https://arxiv.org/abs/2605.27140
作者: Yanfei Zhang,Xu Lin,Chenglin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller \alpha_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength \lambda_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.
[AI-17] ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules
链接: https://arxiv.org/abs/2605.27138
作者: Ruihao Pan,Suhang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
[AI-18] Scaling Benchmarking and Reasoning of Vision-Language Agents for Mobile GUI Navigation ICML2026
链接: https://arxiv.org/abs/2605.27134
作者: Heng Qu,Yike Liu,Renren Jin,Wenzong Zhang,Pengzhi Gao,Wei Liu,Jian Luan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
[AI-19] Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems
链接: https://arxiv.org/abs/2605.27133
作者: Xuan Lin,Chunlin Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 38 pages, 1 figure
Abstract:Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-backward-splitting (FBS) algorithm. In this paper, we continue our research on the most basic FBS-induced network, an architecture unrolled from the original FBS algorithm by incorporating direct parameter relaxations. Following the difference/differential inclusion formulations in our previous forward system analyses, we here consider some theoretical aspects of corresponding learning problems. Under some mild assumptions, we establish a general convergence property of the training problem of the basic FBS-induced network to the learning problem of the deep-layer limit system, implying a \Gamma -convergence argument showing that any cluster point of the optimal learning parameters for the network is a solution to the learning problem of the deep-layer limit system. A qualitative analysis of perturbation stabilities of these learning problems is also presented. A simple numerical experiment is conducted to validate our main general convergence result.
[AI-20] Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice
链接: https://arxiv.org/abs/2605.27131
作者: Oliver Angélil,Jan Migon
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 11 pages, 5 figures
Abstract:Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed decentralized domain ownership as a remedy, but pure implementations frequently underdeliver: teams inherit new responsibilities without the platform maturity, tooling, or coordination mechanisms needed to exercise them effectively. This paper argues that the flexibility-versus-control trade-off can be relaxed through an AI-augmented hub-and-spoke model layered on a modern lakehouse architecture. A central hub (Center of Excellence) provides shared platform services, policy automation, and AI-enabled governance, automatically standardizing data products, generating quality rules, drafting data contracts, and reviewing changes for regressions. Domain spokes own business semantics, product backlogs, and local iteration cadence, progressively assuming greater responsibility as they mature. The same LLMs that automate governance tasks also lower the barrier for domain practitioners to develop genuine cross-functional expertise spanning business and data engineering, enabling spoke teams to take on greater end-to-end ownership without proportionally increasing their dependence on the hub. Natural-language conversational interfaces further democratize access for business users, exposing historically underutilized enterprise data. On the organizational side, we propose a staged framework that shifts ownership from hub to spokes, avoiding both centralized bottlenecks and uncoordinated decentralization. We evaluate the architecture through three outcome metrics: data product adoption, time-to-find, and time-to-insight, that tie platform success to measurable business value rather than internal activity.
[AI-21] DEI: Diversity in Evolutionary Inference for Quality-Diversity Search ICML2026
链接: https://arxiv.org/abs/2605.27130
作者: John Donaghy,Shikhar Rastogi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)
Abstract:We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model’s inductive biases across all workers, DEI treats each LLM’s distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round’s population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
[AI-22] Position: AI Safety Requires Effective Controllability
链接: https://arxiv.org/abs/2605.27117
作者: Yige Li,Yunhao Feng,Jun Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define \emphcontrollability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce \controlbench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.
[AI-23] Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation
链接: https://arxiv.org/abs/2605.27115
作者: Tianlei Chen,Jiao Ou,Ziyuan Liu,Ruiming Tang,Jian Liang,Han Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers’ training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
[AI-24] High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework
链接: https://arxiv.org/abs/2605.27113
作者: Giuseppe Masi,Andrea Coletta,Novella Bartolini
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN’s Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.
[AI-25] Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?
链接: https://arxiv.org/abs/2605.27082
作者: Qingyuan Zeng,Ziyang Chen,Pengxiang Cai,Zixin Guan,Anglin Liu,Lang Qin,Xinyao Lai,Jintai Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.
[AI-26] ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference ICML2026
链接: https://arxiv.org/abs/2605.27081
作者: Xiongwei Zhu,Xiaojian Liao,Tianyang Jiang,Yusen Zhang,Liang Wang,Limin Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under this http URL on Jetson Orin NX, corresponding to a 1.77-1.99 \times decode speedup across diverse workloads. Checkpoints and usage instructions are available at this https URL. Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026) Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC) ACMclasses: I.2.6; C.1.3 Cite as: arXiv:2605.27081 [cs.LG] (or arXiv:2605.27081v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.27081 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-27] rust Region Q Adjoint Matching
链接: https://arxiv.org/abs/2605.27079
作者: Yonghoon Dong,Kyungmin Lee,Changyeon Kim,Jaehyuk Kim,Jinwoo Shin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating into a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL with pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter \lambda in SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of \lambda . As a result, our method can precisely control the exact deviation from pretrained flow policies, achieving stable off-policy RL. Through experiments on 50 OGBench tasks, TRQAM consistently outperforms prior arts in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improves the strongest baseline at 46%.
[AI-28] wo Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent
链接: https://arxiv.org/abs/2605.27078
作者: Chi-Ning Chou,Oscar Uzdelewicz,Neng-Chun Chiu,Yao-Yuan Yang,SueYeon Chung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented phenomena complicate this picture: in grokking, train loss falls rapidly while test performance improves abruptly only after a long delay; in epoch-wise double descent, train loss decreases monotonically while test loss or error rises and falls. Existing accounts are often task-specific, and a task-agnostic analysis framework for diagnosing and explaining these phenomena across realistic tasks and architectures is missing. We address this challenge by analyzing two competing processes that underlie learning dynamics: representation learning in the encoder and readout calibration in the final classifier. Using tools from representational geometry, neural tangent kernels, and linear probing, we show that both processes are active throughout training, with the fluctuations of their relative speed giving rise to seemingly anomalous generalization dynamics. Applying the representation-readout decomposition to grokking across a wide range of tasks and architectures, we find that the readout is train-biased before grokking onset, and representation learning is gradual but not absent, contrary to the lazy-to-rich account. The framework further provides diagnostic signatures distinguishing spurious from genuine generalization: in a previously reported MNIST grokking example and an epoch-wise double descent example, apparent delayed or non-monotone generalization is shown to arise from representation degradation and readout misalignment induced by non-standard training recipes. Together, these results establish the representation-readout decomposition as a top-down framework for understanding learning dynamics and revealing underlying algorithms for interpretability research.
[AI-29] raceable Knowledge Graph Reasoning Enables LLM -Assisted Decision Support for Industrial VOCs in the Steel Industry
链接: https://arxiv.org/abs/2605.27071
作者: Changqing Su,Yu Ding,Zuhong Lin,Hongyu Liu,Xi He,Zheng Zeng,Liqing Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent QA system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
[AI-30] ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification
链接: https://arxiv.org/abs/2605.27051
作者: Muhammad A. A. Pirzada,Weiqi Wang,Yiannis Charalambous,Konstantin Korovin,Lucas C. Cordeiro
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages; 6 figures
Abstract:Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6~programs) and LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks using ESBMC-LF to C; we denote those LF-Hard. We show that ConVer successfully verifies 67% of LF-Hard benchmarks overall.
[AI-31] BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
链接: https://arxiv.org/abs/2605.27044
作者: Ruifeng Tan,Jintao Dong,Weixiang Hong,Jia Li,Jiaqiang Huang,Tong-Yi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational data, is critical for battery optimization, manufacturing, and deployment. Battery degradation data exhibit two key characteristics. First, degradation data present a multi-level structure, including regularities shared within aging conditions and trajectory patterns shared across batteries. Second, degradation-related variations in voltage-current profiles are often localized to specific state-of-charge (SOC) intervals. Existing approaches often fail to explicitly model these characteristics. To bridge this gap, we propose BatteryMFormer, a multi-level Transformer for early BDTF. BatteryMFormer integrates (1) an aging-condition-aware decoder that injects aging-condition priors via aging-condition-informed queries and aging-condition-aware attention, (2) a meta degradation pattern memory that learns and retrieves trajectory prototypes to guide long-horizon forecasting, and (3) a dual-view encoder that jointly captures temporal dynamics and SOC-localized variations from voltage and current time series. Extensive experiments on four battery domains show that BatteryMFormer consistently outperforms state-of-the-art baselines, marking a significant step toward reliable BDTF. Our code is available at this https URL.
[AI-32] Lessons from Penetration Tests on Large-Scale Agent Systems
链接: https://arxiv.org/abs/2605.27042
作者: Kevin Eykholt,Dhilung Kirat,Xiaokui Shu,Jiyong Jang,Frederico Araujo,Ian Molloy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at SAGAI 2026
Abstract:As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems – developed under stricter coding standards and formal review processes – exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.
[AI-33] Less is More: Early Stopping Rollout for On-Policy Distillation
链接: https://arxiv.org/abs/2605.27028
作者: Zhou Ziheng,Jiaqi Li,Huacong Tang,Ying Nian Wu,Demetri Terzopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe ``Off-policy Teacher Decay’’ problem in this paradigm: for the later tokens, with student’s earlier trajectory as context that is off-policy to the teacher, the teacher’s ability to produce a corrective score would decay, and may fall back to token-completion behavior learned in the pre-training stage. We empirically verify this problem, and we propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that simply restricts the rollout generation to the first response tokens. We show that ESR both surpasses the full rollout OPD performance across model size, family, tasks and training regime, and exhibit much higher GPU efficiency and training stability, especially under cross model family scenarios. We further investigate the mechanism behind this surprising performance and discovered “Cascading Alignment” and “Sub-mode Commitment” effect of ESR that may explain why it works effectively and even sometimes exceeding the teacher model performance. Besides, we show that this position-based token selection strategy cannot be fully explainable by KL divergence and entropy signals. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.27028 [cs.LG] (or arXiv:2605.27028v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.27028 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-34] Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling
链接: https://arxiv.org/abs/2605.27023
作者: Yinan Liu,Wenjin Xu,Zhiyuan Zha,Xiaochun Yang,Bin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Knowledge graphs (KGs) have become the core backbone of numerous downstream tasks such as question answering and recommender systems. However, despite all this, KGs are often very incomplete. To perform zero-shot knowledge graph completion in unseen KGs, which have different relational vocabularies from those used for pre-training, KG foundation models (KGFMs) receive a wide range of attention. Existing KGFMs often perform training using random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often constructed with limited quality, providing weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples through the updated relation embeddings generated from the existing KGFM’s relation encoder. To further adaptively align with the evolving capability of the KGFM during the training process, KMAS adjusts the ratio of hard negative triples dynamically throughout the whole training process: after a warmup phrase, it increases the ratio linearly and then decreases linearly. Extensive experiments are conducted over 44 data sets. Experimental results demonstrate that our proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory consumption.
[AI-35] ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis
链接: https://arxiv.org/abs/2605.27022
作者: Phi Nguyen Xuan,Nicholas Tagliapietra,Lavdim Halilaj,Kristian Kersting,Juergen Luettin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress, the conceptual and methodological complexity of causal methods makes them largely inaccessible to domain experts. This gap prevents experts from leveraging these advances and hinders researchers who lack access to real-world data for validation. To bridge this divide, we introduce ORCA, a copilot for end-to-end causal analysis. ORCA orchestrates agents to understand the user’s goals and guide them through the most appropriate causal analysis workflow, from fully automatic to highly user-guided execution. It features causal discovery, causal effect estimation, explainability and Root-Cause-Analysis (RCA). ORCA evaluates and compares performance, generates key metrics and diagrams, and generates insights through structured reports. We highlight its effectiveness across several real-world use-cases.
[AI-36] Reason Ops: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning
链接: https://arxiv.org/abs/2605.27014
作者: Adnan Rashid
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 5 Pages
Abstract:Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents. Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning. However, current reasoning systems still suffer from hidden logical inconsistencies, hallucinated symbolic transitions, unsupported theorem applications, and limited reliability guarantees. Existing approaches remain fragmented across formal verification, runtime assurance, neuro-symbolic reasoning and trustworthy Artificial Intelligence (AI) research communities. This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems. Inspired by operational ecosystems such as DevOps and MLOps, ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task. The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle. The paper further presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems. We argue that operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems. Comments: 5 Pages Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.27014 [cs.LO] (or arXiv:2605.27014v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2605.27014 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-37] Generating Robust Portfolios of Optimization Models using Large Language Models ICML2026
链接: https://arxiv.org/abs/2605.27013
作者: Eleni Straitouri,Cheol Woo Kim,Milind Tambe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2026 LM4Plan Workshop
Abstract:Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles \unicodex2014 as a stochastic generator and as a reasoning evaluator \unicodex2014 and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.
[AI-38] Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
链接: https://arxiv.org/abs/2605.26942
作者: Paul Sigloch,Christoph Benzmüller
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
备注: Extended preprint version of accepted technical communication at KI 2026. 22 pages, 3 figures
Abstract:LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal symbolic methods with neural semantic analysis to provide complementary guarantees for LLM-generated content. This architecture employs logical reasoning for input verification, leveraging completeness properties to provide decidable guarantees on structured requirements. For output validation, embedding-based semantic similarity detects contextual hallucinations where formal methods lack expressiveness. This separation is realized in a parallel, actor-based pipeline, addressing limitations of prompt-based self-verification approaches, which inherit the distributional biases that produce hallucinations. The proposed architecture and type-aware verification method are validated with HAIMEDA, a real-world medical device damage assessment reporting system developed through Action Design Research. Evaluation shows hallucination detection rates of over 83% for structured entities and 72% for semantic fabrications, with a 30% reduction in report creation time, demonstrating that neuro-symbolic architectures can provide principled safeguards for LLM deployment in data-sensitive domains.
[AI-39] Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*
链接: https://arxiv.org/abs/2605.26938
作者: Izack Cohen
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: this https URL . Version corresponding to the accepted paper: v1.0.0
Abstract:Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*. Comments: Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: this https URL. Version corresponding to the accepted paper: v1.0.0 Subjects: Artificial Intelligence (cs.AI); Optimization and Control (math.OC) MSC classes: 90C05, 90C27, 90C35, 68Q25 Cite as: arXiv:2605.26938 [cs.AI] (or arXiv:2605.26938v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.26938 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-40] From Norms to Indicators (N2I-RAG ): An Agent ic Retrieval-Augmented Generation Framework for Legal Indicator Computation
链接: https://arxiv.org/abs/2605.26926
作者: Youssef Al Mouatamid,Marie Bonnin,Jihad Zahir
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from high risk of hallucinations and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, llm-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house constructed French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems, and generalizes well when tested on 2 different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.
[AI-41] ADDLE: A Tool-Augmented Agent for Detecting Deficient LLM -Generated Peer Reviews
链接: https://arxiv.org/abs/2605.26911
作者: Hanqi Duan,Xiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools – Verify, Correct, Complete, and Transform – orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at this https URL.
[AI-42] EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models
链接: https://arxiv.org/abs/2605.26910
作者: Xianheng Wang,Yige Yang,Damien Coyle
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages
Abstract:Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG-FM studies exhibit three critical limitations: opaque supervised baseline tuning, unverified contributions of complex learning paradigms, and a lack of transparency in model decision-making. To address these, we propose EEG-FM-Audit, a comprehensive evaluation and analysis pipeline designed to systematize the assessment of EEG-FMs. EEG-FM-Audit consists of three primary components: (1) an ASHA-driven benchmarking protocol that ensures fair comparisons by transparently optimizing supervised baselines; (2) paradigm-level ablation studies to evaluate the effectiveness of learning paradigms in FMs; and (3) a neurophysiological probing (NPP) framework, which explores whether FMs leverage valid temporal, spatial, and spectral EEG properties. We apply EEG-FM-Audit to four state-of-the-art EEG-FMs and five representative supervised models across three public datasets. Our results reveal that properly tuned supervised baselines can match or outperform advanced FMs, despite requiring significantly fewer parameters. Furthermore, we find that the effectiveness of learning paradigms of FMs is highly dependent on dataset scale and architecture. Finally, NPP analysis demonstrates how FMs rely on specific physiological features, establishing a framework for more interpretable neural decoding.
[AI-43] On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions
链接: https://arxiv.org/abs/2605.26908
作者: Malte Luttermann,Ralf Möller,Marcel Gehrke
机构: 未知
类目: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
备注:
Abstract:Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness and introduce a complementary algorithm with tighter worst-case bounds.
[AI-44] Practical Anonymous Two-Party Gradient Boosting Decision Tree
链接: https://arxiv.org/abs/2605.26903
作者: Huang Chenyu,Zhang Fan,Du Minxin,Chow Sherman SM,Chen Huangxun,Rao Huaming,Huang Danqing,Qian Bo,Chen Peng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 pages; 2026 IEEE Symposium on Security and Privacy (SP)
Abstract:Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a “dark forest”. Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (Usenix Security’ 23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.
[AI-45] Strategies for Guiding LLM s to Use Software Design Patterns: A Case of Singleton
链接: https://arxiv.org/abs/2605.26898
作者: Viktor Kjellberg,Farnaz Fotrousi,Miroslaw Staron
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at PROMISE 2026
Abstract:Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results shows that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code’s functionality. With guiding with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our result suggests that even simple strategies can be used to guide LLMs to use design patterns.
[AI-46] Negligible in Size Significant in Effect: On Scale Vectors in Large Language Models
链接: https://arxiv.org/abs/2605.26895
作者: Mingze Wang,Shuchen Zhu,Yuxin Fang,Binghui Li,Kai Shen,Shu Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 36 pages
Abstract:Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
[AI-47] Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation
链接: https://arxiv.org/abs/2605.26878
作者: Lulu Zheng,Wenjin Yang,Xiangwen Zhang,Rong Yin,Yulan Hu,Zheng Pan,Xin Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable implicit weights. We show empirically and theoretically that this aggregation-specific \emphweighting noise can create large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also increase with stakeholder count. We propose \textscDecompR: counterfactual-calibrated weights are fixed from query structure before candidate scoring, while per-role utilities are estimated independently, removing candidate-dependent weight drift and reducing estimation noise.
[AI-48] Knowledge Graphs as the Missing Data Layer for LLM -Based Industrial Asset Operations KDD2026
链接: https://arxiv.org/abs/2605.26874
作者: Madhulatha Mandarapu,Sandeep Kunkunuru
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 12 tables. Positions a typed knowledge-graph data layer orthogonally to the LLM-orchestration paradigms (Agent-As-Tool vs Plan-Execute) compared in AssetOpsBench (KDD 2026). Adds a same-model gpt-4.1 NLQ row and the IBM 3-axis rubric re-scoring. Code: this https URL
Abstract:LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios backed by CouchDB, YAML, and CSV. It compares LLM orchestration paradigms (Agent-As-Tool vs Plan-Execute) on a fixed data layer; we ask a complementary, orthogonal question: how much does the data model behind the tools affect agent performance? Building on the same scenarios, we introduce a knowledge graph layer (781 nodes, 955 edges, 16 relationship types) and evaluate three architectures: (1) deterministic graph handlers (no LLM) at 99% (137/139); (2) LLM-generated Cypher over the graph at 82-83% with the same GPT-4 model the baseline uses; and (3) the original tool-augmented LLM baseline at 65% (91/139, matching the published KDD 2026 leaderboard ceiling). Our key finding is inverted LLM usage: rather than asking the LLM to reason over raw data, we ask it to generate structured queries from a typed schema. The graph executes deterministically. We additionally contribute 40 graph-native scenarios (multi-hop dependency, vector similarity, PageRank criticality), and evaluate against the expanded HuggingFace AssetOpsBench release (467 scenarios, 6 domains), where deterministic handlers achieve 100% (467/467) with average score 0.848. These results suggest that for structured operational domains, the data layer – not the LLM orchestration – is the primary bottleneck, and that knowledge graphs serve as an integration layer between raw industrial data and LLM-based reasoning. Comments: 16 pages, 12 tables. Positions a typed knowledge-graph data layer orthogonally to the LLM-orchestration paradigms (Agent-As-Tool vs Plan-Execute) compared in AssetOpsBench (KDD 2026). Adds a same-model gpt-4.1 NLQ row and the IBM 3-axis rubric re-scoring. Code: this https URL Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) ACMclasses: H.2.8; I.2.7; I.2.4 Cite as: arXiv:2605.26874 [cs.DB] (or arXiv:2605.26874v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2605.26874 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sandeep Kunkunuru [view email] [v1] Tue, 26 May 2026 11:31:46 UTC (18 KB)
[AI-49] Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLM s
链接: https://arxiv.org/abs/2605.26835
作者: Yunbo Long,Haolang Zhao,Ge Zheng,Alexandra Brintrup
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as \textit``Which Tesla components use lithium from Australian mines?‘’ have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose \textitHelicase, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. \textitHelicase decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.
[AI-50] Periodic Topological Deep Learning for Polymer Design and Discovery
链接: https://arxiv.org/abs/2605.26833
作者: Yasharth Yadav,Tze Kwang Gerald Er,Atsushi Goto,Kelin Xia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 3 figures, 3 tables
Abstract:Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery challenging. Most machine learning approaches represent polymers as molecular graphs of a single repeating unit, thereby missing both the periodicity of polymer chains and many-body interactions beyond pairwise bonds. We introduce Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes that capture many-body interactions across multiple spatial scales, followed by a hierarchical simplicial message-passing (HSMP) encoder that propagates information from long-range interactions to covalent bonds, yielding representations enriched by higher-order topological features. Periodic-TDL outperforms all state-of-the-art models across polymer property prediction tasks spanning electronic, optical, physical, and thermal targets. Furthermore, we quantitatively validate how ester-to-amide substitution and \alpha -methylation enhance thermal stability. Using a computationally synthesized dataset of 48,208 structures-generated via systematic substitution of acrylate and acrylamide polymers-we observed a mean T_g increase of \sim 55^\circ C for ester-to-amide substitutions and \sim 14^\circ C for backbone \alpha -methylation across matched polymer pairs. To verify these predicted trends, we use our Periodic-TDL model to analyze six novel polymer pairs from independent experimental measurements, including three newly synthesized polymers previously unreported in the literature. The experimental data successfully confirmed the model’s predictions. Ultimately, these findings demonstrate that Periodic-TDL captures the underlying physical effects of specific functional group modifications, rather than merely optimizing predictive performance on benchmark datasets.
[AI-51] Innovation: An Almost Characterization of Hallucination
链接: https://arxiv.org/abs/2605.26808
作者: Nishant P. Das,Piyush Srivastava
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigating it. Towards this, Kalai and Vempala (STOC 2024) introduced a probabilistic framework formalizing calibration and hallucination, and showed that, with high probability, calibrated LLMs hallucinate roughly at the rate of the “missing mass”, a measure of how incomplete the training data is relative to its source. This raises two fundamental questions: (i) what property of a calibrated LLM makes hallucinations unavoidable? and (ii) can hallucinations be avoided by giving up calibration? We answer these questions by introducing a simpler property we call innovation that measures the tendency of a model to produce outputs outside the training data. We show that innovation is implied by the condition for hallucination identified by Kalai and Vempala, and, further, that it is an almost characterization of hallucination: hallucination implies innovation, and conversely, innovation implies hallucination with high probability. We also provide lower bounds on the hallucination rate based on the “innovation rate”, and by relating innovation rate back to missing mass, we obtain new hallucination rate lower bounds based on missing mass that extend the results of Kalai and Vempala.
[AI-52] HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
链接: https://arxiv.org/abs/2605.26807
作者: Jiajun Wu,Jian Yang,Tuney Zheng,Wei Zhang,Haowen Wang,Yihang Lou,Xianglong Liu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 27 pages, 11 figures. Code: this https URL
Abstract:LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure-27B-Refined reaches 50.6 on HTMLBench-400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi-K2.6 and GPT-5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.
[AI-53] What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
链接: https://arxiv.org/abs/2605.26795
作者: Xiang Wang,Wei Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just n^\star=2 – 3 tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
[AI-54] Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning
链接: https://arxiv.org/abs/2605.26789
作者: Zhe Yu,Wenpeng Xing,Yunzhao Wei,Jie Chen,Hongzhi Wang,Xuyang Teng,Meng Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability – as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2–11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
[AI-55] Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System
链接: https://arxiv.org/abs/2605.26786
作者: Silas Majyambere,Tony Lindgren,Workneh Y. Ayele,Celestin Twizere
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda’s healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.
[AI-56] Ratio-Variance Regularized Policy Optimization
链接: https://arxiv.org/abs/2605.26784
作者: Yu Luo,Shuo Han,Yihan Hu,Lei Lv,Huaping Liu,Fuchun Sun,Jianye Hao,Dong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ``soft brake’', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce \bf R^2\bf VPO (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across 7 LLM scales, spanning both fast and slow reasoning paradigms, and 10 robotic control tasks demonstrate the generality of the proposed approach. R ^2 VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
[AI-57] LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
链接: https://arxiv.org/abs/2605.26781
作者: Xiaohan Wang,Mingze Yin,Yilin Zhao,Gang Liu,Dian Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:
Abstract:Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) featuring an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) proposing a novel `Mock Exam’ evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5’s score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.
[AI-58] he Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
链接: https://arxiv.org/abs/2605.26778
作者: Zhe Yu,Wenpeng Xing,Yunzhao Wei,Bo Yang,Chen Ye,Gaolei Li,Meng Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation – a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model’s pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science’s reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.
[AI-59] owards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts
链接: https://arxiv.org/abs/2605.26776
作者: Changhao Miao,Yuntian Zhang,Tongyu Wu,Fang Deng,Chen Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhance expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.
[AI-60] Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
链接: https://arxiv.org/abs/2605.26772
作者: Kia-Jüng Yang,Dominik Meier,Jiachen Zhao,Terry Ruas,Bela Gipp
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in large reasoning models (LRMs) additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and CoT. This joint activation makes LRM more robust against activation-level interventions alone, but exposes CoT to a possible alternative surface attack.
[AI-61] Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability
链接: https://arxiv.org/abs/2605.26769
作者: Fatiha Tali-Otmani(EFTS, Grhapes)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced and validated. These systems are not neutral; they actively contribute to the marginalization of non-hegemonic epistemologies. This research draws upon educational sciences, critical technology studies, and disability studies to demonstrate that training datasets, which remain predominantly Anglophone and Western-centric, reinforce epistemic coloniality. The situation of persons with disabilities provides a particularly clear illustration of this phenomenon. Technological architectures frequently confine these individuals to reductive stereotypes or exclude them from the design process, leading to a double marginalization. This article examines whether a hybridization between the researcher and the machine might preserve epistemic plurality, while acknowledging the structural limitations inherent in algorithmic correction when used as a purely palliative strategy.
[AI-62] Adversarial Training for Robust Coverag e Network under Worst-case Facility Losses
链接: https://arxiv.org/abs/2605.26763
作者: Changhao Miao,Yuntian Zhang,Tongyu Wu,Fang Deng,Chen Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize the coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, making it effectively capture the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.
[AI-63] Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control
链接: https://arxiv.org/abs/2605.26754
作者: Zhe Yu,Wenpeng Xing,Gaolei Li,Shuguang Xiong,Hongzhi Wang,Xuyang Teng,Meng Han
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap – they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle – no agent capable of final synthesis may access untrusted natural-language evidence – and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.
[AI-64] A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
链接: https://arxiv.org/abs/2605.26747
作者: Heriberto Cuayahuitl,Grace Jang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: this https URL
[AI-65] Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models ICML2026
链接: https://arxiv.org/abs/2605.26733
作者: Xiao-Wen Yang,Ziyu Han,Xi-Hua Zhang,Wen-Da Wei,Jie-Jing Shao,Lan-Zhe Guo,Yu-Feng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that convergence toward stable fixed points while preserving effectiveness represents a promising way. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.
[AI-66] owards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation ICML2026
链接: https://arxiv.org/abs/2605.26720
作者: Yee Hin Chong,Jiaming Wu,Youhui Zhang,Peng Qu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026 accpeted, camera-ready in progress
Abstract:Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce \textttCUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \textttCUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied. Comments: ICML 2026 accpeted, camera-ready in progress Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26720 [cs.AI] (or arXiv:2605.26720v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.26720 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-67] SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation KDD2026
链接: https://arxiv.org/abs/2605.26704
作者: Haochun Wang,Sendong Zhao,Jingbo Wang,Yanrui Du,Bing Qin,Ting Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ACM SIGKDD 2026
Abstract:Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that induce distribution shifts at policy intervention points. This renders data-driven models unreliable under distribution shift. We propose \textbfSL-BiLEM (Structured Learnable Behavior-in-the-Loop Epidemic Model), leveraging physical constraints as regularization for robust extrapolation. The framework decomposes effective transmission as \beta_\texteff(t,g) = \beta_0(g) \times m_\textpolicy(t) \times m_\textmedia(t) \times m_\textcomp(t,g) , where monotonicity, smoothness, and bounded-jump constraints on the learned compliance function maintain predictive validity under novel policy regimes. Beyond forecasting, SL-BiLEM enables counterfactual analysis for intervention decision support. We validate forecasting on three real-world datasets (cruise ship, school influenza, and school-district COVID-19 surveillance) and evaluate counterfactual recovery on synthetic benchmarks with known ground truth. SL-BiLEM demonstrates: (1) 76% improvement over neural-mechanistic baselines, with only 53% OOD degradation versus 1142% for neural baselines under policy-induced shift; (2) 100% bootstrap CI coverage across 27 synthetic counterfactual experiments; and (3) Treatment Effect Accuracy exceeding 0.85. These results establish SL-BiLEM as an interpretable tool for public health decision-makers seeking accurate prediction and principled intervention planning.
[AI-68] Model Merging on Loss Landscape: A Geometry Perspective CVPR2026
链接: https://arxiv.org/abs/2605.26693
作者: Juanwu Lu,Anand Bhaskar,Brian Axelrod,Ekaterina Tolstaya,Tristan Emrich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: CVPR 2026 Findings Track. 18 pages, 4 figures, 6 tables
Abstract:Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.
[AI-69] Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
链接: https://arxiv.org/abs/2605.26691
作者: Yunhui Gan,Tan Pan,Kaiyu Guo,Limei Han,Weimiao Yu,Guangnan Ye,Chen Jiang,Yuan Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Particularly, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
[AI-70] Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
链接: https://arxiv.org/abs/2605.26690
作者: Ashima Khanna,Dominik Grimm
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round’s best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: this https URL
[AI-71] Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agent ic Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2605.26684
作者: Xin Cheng,Shuo He,Lang Feng,HaiYang Xu,Ming Yan,Lei Feng,Bo An
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.
[AI-72] Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing
链接: https://arxiv.org/abs/2605.26679
作者: Minh K. Quan,Pubudu N. Pathirana
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY
Abstract:Cross-slice attack attribution in 6G networks requires identifying causal propagation chains through shared infrastructure in under 100 ms. Existing methods struggle to satisfy this strict SLA without sacrificing accuracy, because shared resource contention creates spurious correlations that are indistinguishable from genuine causal links under standard Granger tests. We propose DA-GC, a certified causal attribution framework that integrates resource-conditioned Granger causality with an axiomatically derived Resource Contention Model (RCM) to systematically block resource-mediated confounding. On a 15-slice production-emulation 6G testbed with 1,100 attack scenarios, DA-GC achieves 89.2% attribution accuracy at 87 ms. This represents a 7.9 percentage-point improvement over the strongest baseline at 2.7x lower latency, alongside demonstrated cross-topology generalization and concept-drift resilience. Crucially, DA-GC is backed by a comprehensive formal certification stack. We provide mathematically proven validity certificates for statistical soundness under serially dependent telemetry and piecewise-stationarity. Furthermore, we establish strict security bounds, including an adversarial utilization spoofing breakdown point of \delta^* \approx 0.95 , and define the minimum differential-privacy noise required for a provably private and robust deployment.
[AI-73] MemFail: Stress-Testing Failure Modes of LLM Memory Systems
链接: https://arxiv.org/abs/2605.26667
作者: Ishir Garg,Neel Kolhe,Dawn Song,Xuandong Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations – summarization, storage, and retrieval – and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.
[AI-74] Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
链接: https://arxiv.org/abs/2605.26657
作者: Wolfgang Maass,Sabine Janzen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emphcompletion (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emphoptimality (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty’s equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap ( \Delta M_\textfinal = 0.271 ) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at H = 15 consistent with the H^* boundary ( H^* \in [6, 14] under the NBA parameters).
[AI-75] Bilevel Optimization over Saddle Points of Zero-Sum Markov Games ICML2026
链接: https://arxiv.org/abs/2605.26654
作者: Zihao Zheng,Irwin King,Songtao Lu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Accepted to the International Conference on Machine Learning (ICML 2026)
Abstract:Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an \epsilon -stationary point in \tilde\mathcalO(\epsilon^-1) iterations with sample complexity \tilde\mathcalO(\epsilon^-3) , matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
[AI-76] More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
链接: https://arxiv.org/abs/2605.26647
作者: Mingze Wang,Jinbo Wang,Yikuan Xia,Kai Shen,Shu Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 31 pages
Abstract:Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
[AI-77] ail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2
链接: https://arxiv.org/abs/2605.26628
作者: Zhanfeng Feng,Shuai Guo,Xin Di,Long Peng,Yang Cao,Zhengjun Zha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in both Wan2.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation-tail-aware percentile calibration module for channel-mask construction. Together with compact PTQ-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged.
[AI-78] FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning
链接: https://arxiv.org/abs/2605.26615
作者: Hyungyu Choi,Young Kyun Jang,Chanho Eom
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, IEEE/TIP 2026 accepted
Abstract:Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global image-lengthy caption pairs and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.
[AI-79] Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
链接: https://arxiv.org/abs/2605.26606
作者: Woojeong Kim,Ziyi Yang,Jing Nathan Yan,Jialu Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to 1.9\times faster than GRPO and 4.0\times faster than DAPO in cumulative rollouts.
[AI-80] Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition
链接: https://arxiv.org/abs/2605.26600
作者: Guanqun Zhao,Yitong Liu,Jiaxuan Fang,Yufei Mao,Hongwen Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spectral instability, and semantic drift. To address these challenges, we propose Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework that couples Virtual Adversarial Augmentation (VAA) with a semantic consistency loss. We provide a theoretical analysis indicating that this strategy acts as an implicit spectral regularizer for the encoder, enabling stable manifold exploration. Complementing this, our Signal-Adaptive Swin Backbone with fixed-window attention improves structural stability by constraining attention locality, while a Hybrid Knowledge Fusion module anchors representations with physical priors. Experiments on RML benchmarks show that DyCo-CL achieves a 6.27% accuracy gain in 1-shot settings over prior methods.
[AI-81] AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
链接: https://arxiv.org/abs/2605.26596
作者: Haoran Zhang,Zhaohua Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures. Code and data: this https URL
Abstract:The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward = 0.05 despite 1.3-13.3x realized compression. We name and characterize this failure mode as action-grammar destruction – the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels (~2ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining = 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.
[AI-82] Cordyceps: Covert Control Attacks on LLM s via Data Poisoning
链接: https://arxiv.org/abs/2605.26595
作者: Zedian Shao,Charles Fleming,Teodora Baluta
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across 5 LLMs, 3 backdoor defenses, and 4 prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about 40% relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to 93% attack success rate after backdoor defenses and up to 98% after prompt injection defenses. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2605.26595 [cs.CR] (or arXiv:2605.26595v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2605.26595 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-83] Examining the Challenges of Intellectual Property in AI-Generated Productions
链接: https://arxiv.org/abs/2605.26590
作者: Ali Mazhar,Mohammad Zare,Marjan Veysi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, musical works, and even inventions without direct human intervention, the intellectual property (IP) regime faces unprecedented questions and challenges. The most critical issue concerns the ownership of moral and economic rights in the absence of a human creator, and how such outputs can be granted legal protection. This paper first reviews the theoretical foundations and existing literature in this domain, then comparatively examines Iranian legal frameworks such as the 1969 Law for the Protection of Authors, Composers, and Artists Rights and the Patent and Trademark Registration Law-alongside other legal systems, including the European Union, the United Kingdom, and the United States. Furthermore, existing legal perspectives on the intellectual property of AI-generated works and the related enforcement challenges are analyzed. The findings reveal significant regulatory gaps within the current Iranian legal framework. To balance the promotion of innovation with the preservation of human creativity, revising existing laws and introducing novel approaches such as defining a specific intellectual property right for AI-generated works or designating ownership among associated human agents appears to be essential.
[AI-84] Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift
链接: https://arxiv.org/abs/2605.26589
作者: Yusuf Brima,Marcellin Atemkeng,Lansana Hassim Kallon,David Niyukuri,Antoine Vacavant,Samuel Saidu,Ding-Geng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizability. We evaluate a transformer-based tabular foundation model against classical supervised methods under cross-country and data-scarce settings. We used DHS data from 16 countries across Africa, Asia, Latin America, the Caucasus, and the Middle East (n=68,856). We compared Logistic Regression, XGBoost, LightGBM, and TabPFN v2.6. Performance was assessed using AUC-ROC, Brier score, and ECE. Generalization was evaluated using leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. Subgroup analyses included sex, age, residence, maternal education, and wealth. Feature importance was estimated using SHAP. TabPFN outperformed classical models in low-data regimes (200 samples), showing higher discrimination and better calibration. Across countries, it achieved the lowest Brier score (0.042) and ECE (0.203). Under full-data settings, AUC-ROC ranged from 0.59-0.76 with small between-model differences ( \leq 0.05 ). LOCO performance was stable (0.58-0.69), driven by country context. Reverse-LOCO showed asymmetric transferability. Subgroup performance was consistent with no systematic demographic bias. SHAP identified child age, altitude, and height-for-age z-score as dominant predictors, followed by wealth and maternal education. Performance in childhood anemia prediction is driven more by population variation than model choice. TabPFN provides advantages in low-resource settings through improved discrimination and calibration, highlighting foundation models as promising tools for data-scarce global health prediction. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2605.26589 [cs.LG] (or arXiv:2605.26589v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.26589 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yusuf Brima [view email] [v1] Tue, 26 May 2026 06:20:20 UTC (646 KB)
[AI-85] On the Error-Correcting Effects of Stochasticity in Discrete Diffusion
链接: https://arxiv.org/abs/2605.26582
作者: William Yuan,Sungwon Jeong,Amirali Aghazadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the \emphdegree of stochasticity in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by \emphredundant transitions that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose \emphDiscrete Churn and Restart Sampling (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the low number of function evaluations regime. On image datasets, DCRS achieves up to a 10\times reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.
[AI-86] Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial
链接: https://arxiv.org/abs/2605.26577
作者: Haoyu Li,Xiangru Zhong,Hao Cheng,Bin Hu,Huan Zhang
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注: ACC 2026 Tutorial
Abstract:Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performance. However, in safety-critical scenarios such as autonomous driving, robotics, and power systems, empirical performance alone is insufficient, and formal verification of controller properties such as stability and safety is highly desirable. Unfortunately, many prior verification approaches are either tied to specific structural assumptions on the system or the certificate, making them difficult to transfer across settings, or suffer from poor scalability on higher-dimensional neural network systems. In this tutorial, we present a unified framework that aims to mitigate this gap via bridging control with the state-of-the-art neural network verifier \alpha,!\beta -CROWN (alpha-beta-CROWN). At its core, \alpha,!\beta -CROWN is a general-purpose bounding engine for nonlinear functions represented as computation graphs: given an input domain, it can produce certified bounds and explicit linear relaxation of the nonlinear function. These certified bounds are useful on their own for tasks such as reachability analysis, and they also provide the foundation for more complex routines that perform satisfiability checking and optimization. More specifically, many control problems reduce to verifying real-valued inequalities over a state domain (e.g., Lyapunov theory). Consequently, \alpha,!\beta -CROWN enables scalable verification of such conditions by computing tight bounds and recursively partitioning and pruning subdomains based on the bounds. Thanks to GPU parallelization, this pipeline demonstrates superior scalability on verification and optimization problems that are challenging for traditional approaches. In this tutorial, we discuss the basics of \alpha,!\beta -CROWN and introduce its application to various control-related tasks.
[AI-87] MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
链接: https://arxiv.org/abs/2605.26567
作者: Yuhao Shen,Lang Cao,Simo Du,Yuqing Wang,Juexiao Zhou,Hao Peng,Yue Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. Theses data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.
[AI-88] Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice ICML2026
链接: https://arxiv.org/abs/2605.26559
作者: Yingshuo Wang,Xian Sun,Yanhang Li,Zhichao Fan,Zexin Zhuang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Econometrics (econ.EM)
备注: 5 pages, 1 table. Accepted at the FMSD Workshop, ICML 2026
Abstract:Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model’s predictions as additional information. The result is a model that inherits the foundation model’s accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.
[AI-89] Linear and Neural Dueling Bandits with Delayed Feedback
链接: https://arxiv.org/abs/2605.26554
作者: Xiangyi Wang,Pingchen Lu,Jie Mao,Mingze Kong,Zhi Hong,Zhiyong Wang,Zhongxiang Dai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our propose. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26554 [cs.LG] (or arXiv:2605.26554v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.26554 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xiangyi Wang [view email] [v1] Tue, 26 May 2026 05:07:25 UTC (343 KB) Full-text links: Access Paper: View a PDF of the paper titled Linear and Neural Dueling Bandits with Delayed Feedback, by Xiangyi Wang and 6 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-05 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-90] Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
链接: https://arxiv.org/abs/2605.26552
作者: Jaewoo Lee,Hyeongyu Kang,Dohyun Kim,Kyuil Sim,Woocheol Shin,Minsu Kim,Taeyoung Yun,Jeongjae Lee,Sanghyeok Choi,Tabitha Edith Lee,Jongchul Ye,Jinkyoo Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotics manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GAN, drifting model, consistency models, and flow maps, scaling from ImageNet- 256 to 1024 ^2 text-to-image synthesis. Code is available at this https URL.
[AI-91] MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
链接: https://arxiv.org/abs/2605.26546
作者: Runxi Huang,Liyu Zhang,Shengzhong Liu,Xiaomin Ouyang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at this https URL .
[AI-92] PolyFusionAgent : A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design
链接: https://arxiv.org/abs/2605.26543
作者: Manpreet Kaur,Xingying Zhang,Qian Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 5 figures, 2 tables; Supplementary material included
Abstract:Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.
[AI-93] ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation ICML2026
链接: https://arxiv.org/abs/2605.26542
作者: Xiaochong Jiang,Shiqi Yang,Ziwei Li,Lifei Liu,Haoran Yu,Yichen Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
Abstract:Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters, and enterprise services at runtime. This creates a safety gap in tool composition: an agent can satisfy every per-tool permission check and still produce an unsafe end-to-end effect, such as reading a confidential document, summarizing it, and sending the summary to an external endpoint. We call this failure mode permission laundering. ChainCaps addresses it with a runtime rule: every value carries a sink-specific capability budget, and tool composition propagates budgets by intersection. A value can preserve or lose authority as it moves through a tool chain, but it cannot gain new authority through composition. We implement ChainCaps as a transparent MCP proxy that requires no changes to the agent or tool servers. On 82 tasks across five frontier models from three providers, ChainCaps reduces attack success rate from 25-68% to 0-4.8% while preserving 96-100% benign completion. In replay experiments, it also outperforms scalar-IFC and per-function-isolation baselines. Manifest quality is the dominant deployment bottleneck: expert manifests reach 100% attack blocking, while naive manifests fall to 27.3%. Our claims are limited to explicit-flow composition safety under trusted manifests and proxy-visible data movement, a practical gap in deployed tool-using agents today.
[AI-94] Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
链接: https://arxiv.org/abs/2605.26530
作者: Chen Linze,Cai Yufan,Hou Zhe,Dong Jin Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.
[AI-95] StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting
链接: https://arxiv.org/abs/2605.26523
作者: Minh K. Quan,Pubudu N. Pathirana
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ACM MobiSys 2026
Abstract:Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.
[AI-96] Dense2MoE: Pushing the Pareto Frontier of On-Device LLM s via Unified Pruning and Upcycling
链接: https://arxiv.org/abs/2605.26496
作者: Fengfa Li,Hongjin Ji,Yifeng Ding,Lei Ren,Chen Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages
Abstract:The Mixture of Experts MoE architecture is highly promising for resource constrained on device deployments yet training these models from scratch incurs prohibitive costs Current methods attempt to alleviate this by upcycling dense models into MoEs however they often introduce parameter redundancy that degrades inference efficiency Alternatively standard layer pruning mitigates redundancy but inevitably compromises model accuracy To resolve this dilemma we propose Dense2MoE a novel framework that unifies pruning and upcycling through Layer Fusion UpCycling LF UC Guided by hardware Roofline theory Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth heavy attention modules from redundant layers while repurposing their Multi Layer Perceptrons MLPs into MoE experts This structural innovation preserves the models core capabilities and strictly limits active parameters via selective token routing With a modest continual pre training budget Dense2MoE efficiently converts publicly available dense LLMs into on device ready MoE models Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on device inference latency versus model accuracy outperforming dense baselines state of the art compression and standard upcycling methods
[AI-97] Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection
链接: https://arxiv.org/abs/2605.26468
作者: Yuxuan Yin,Chen He,Todd Jacobs,Jialei He,Boxun Xu,Robert Jin,Peng Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures
Abstract:Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We propose the first unsupervised anomaly detection framework incorporating a Diffusion Transformer. Raw test measurements are first compressed by an autoencoder, then reshaped into a structured token sequence enriched with sinusoidal and per-device wafer-position embeddings. Anomaly scores are derived from the noise-prediction error over mid-range diffusion timesteps, enabling fast wafer-scale screening without any labeled defects or manual feature engineering. Our approach achieves state-of-the-art performance on industrial 16nm IC test data under extreme class imbalance, offering interpretable failure localization through latent-space reconstruction residuals.
[AI-98] DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection
链接: https://arxiv.org/abs/2605.26446
作者: Yuxin Yang,Limei Hu,Feng Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.
[AI-99] Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models
链接: https://arxiv.org/abs/2605.26434
作者: Aditya Kommineni,Emily Zhou,Kleanthis Avramidis,Simon Bock Segaard,Jeppe Roden Münster,Andreas Peter Juhl Hansen,Takfarinas Medani,Tiantian Feng,Richard Leahy,Shrikanth Narayanan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 13 figures, 3 tables
Abstract:EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller supervised models in low-resource settings compared to fully supervised models. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically-generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction based objectives. Together, these findings elucidate a failure mode in reconstruction based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.
[AI-100] When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
链接: https://arxiv.org/abs/2605.26418
作者: Guilin Zhang,Chuanyi Sun,Kai Zhao,Shahryar Sarkani,John Fossaceca
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
[AI-101] Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models
链接: https://arxiv.org/abs/2605.26409
作者: Hayden Helm,Xiaodong Liu,Weiwei Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Evaluating and mitigating a generative system’s susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of 0.94 for susceptibility detection with \approx98% fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ( +2% , p = 0.03 ) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.
[AI-102] From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
链接: https://arxiv.org/abs/2605.26403
作者: Xiaohua Wang,Jiakang Yuan,Zisu Huang,Muzhao Tian,Changze Lv,Kaitao Song,Tao Chen,Xiaoqing Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift–a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
[AI-103] Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
链接: https://arxiv.org/abs/2605.26371
作者: Sarthak Dayal,Abhinav Peri,Carl Qi,Claas Voelcker,Alexander Levine,Caleb Chuck,Amy Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierarchical counterparts by discovering and reusing temporally-extended skills. However, obtaining skills that are actually reusable remains an open challenge. Towards this end, we focus on abstractions that exploit the intuition of local dynamics: local transitions in different global contexts require similar kinds of action sequences. By aligning these contexts with the action sequences they require, we are able to learn which skills to reuse and where to reuse them. In principle, this information should benefit many HRL algorithms, where high-level policies have to reason about the low-level skills they use. The resulting algorithm CARL (Contrastive Action-based Representations for Reusable Local Control) shows both qualitative clustering of meaningful skills in complex humanoid environments and improved downstream performance on the OGBench benchmark when integrated with HIQL.
[AI-104] Automatic Layer Selection for Hallucination Detection ICML2026
链接: https://arxiv.org/abs/2605.26366
作者: Xinpeng Wang,William Cao,Andrew Gordon Wilson,Zhe Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICML 2026
Abstract:Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers than in the final layer of large language models (LLMs). Although a growing body of work has sought to exploit this property for hallucination detection, how to automate the selection of high-performing layers remains underexplored, and principled methods for this purpose are still lacking. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and evaluate corresponding criteria for automatic layer selection across diverse LLM architectures, scales, and tasks, covering both question answering and summarization hallucination detection benchmarks. However, we find that none of these criteria consistently delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identify optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. In addition, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy, which further amplifies hallucination-related signals and substantially improves overall detection performance. Code is publicly available at this https URL
[AI-105] When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning
链接: https://arxiv.org/abs/2605.26350
作者: Chenghao Qiu,Chunli Peng,Yufeng Yang,Kuan-Hao Huang,Yi Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at this https URL.
[AI-106] Managing Uncertainty in LLM -Generated Procedural Knowledge for Virtual Laboratory Planning
链接: https://arxiv.org/abs/2605.26333
作者: Polychronis Karpodinis,Dimitris Kalles
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Educational virtual laboratories can make experimental training more scala-ble, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large lan-guage models can assist in this authoring process by generating detailed ex-perimental procedures, but their output should not be treated as directly exe-cutable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtu-al laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational vir-tual laboratories, the underlying problem is more general: managing uncer-tain procedural knowledge for action planning in structured interactive envi-ronments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.
[AI-107] JobBench: Aligning Agent Work With Human Will
链接: https://arxiv.org/abs/2605.26329
作者: Yuetai Li,Yichen Feng,Zhangchen Xu,Zixian Ma,Kaiyuan Zheng,Fengqing Jiang,Xinghua Sun,Rulin Shao,Zichen Chen,Yue Huang,Xinyang Han,Brian Lee,Kayla Xu,Shenglai Zeng,Hang Hua,Xiangliang Zhang,Basel Alomair,Ranjay Krishna,Luke Zettlemoyer,Pang Wei Koh,Bhaskar Ramasubramanian,Luyao Niu,Xiang Yue,Radha Poovendran
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus~4.7 under Claude Code, reaches only 45.9 %. We hope JobBench shifts the community’s target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
[AI-108] Semigroup Consistency as a Diagnostic for Learned Physics Simulators
链接: https://arxiv.org/abs/2605.26324
作者: Lennon J. Shikhman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注: 10 pages, 3 figures, 3 tables. Accepted to the AI4Physics Workshop at the 43rd International Conference on Machine Learning
Abstract:Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in temporal composition and long-horizon rollout. For autonomous, state-complete systems, exact solution maps satisfy a semigroup law: direct evolution over s+t should agree with evolution over s followed by t . We propose normalized semigroup error as a post hoc, model-agnostic diagnostic comparing these direct and composed learned predictions. On one-dimensional heat and Burgers dynamics with time-conditioned ConvNet and FNO baselines, semigroup error is positively associated with rollout degradation, with trajectory-level Spearman correlation \rho = 0.635 and 95% CI [0.621, 0.649] . Semigroup regularization has mixed effects, supporting semigroup consistency primarily as an evaluation diagnostic rather than a universally beneficial training objective.
[AI-109] OmniToM: Benchmarking Theory of Mind in LLM s via Explicit Belief Modeling
链接: https://arxiv.org/abs/2605.26322
作者: Adam Bawatneh,Sagar Sapkota,Amrit Singh Bedi,Santu Karmaker,Mubarak Shah
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 8 figures, 19 tables; includes appendix
Abstract:Theory of Mind (ToM), the ability to infer others’ knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor’s mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors’ beliefs and shared mental states.
[AI-110] Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
链接: https://arxiv.org/abs/2605.26321
作者: Maksim Ivanov,Abhijay Rana
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to RLEval '26 (Workshop at ACM Conference on AI and Agentic Systems 2026)
Abstract:AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts’ specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at this http URL
[AI-111] Curriculum Learning for Safety Alignment ICML2026
链接: https://arxiv.org/abs/2605.26315
作者: Sandeep Kumar,Virginia Smith,Chhavi Yadav
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the ICML 2026 GlobalSouthML Workshop
Abstract:Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at this https URL.
[AI-112] Intelligent Detection and Mitigation of Carpet-Bombing DDoS Attacks in SDN Using Retrieval-Augmented Generation and Large Language Models
链接: https://arxiv.org/abs/2605.26307
作者: Mohammed N. Swileh,Shengli Zhang,Kai Lei
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Software-Defined Networking (SDN) provides flexible and programmable network management; however, its centralized control architecture remains highly vulnerable to Distributed Denial-of-Service (DDoS) attacks, particularly Carpet-Bombing DDoS attacks that distribute malicious traffic across multiple targets to evade conventional detection mechanisms. In this paper, a Retrieval-Augmented Generation (RAG)-based framework is proposed for real-time detection and mitigation of Carpet-Bombing DDoS attacks in SDN environments. The proposed framework combines interface-level traffic features representation, semantic embedding generation, FAISS-based similarity retrieval, and Large Language Model (LLM)-driven contextual inference to classify traffic behavior without requiring conventional supervised model training or retraining. To evaluate the effectiveness of the proposed framework, extensive experiments were conducted under multiple Carpet-Bombing DDoS attack scenarios with different attack intensities. In addition, two traffic representation strategies, namely structured JSON-based representation and natural language-based representation (NLR), were investigated using multiple state-of-the-art LLMs. The experimental results demonstrate that the proposed framework achieved highly accurate and stable attack detection performance, while the framework configuration utilizing the Gemma-4-31B-IT model achieved the strongest overall detection results. Furthermore, real-time experiments confirmed the capability of the proposed framework to rapidly detect and mitigate Carpet-Bombing DDoS attacks while maintaining stable SDN network operation. The obtained results highlight the effectiveness of integrating RAG mechanisms with LLM for intelligent and adaptive SDN security analysis.
[AI-113] Experiments in Agent ic AI for Science
链接: https://arxiv.org/abs/2605.26305
作者: Judy Fox,Geoffrey Fox
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); High Energy Physics - Phenomenology (hep-ph)
备注:
Abstract:This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering-such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls-we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).
[AI-114] Constraint acquisition needs better benchmarks
链接: https://arxiv.org/abs/2605.26279
作者: Rafał Stachowiak,Tomasz P. Pawlak
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 12 pages, 1 figure, for the associated dataset, see this https URL
Abstract:Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross-study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non-solutions in both integer and continuous domains, alongside natural-language descriptions to support text-to-model methods.
[AI-115] Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
链接: https://arxiv.org/abs/2605.26256
作者: Jeongeun Lee,Chanyoung Park,Dongha Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instruction or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multiomodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or tracking updates in user-specific context over time.
[AI-116] Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
链接: https://arxiv.org/abs/2605.26252
作者: Abdelghny Orogat,Essam Mansour
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.
[AI-117] Unified Neural Scaling Laws
链接: https://arxiv.org/abs/2605.26248
作者: Ethan Caballero,Priyank Jaini,David Krueger,Irina Rish
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.
[AI-118] Can LLM s Introspect? A Reality Check
链接: https://arxiv.org/abs/2605.26242
作者: Shashwat Singh,Tal Linzen,Shauli Ravfogel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model’s own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2605.26242 [cs.AI] (or arXiv:2605.26242v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2605.26242 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-119] Workflow Closure Is Not Scientific Closure in Auto-Research Systems
链接: https://arxiv.org/abs/2605.26200
作者: Shuai Wang,Xinyuan Tian,Pangpang Liu,Yize Zhao
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 26 pages, 1 figure, 2 tables
Abstract:This paper argues that workflow closure is not scientific closure in auto-research systems. Current systems can increasingly complete research-like loops internally, moving from idea generation to experiment execution, writing, and self-evaluation. That achievement is real, but it does not by itself give the resulting outputs scientific standing. We argue that trustworthy auto-research should not aim for autonomous self-sufficiency, but should aim for autonomous execution under non-autonomous epistemic control. Based on a survey of more than 100 recent papers and repositories in this rapidly emerging area, together with a structured audit of 21 representative systems, we diagnose a recurring and structurally connected failure pattern: objective collapse, in which single-proxy targets replace multi-objective scientific aims; validation collapse, in which internal self-evaluation replaces independent validation; and acceptance collapse, in which benchmark scores or publication-shaped artifacts replace mechanisms for domain-level critique, reuse, and integration. These collapses are not inherent limits of autonomy but correctable design choices. Accordingly, we outline potential remedies across objective signal, validation, and output pathway to spark community discussion.
[AI-120] CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
链接: https://arxiv.org/abs/2605.26195
作者: Yihe Fan,Changyi Li,Lichen Xu,Xudong Pan,Jiarun Dai,Hong Geng,Min Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce \textscCyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. \textscCyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate \textscCyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, \textscCyberEvolver improves the seed agent’s success rate by 13.6 ,% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.
[AI-121] Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection KDD2026
链接: https://arxiv.org/abs/2605.26193
作者: Qideng Tang,Dai Chaofan,Wubin Ma,Yahui Wu,Haohao Zhou,Tao Zhang,Huan Li,Dalin Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026
Abstract:Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.
[AI-122] Co-folding model guided by structural proteomics
链接: https://arxiv.org/abs/2605.26192
作者: Alon Shtrikman,Nitzan Simchi,Michal Ran Shchory,Sagie Brodsky,Eran Seger,Kirill Pevzner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
[AI-123] Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series IJCAI2026
链接: https://arxiv.org/abs/2605.26191
作者: Ren Fujiwara,Yasuko Matsubara,Yasushi Sakurai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI 2026
Abstract:This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and the trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents an online framework/method that treats streaming time series as dynamic mixtures of time-delay systems. This framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, this approach constructs a summary system tensor using the system’s Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. This method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.
[AI-124] HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals
链接: https://arxiv.org/abs/2605.26190
作者: Shuwen Yu,William P Marnane,Geraldine B. Boylan,Gordon Lightbody
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Paper submitted to Journal of Engineering Applications of Artifical Intelligence
Abstract:This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50 and fully convolutional networks baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: this https URL.
[AI-125] Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training
链接: https://arxiv.org/abs/2605.26189
作者: Yingying Cheng,Jinquan Shi,Li Zhou,Zhiyang He,Zhaoyi Sun,Fan Zhang,Jie Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes invisible to standard training metrics. We present a systematic study of HiF8 W8A8 QAT for OpenPangu-Embedded-1B through the lens of Delayed Tensor Scaling (DTS). Across eight controlled experiments, we identify and disentangle two orthogonal failure modes: (i)amax saturation, where delayed scale estimates silently corrupt knowledge-sensitive representations via forward-pass clipping, and (ii)catastrophic forgetting, where an aggressive learning rate overwrites pretrained commonsense knowledge independently of quantization. Neither is detectable from training loss alone. We address amax saturation with a conservative max-algorithm DTS strategy over a 64-step history window, and mitigate forgetting via a 500-step BF16 warmup followed by QAT at lr=10^-5. Both fixes are necessary and sufficient: our final configuration achieves 0.43% MMLU drop, 0.58% HellaSwag drop, and 0.22% ARC-Challenge drop versus a matched BF16 baseline, with a training loss APE of only 0.11% over 10,000 steps.
[AI-126] GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
链接: https://arxiv.org/abs/2605.26184
作者: Yuelin Hu,Zhenbo Yu,Zhengxue Cheng,Wei Liu,Li Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, 22 tables
Abstract:Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.
[AI-127] BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization
链接: https://arxiv.org/abs/2605.26182
作者: Zhengyang Ni,Feng Yan,Yu Guo,Fei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
Abstract:Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.
[AI-128] RepoMirag e: Probing Repository Context Reasoning in Code Agents with Perturbations
链接: https://arxiv.org/abs/2605.26177
作者: Hanyu Li,Yichi Zhang,Speed Zhu,Hang Su,Jun Zhu,Yinpeng Dong
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code agents are currently having skillful performance on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the ability to identify the task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that adopts perturbation as a diagnostic tool to increase the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when correct solving requires broader context access. RepoMirage-Extend further turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where the average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Further trajectory analysis reveals an exploration drift, where agents access broader repository context but fail to turn it into effective structure information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover an previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods are potential to improve them.
[AI-129] PitchBench: Measuring Pitch Hearing in Audio-Language Models
链接: https://arxiv.org/abs/2605.26176
作者: Milan Liessens Dujardin,Song-Ze Yu,Craver Corbyn Thomas-Smith,David M. Chan,Karina Nguyen
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
[AI-130] InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
链接: https://arxiv.org/abs/2605.26175
作者: Ke Li,Dong An,Xiaoling Zang,Can Ye,Liang Xie,Qibo Qiu,Chen Shen,Xiaofei He,Wenxiao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [this https URL](this https URL)
[AI-131] Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning
链接: https://arxiv.org/abs/2605.26167
作者: Tianwei Wang,Bryan Chen,Qian Zuo,Qiyue Xia,Xin Li,Wei Pang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Rings and Algebras (math.RA)
备注: Preprint. Under review
Abstract:We propose Lie group embedded dynamical neural networks (LieEDNN) and the corresponding learning algorithms based on gradient descent and metric projection on smooth manifold, where we treat Lie group as an intrinsic representation for continuous symmetry of manifold geometry. Thereby we achieve learnable and stable dynamics on the underlying manifold for general Lie group, and we are able to utilize the powerful representation capability of Lie group such as SO(3) and SE(3) to solve real world engineering problems in areas such as robotics, graphics, and control. Two core challenges are: (i) General Lie groups are incompatible with addition arithmetic, which is necessary for neural network interactions. (ii) The dynamics evolve in the nonlinear representation space of special algebra rather than the normal Euclidean space, which violates the paradigm of common neural ODEs. To address these two challenges, we firstly introduce adjoint Lie group action on the Lie algebra, which induces a linear mapping and transfer to the block-wise structure of weight matrices, such that addition could operate on the Lie algebra as a vector space. Then we parameterize the Lie algebra and the adjoint action as linear transformation so that the architecture is aligned with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on weights, and we develop algorithms to learn the equilibrium with stability guarantees of the temporal neural network dynamics. Experiments are implemented on a specific Lie group SE(3), with the application scenario of telescopic manipulators.
[AI-132] Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning Reliable Pseudo-Labels and Lightweight Architectures
链接: https://arxiv.org/abs/2605.26166
作者: Hanzala Afzaal,Danish Memon,Chouhdary Bilal Raza,Muhammad Khurram Shahzad
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 5 figures; Code available at this https URL
Abstract:The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC-IDS, a state-of-the-art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian-based decision module. We first successfully replicate AOC-IDS on the UNSW-NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo-label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost-BalSamp method achieves 95.45% accuracy on UNSW-NB15, a gain of 6.26% over the baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best-run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%.These results demonstrate that targeted improvements to AOC-IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.
[AI-133] On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach KDD2026 KDD
链接: https://arxiv.org/abs/2605.26162
作者: Jiahui Bai,Hai Dong,A. K. Qin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix
Abstract:Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
[AI-134] SFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models
链接: https://arxiv.org/abs/2605.26161
作者: Hongkai Li,Shifeng Xie,Lefei Shen,Zhuo Li,Mouxiang Chen,Xiaobin Zhang,Han Fu,Jianling Sun,Xiaoxue Ren,Chenghao Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 7 figures, 9 tables
Abstract:Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and often lack corpus documentation. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.
[AI-135] Furina: Frag mented Uncertainty-Driven Refusal Instability Attack ICML2026
链接: https://arxiv.org/abs/2605.26158
作者: Tongxi Wu,Jian Zhang,Yang Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work is accepted as a regular paper at ICML 2026
Abstract:Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: this https URL.
[AI-136] urning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges ICML2026
链接: https://arxiv.org/abs/2605.26156
作者: Xianglin Yang,Bryan Hooi,Gelei Deng,Tianwei Zhang,Jin Song Dong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract:The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge’s score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack’s stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at this https URL.
[AI-137] When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability CVPR2026
链接: https://arxiv.org/abs/2605.26155
作者: Mehmet Haklidir
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)
Abstract:Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent’s uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in 0.01, 0.1, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor’s privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
[AI-138] MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning
链接: https://arxiv.org/abs/2605.26154
作者: Xuanye Zhang,Yongsen Zheng,Zhuqin Xu,Kaiyu Zhou,Bowen Shen,Haoran Ou,Tianwei Zhang,Kwok-Yan Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:LLM-driven agents are capable of selecting external tools to complete users’ tasks. However, attackers could compromise such process, steering agents toward inappropriate/wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that bias tool selection by poisoning the agent’s long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent’s contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.
[AI-139] Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception
链接: https://arxiv.org/abs/2605.26136
作者: Nicolas M. Müller,Wei Herng Choong
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.
[AI-140] GEM: Geometric Entropy Mixing for Optimal LLM Data Curation ICML2026
链接: https://arxiv.org/abs/2605.26121
作者: Yue Min,Ziyun Qiao,Ruining Chen,Yujun Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to ICML 2026
Abstract:LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorization flaws: human taxonomies suffer from ontological misalignment, and Euclidean clustering fails to address embedding anisotropy. We introduce GEM (Geometric Entropy Mixing), a framework reformulating data curation as a variational problem on the hypersphere augmented with a mixing-balance regularizer. By decoupling the generative prior and optimizing the objective via a provable MM (Minorize-Maximize) algorithm, GEM effectively counteracts the cluster collapse to discover balanced semantic structures invisible to Euclidean heuristics. We employ teacher-student distillation to scale this geometric fidelity to web-scale corpora and introduce the Geometric Influence Score (GIS) for interpretable taxonomy generation. Experiments with 1.1B-parameter models demonstrate that GEM establishes a new state-of-the-art when integrated into mixing strategies like DoReMi and RegMix, improving average downstream accuracy by up to 1.2% and offering a robust coordinate system for predictable data mixing.
[AI-141] Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms
链接: https://arxiv.org/abs/2605.26119
作者: Pitchai Muthu M
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, industrial white paper
Abstract:Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long product lifecycles, vendor-specific kernels, heterogeneous accelerators, safety constraints, and nontrivial I/O paths. In that environment, a model is only one component of a larger execution chain that begins at the sensor, traverses the board support package (BSP), and ends in a production service loop. This paper argues that robust Edge AI deployment must be treated as a systems problem rather than a late-stage application packaging exercise. The paper presents a BSP-aware framework for industrial embedded platforms organized around five layers: hardware, BSP/operating-system adaptation, runtime and acceleration, application/inference, and operations/validation. The discussion is grounded in vendor architecture documentation for Android, NXP this http URL, NVIDIA Jetson, ONNX Runtime, and TensorRT, and in systems literature on embedded AI benchmarking, device instability, and heterogeneous edge fleets. The result is a practical framework that connects low-level platform work to measurable deployment outcomes such as reproducibility, diagnosability, sustained throughput, and field reliability.
[AI-142] Xe-Forge: Multi-Stage LLM -Powered Kernel Optimization for Intel GPU
链接: https://arxiv.org/abs/2605.26118
作者: Marcin Spoczynski,Daniel Fleischer,Moshe Berchansky,Gabriela Ben-Melech Stan,Shira Guskin,Weilin Xu,Adam Siemieniuk,Alexander Heinecke
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations – quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds – to every Triton kernel in their code-base. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that vary across devices, yet the underlying optimization patterns remain largely consistent. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages – from algorithmic restructuring and operator fusion through block pointer modernization, GPU-specific tuning, and open-ended discovery – each driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints (power-of-two warp counts, GRF modes, SLM sizing) that are absent from LLM training data, keeping the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and Flash Attention on the Intel Arc Pro B70, achieving a 1.17x geometric mean speedup over PyTorch eager with 67% of kernels improving, nine kernels exceeding 5x (up to 82x), and 2–13.3x speedups on Flash Attention across all tested configurations without regression – demonstrating that structured domain knowledge with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.
[AI-143] A governance horizon for ethical-use constraints in open-weight AI models
链接: https://arxiv.org/abs/2605.24383
作者: Weiwei Xu,Hengzhi Ye,Haoran Ye,Kai Gao,Vladimir Filkov,Minghui Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
备注:
Abstract:Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ( R^2 =0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
[AI-144] Qiskit QuantumKatas: Adapting Microsofts Quantum Computing exercises for LLM evaluation
链接: https://arxiv.org/abs/2605.27210
作者: Juan Cruz-Benito,Ismael Faro
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:We adapt Microsoft’s QuantumKatas – a well-established quantum computing curriculum – from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover’s, Simon’s, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas’ proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations – a total of 39,200 model runs – to demonstrate the benchmark’s utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect – it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.
[AI-145] WIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins
链接: https://arxiv.org/abs/2605.27205
作者: Sige Liu,Kezhi Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:
Abstract:Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited and time-varying communication resources. For perception-centric twins, pixel-domain transmission or uniformly protected bitstreams can be mismatched to the semantic state consumed by twin-side applications. This paper proposes TWIST, a closed-loop token synchronization framework for application-aware wireless digital twins. TWIST represents each physical observation as a token and synchronizes this state over a wireless link, rather than optimizing visual reconstruction. Token positions are grouped by task relevance and protected through mode-conditioned unequal error protection under low-, medium-, and high-synchronization modes. At the twin side, decoding confidence converts unreliable hard token decisions into erasures, which are restored by a completion model before updating the semantic twin state. The recovered state supports traffic-state inference and generates compact feedback statistics, including channel quality, receiver uncertainty, semantic drift, and application priority, for subsequent mode adaptation. Experiments on a dynamic road-scene digital-twin scenario show that TWIST improves traffic-state inference and semantic twin-state synchronization compared with fixed-mode and channel-only adaptation strategies, while reducing the average synchronization cost relative to always-high transmission.
[AI-146] he Sensation Modulating Network:Haltability as the architectural ground for object-directed phenomenology
链接: https://arxiv.org/abs/2605.26856
作者: G. Nagarjuna,Durgaprasad Karnam
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 64 pages, main body 38 pages + References 6, Appendices 20 pages, Tables 3, and Figures 21
Abstract:Cognitive science remains split between cognitivism - which accounts for recursion and language but cannot ground formal symbols in meaning - and 4E approaches - which ground cognition in the body but rarely specify the body’s architecture in enough detail to support generativity. We argue the impasse stems from an incomplete account of the embodied agent’s architecture, and propose one: the Sensation Modulating Network (SMN), the cognitive agent conceived as the whole body, organized at every anatomical scale by opponent dynamics, built from Sensation Modulators that sense and act through one substrate, paired into Coordinated Action Zones routed by a body-wide broadcast network. Three commitments give the SMN its purchase. Haltability - the recruitment of antagonistic affordance into co-activated equilibrium - provides the architectural locus that object-directed phenomenology, in Husserl’s sense, requires: opponency enables co-activation, co-activation enables halt, halt enables attention, attention enables intentional directedness, with no module added on top. The dual-signal property of self-modulatable action patterns (SMAPs) makes the self/world distinction a structural feature of the wiring rather than a category the agent applies. And a four-level action-pattern hierarchy - Basal, Haltable, Negotiable, Transactional - gives a single trajectory from autonomic regularity to public conventionalization, locating the conditions for grammar-grounded generativity as architectural transitions. The SMN reconciles the cognitivism-4E debate: recursion lives in the modifiable dynamics of Negotiable Action Patterns, embodiment in the opponent substrate that supports them. A tentative formalism and eight predicted registers (seven testable, one hypothetical), with reference simulations, are given in an appendix.
[AI-147] MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation
链接: https://arxiv.org/abs/2605.26741
作者: Linhan Wu,Chenxi Wang,Chuhan Yang,Zhengwei Yang,Yuyang Liu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 26 pages
Abstract:Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benchmarks remain limited to forward property prediction, failing to systematically evaluate inverse optimization and generation algorithms, a critical gap that hinders the progress of target-driven materials design. To address this limitation, we propose MatFormBench, a novel benchmarking ecosystem tailored to evaluate and guide generative strategies for target-driven formulation. MatFormBench integrates a physics-driven formulation generation scheme to generate synthetic samples that faithfully emulate realistic materials structure-property response relationships, complemented by five escalating difficulty levels to quantify the complexity of these relationships. To rigorously assess algorithm performance, we further propose MatFormScore, a multi-dimensional metric that comprehensively quantifies performance across five critical axes: target success, search efficiency, exploratory capacity, robustness, and stability. We validate MatFormBench by evaluating 39 diverse inverse design algorithms, covering classical surrogate-assisted black-box search, state-of-the-art deep generative models, and increasingly popular Large Language Model (LLM)-based recommendation strategies. Across 1170 standardized algorithm-task evaluations, diffusion-based models demonstrate the strongest overall performance, while Variational Autoencoder (VAE)-based and Genetic Algorithm (GA)-based methods exhibit distinct advantages in specific scenarios. By establishing a unified evaluation standard for target-driven materials formulation, MatFormBench enables reproducible benchmarking, principled algorithm comparison, and diagnostic analysis of inverse design strategies, providing a foundational tool for advancing materials inverse design.
[AI-148] DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials
链接: https://arxiv.org/abs/2605.26540
作者: Yehudit Aperstein,Alexander Apartsin
机构: 未知
类目: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
备注: 49 pages, 25 figures
Abstract:Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-generators, yet no new HMX-class compound has been disclosed in fifteen years. Designing one is a sparse-label problem: of ~66 k labelled CHNO molecules only ~3 k carry experimental or DFT-quality measurements, and naive generative models trained on the full mixture either memorise the high-performance tail or extrapolate without calibration. We introduce Domain-Gated Latent Diffusion (DGLD): a label-quality gate at training time, multi-task score-model guidance at sample time, and a four-stage chemistry-validation funnel ending in first-principles DFT audit. The result is 12 DFT-confirmed novel leads. The headline compound, 3,4,5-trinitro-1,2-isoxazole (L1), reaches \rho_“cal” =2.09 g/cm3 and D_“K-J,cal” =8.25 km/s and is structurally dissimilar from all 65 980 training molecules (nearest-neighbour Tanimoto 0.27). A co-headline lead, E1 (4-nitro-1,2,3,5-oxatriazole), exceeds L1 on calibrated detonation velocity (D_“K-J,cal” =9.00 km/s) from a chemotype family disjoint from L1’s. DGLD is the only method to land in the productive quadrant (simultaneously novel and on-target) at DFT level. SMILES-LSTM memorises 18.3% of its outputs exactly; SELFIES-GA’s best novel candidate loses 3.5 km/s under DFT audit; REINVENT 4 generates novel high-N heterocycles but peaks at D=9.02 km/s. Code, checkpoints, and 918 mined hard negatives are released on Zenodo (DOI https://doi.org/10.5281/zenodo.19821953)%3B the next compound to enter the HMX-class band can be discovered, validated, and recommended for synthesis at the cost of a few GPU-days.
[AI-149] Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents
链接: https://arxiv.org/abs/2605.26508
作者: Hao-Hsuan Chen
机构: 未知
类目: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI)
备注: 10 pages. Foundational paper of a multi-paper program on actuarial runtime for autonomous AI agents; previously posted on SSRN (id 6761960). Empirical companion: arXiv:2605.25632 . Proof companions included as ancillary files
Abstract:We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consistent, counterfactual risk toll computed against a contractually fixed safe default, inside an explicit underwriting boundary. The framework treats per-action insurance as the primary unit of analysis and replaces post-hoc annual liability cover with a pre-action transaction layer. The paper establishes four structural results: (i) a well-defined counterfactual toll under a chosen safe-default mapping and continuation policy, with explicit non-uniqueness; (ii) a no-splitting property within an underwriting boundary that telescopes path-decomposed actions into a boundary potential, with a corollary tying gaming-resistance to boundary design; (iii) an irreversible-authority premium, split into a strictly positive action-level component and an if-and-only-if characterisation of the set-level robust capital increase; and (iv) a conservative runtime gating theorem that translates high-probability toll envelopes into an executed-action budget guarantee. The result is the mathematical base layer for a broader program: an empirical companion instantiates the runtime through an Actuarial Action Interface and authority-frontier experiments; a mechanism-design companion studies strategic operator incentives and cross-boundary aggregation; and a dynamic-underwriting companion studies experience rating and audit-replay calibration. The present paper states the primitive contract, the toll identity, the within-boundary no-arbitrage result, and the budget guarantee on which those later layers depend.
[AI-150] Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing
链接: https://arxiv.org/abs/2605.26429
作者: Rongyi Sun,Wenguang Sun,Zinan Zhao
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.
[AI-151] Confounder Detection via Treatment Intent: A New Observational Study Design
链接: https://arxiv.org/abs/2605.26413
作者: Drago Plecko,Patrik Okanovic,Torsten Hoefler,Elias Bareinboim
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, making its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed: an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on this newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians’ knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.
[AI-152] Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?
链接: https://arxiv.org/abs/2605.26255
作者: Xiaolei Lu,Shamim Nemati
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health record (EHR)-based models can continuously monitor physiologic deterioration, but they may not fully capture pulmonary pathophysiology reflected in chest radiographs (CXRs). In this study, we ask whether CXR information improves prospective prediction of invasive mechanical ventilation beyond EHR signals alone. We develop a gated multimodal framework that integrates structured EHR time-series data with CXR foundation-model representations. The gating module adaptively controls the contribution of imaging features based on patient-specific clinical context, allowing the model to selectively rely on imaging information when it is informative. We prospectively evaluate the framework for predicting invasive mechanical ventilation within 24 hours in ICU patients and compare it with an established EHR-only model (this http URL), physician predictions obtained at matched clinical time points, and alternative multimodal variants. The gated multimodal models achieved higher discrimination than the EHR-only baseline, with AUROC values of 0.860 and 0.858 using REMEDIS and MedInsight CXR representations, respectively, compared with 0.752 for this http URL. Relative to physician predictions, the multimodal framework substantially improved sensitivity while maintaining favorable specificity. Compared with the EHR-only model, multimodal integration increased specificity and positive predictive value, suggesting that CXR information can refine risk estimation in selected patients. These findings support adaptive multimodal fusion as a practical strategy for incorporating imaging into prospective respiratory failure prediction.
[AI-153] AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations
链接: https://arxiv.org/abs/2605.26179
作者: Penghui Yang,Zhonghan Zhang,Yue Li,Xinrun Wag,Yanchen Deng,Yuhao Lu,Bijun Tang,Zheng Liu,Bo An
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.
机器学习
[LG-0] From Scores to Gibbs Correctors: Accelerating Uniform-Rate Discrete Diffusion Models
链接: https://arxiv.org/abs/2605.27352
作者: Yuchen Liang,Ness Shroff,Yingbin Liang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Discrete diffusion models have achieved strong empirical performance in text and other symbolic domains, but, especially for uniform-rate models, they often require many steps to generate a single sample. Existing acceleration methods either rely on training additional quantities or suffer from slow mixing. In this work, we propose a novel Gibbs-based corrector for discrete diffusion models, termed Gibbs-Accelerated Discrete Diffusion (GADD). GADD leverages the structure of the concrete score function to construct Gibbs posterior likelihoods directly, without requiring any additional training beyond standard score estimation. We show that GADD achieves an overall sampling complexity of \mathcalO(\mathrmpolylog (\varepsilon^-1)) , yielding the first such rate for diffusion-based samplers for uniform-rate discrete diffusion models. We also conduct numerical experiments demonstrating the practical advantages of GADD across synthetic data, zero-shot text sampling, and zero-shot conditional music generation. These results corroborate the theory and show that GADD consistently improves sample quality and wall-clock efficiency over standard baselines, including vanilla Euler methods and CTMC correctors. Beyond this, our theoretical analysis introduces a novel framework for analyzing predictor-corrector methods in discrete diffusion models, which may be of independent interest. Unlike existing approaches that rely on the Girsanov change-of-measure technique, our method is based on an induction argument that tracks error propagation across predictor iterations while accounting for inaccuracies in the corrector updates.
[LG-1] Probabilistic Smoothing with Ratio-Monotone Transforms for Global Optimization
链接: https://arxiv.org/abs/2605.27316
作者: Kukyoung Jang,Taehyun Cho,Junrui Zhang,Ping Xu,Kyungjae Lee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Probabilistic smoothing is a standard tool for global optimization, but existing methods rely on Gaussian kernels and specific transforms, often resulting in strong hyperparameter sensitivity and limited robustness. We propose a general smoothing framework that combines flexible symmetric unimodal kernels with monotonic ratio-based transformations. Under mild conditions, we show that the smoothed objective preserves the global maximizer and that all stationary points concentrate near the true optimum for sufficiently large amplification, without requiring a decreasing smoothing schedule. We further provide explicit complexity bounds for stochastic gradient ascent and show that a leave-one-out baseline provably reduces variance. Experiments on high-dimensional benchmarks and black-box adversarial attacks demonstrate improved robustness and competitive performance.
[LG-2] Greening AI Inference with Accuracy and Latency-aware User Incentives
链接: https://arxiv.org/abs/2605.27309
作者: Vasilios A. Siris,Adamantia Stamou,George D. Stamoulis,Konstantinos Varsos,Ramin Khalili
类目: Machine Learning (cs.LG); Other Computer Science (cs.OH)
*备注:
Abstract:The widespread use of AI services has raised concerns for its environmental sustainability, towards which recent studies have identified carbon emissions of AI inference as the major contributor. This paper introduces a framework for designing AI inference incentives based on the users’ valuation for inference quality and latency, together with their environmental consciousness, while accounting for the tradeoff between carbon emissions and the two QoE parameters. Our approach can accommodate different tradeoffs, that depend on the size and complexity of the AI models and the allocation of resources to serve inference requests. The incentives can be offered through a practical two-tier service subscription that offers users a discount in exchange for reduced carbon emissions. The discounted service option gives the AI provider the flexibility to serve some percentage of inference requests at a lower quality and higher latency during periods of high carbon intensity.
[LG-3] Normal Guidance is what Attention Needs
链接: https://arxiv.org/abs/2605.27306
作者: Ethan Harvey,Dennis Johan Loevlie,Michael C. Hughes
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider training classifiers for 3D medical images using only one binary label for the entire volume rather than a label for each 2D slice. In such weakly supervised settings, can we learn accurate classifiers for slice-level predictions? Attention-based multiple instance learning (MIL) can produce an attention score for every slice. Yet recent work demonstrates that a simple center-focused baseline that ignores image content can outperform attention-based and transformer-based MIL at slice-level classification of 3D brain scans. We show this baseline also outperforms existing MIL at slice-level classification of thoracic and abdominal CT scans. Motivated by this baseline, we propose Normal Guidance, a regularization technique that encourages the learned attention distribution to follow a bell-shaped curve. Across three medical imaging datasets totaling over 4 million 2D slices, we show our Normal Guidance enables attention-based and transformer-based MIL methods to deliver significantly better slice-level localization than the state-of-the-art while remaining competitive at whole-scan classification.
[LG-4] BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
链接: https://arxiv.org/abs/2605.27293
作者: Shijin Gong,Erhan Xu,Kai Ye,Francesco Quinzan,Giulia Livieri,Chengchun Shi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 17 pages, 7 figures
Abstract:Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
[LG-5] Detectability in Diversity: Improved Canary Crafting for Privacy Auditing in One Run
链接: https://arxiv.org/abs/2605.27292
作者: Mathieu Dagréou,Aurélien Bellet
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Privacy auditing aims to empirically assess privacy leakage in machine learning models using membership inference attacks (MIAs), and to derive lower bounds on differential privacy (DP) parameters. Recent one-run auditing methods address the high cost of standard approaches by relying on a single training run with multiple “canary” points whose inclusion or exclusion must be detected by the auditor. In this work, we study the problem of efficiently crafting canaries for one-run privacy auditing. Motivated by recent theoretical insights suggesting that interference between canaries contributes to weaker leakage estimates compared to multi-run methods, we propose to optimize canaries to be both highly detectable and minimally interfering. Our approach combines a greedy initialization based on influence functions with a bilevel optimization procedure that maximizes distinguishability while promoting diversity in embedding space, enabling the use of computationally efficient bilevel algorithms. Experiments show that our method achieves stronger privacy leakage estimates at a lower computational cost than existing canary crafting approaches.
[LG-6] Causal Risk Minimization for High-Dimensional Treatments
链接: https://arxiv.org/abs/2605.27281
作者: Nikita Dhawan,Arnav Paruthi,Andrew Kim,Lovedeep Gondara,Jekaterina Novikova,Chris J. Maddison
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 4 figures
Abstract:Predicting the effect of interventions with many possible variations, e.g., therapeutic content that affects mental health outcomes or an earnings call transcript that drives movement in share price, is useful across several domains. However, classical causal estimators tend to assume that all possible interventions are observed, which is infeasible when interventions vary widely, for instance, in the space of all text strings. We adapt a well-known approach of recasting causal inference as a learning problem, to address high-dimensional treatment spaces. Specifically, under standard assumptions like no unobserved confounding, we show that causal error decomposes into a series of moment-balancing errors of increasing order, and design objectives that directly improve causal estimation. We also show how to project the effect of a high-dimensional treatment onto lower-dimensional treatment attributes, which allows a single model to answer several causal questions without additional attribute-specific training. We empirically evaluate our estimators in settings with high-dimensional continuous, discrete, and text treatments, the last of which used a semi-synthetic dataset of Amazon Reviews. Our experiments demonstrate the benefit of higher-order balance error optimization and competitive performance of projected causal estimates with attribute-specific estimators.
[LG-7] ransfer Learning using 66 Diseases for Disease Forecasting Applications
链接: https://arxiv.org/abs/2605.27269
作者: Lauren J Beesley,Alexander C Murph,Dave Osthus,Lauren A Castro
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease using data from a different disease. We expand upon each of these approaches greatly, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating different data streams for forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of time series and model structures considered. However, our work highlights that the quality of the added data matters, where adding data extremely different from the target data stream can sometimes degrade forecast performance. A major contribution of this work is in compiling a publicly-available database of data for use by the infectious disease forecasting community.
[LG-8] Kan Extension Transformers: A Categorical Unification of Attention Diffusion and Predict-Detach Self-Conditioning
链接: https://arxiv.org/abs/2605.27259
作者: Sridhar Mahadevan
类目: Machine Learning (cs.LG)
*备注: 30 pages
Abstract:We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 different Transformer implementations varying across strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.
[LG-9] Symbolic Regression via Latent Iterative Refinement
链接: https://arxiv.org/abs/2605.27245
作者: Xieting Chu,Sriram Vishwanath,Vijay Ganesh
类目: Machine Learning (cs.LG)
*备注: Preprint. 21 pages, 11 figures
Abstract:Symbolic regression (SR) seeks closed-form mathematical expressions that fit observed data. Neural SR methods amortize the search by training an encoder to map observations directly to expressions in a single pass, but this amortized inference leaves a residual amortization gap between its one-shot prediction and the true posterior. We propose Latent Equation Embedding (LEE), a framework that closes this gap through iterative amortized inference in a functionally grounded latent space. LEE learns a shared latent space Z equipped with three components: an encoder f_theta that jointly embeds symbolic tokens and numerical observations into a single latent vector z; an expression decoder g_expr that reconstructs formulas from z; and an evaluation decoder g_eval that predicts function values from z, explicitly grounding the latent space in functional behavior. At inference, LEE performs iterative refinement by re-encoding decoded expressions jointly with observations, progressively improving the latent estimate. LEE uses the encoder itself as a learned inference optimizer: each re-encoding step implicitly computes the mismatch between the candidate and the data. Because g_eval is differentiable in z, we additionally interleave continuous gradient descent with discrete re-encoding, yielding a hybrid iterative and gradient refinement procedure. On SRBench across three noise levels, against 19 baselines spanning genetic programming, symbolic-neural hybrids, and pre-trained Transformers, LEE produces expressions 2–10x simpler than the strongest accuracy-oriented baselines, including Operon, GP-GOMEA, TPSR, RAG-SR, and GenSR, with complexity 8–11 versus 20–90. These results advance the low-complexity region of the accuracy-complexity Pareto frontier and show graceful degradation as noise increases.
[LG-10] Explainable Comparison of Feature-Based and Deep Learning Models for TROPOMI Methane Plume Screening
链接: https://arxiv.org/abs/2605.27236
作者: Solomiia Kurchaba,Joannes D. Maasakkers,Berend J. Schuit,Ilse Aben
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:
Abstract:Continuous and global detection of large methane emissions is a crucial step for global warming mitigation. Satellite observations, such as from S5P/TROPOMI, combined with plume detection algorithms, can play a key role in this effort. However, not all TROPOMI plume detections that look like methane emission plumes are the result of actual emissions. A significant part of the plume-like features in the data are retrieval artifacts. Such artifacts could be the result of variations in elevation or albedo gradients, high concentrations of aerosols, coastal lines, water bodies, etc. Previous work approached the problem of plume-artifact classification by means of a Support Vector Machine Classifier (SVC), trained on an extensive set of observation-based scalar features designed by domain experts. However, such an approach limits the information scope received by the algorithm to what is deemed to be important by the experts, breaks the spatial relationship between pixels, and loses information during the process of statistical aggregation. In this study, we compare feature-based (SVC, Random Forest, XGBoost) and image-based (ResNet-18, ResNet-34) models for methane plume-artifact classification under balanced and imbalanced evaluation settings. To interpret the results, we apply SHAP-based explainability to both model families. Our findings provide practical guidance for model selection in operational methane-screening workflows such as the CAMS Methane Hotspot Explorer.
[LG-11] Nonlinear Data Integration via Kernel Methods for Data Collaboration Analysis
链接: https://arxiv.org/abs/2605.27219
作者: Yamato Suetake,Yuta Kawakami,Shunnosuke Ikeda,Yuichi Takano
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 50 pages, 7 figures
Abstract:Collaborative analysis of decentralized confidential datasets is important, but direct sharing of original datasets is often restricted by privacy and institutional constraints. Data collaboration (DC) analysis transforms each dataset into privacy-preserving intermediate representations via party-specific obfuscation functions and integrates them into common collaboration representations using an anchor dataset. However, many existing DC analysis methods rely on linear transformations for data obfuscation and integration, which may increase reconstruction risk. Although nonlinear dimensionality reduction can mitigate this risk, conventional linear integration methods cannot accurately align intermediate representations produced by nonlinear transformations. Moreover, existing integration methods mainly minimize discrepancies among parties and do not explicitly incorporate geometric or target-variable information useful for downstream analysis. To overcome these limitations, we first formulate linear kernel integration (LKI) as a linear integration method and then kernelize it to obtain nonlinear kernel integration (NKI). NKI admits a globally optimal solution via kernel ridge regression and an eigenvalue problem. We also introduce graph regularization and a centering constraint so that the target representation can capture geometric and target-variable information useful for downstream analysis. Experiments on image classification tasks demonstrate that NKI improves classification accuracy over existing linear integration methods under nonlinear dimensionality reduction, with further gains from target-variable-aware graph regularization and centering. The results also show that dimensionality reduction choices substantially affect both classification accuracy and reconstruction risk.
[LG-12] he Role of Causal Features in Strategic Classification for Robustness and Alignment AISTATS2026
链接: https://arxiv.org/abs/2605.27163
作者: Antonio Gois,Sophia Gunluk,Nir Rosenfeld,Nidhi Hegde,Simon Lacoste-Julien,Dhanya Sridhar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AISTATS 2026. 20 pages, 5 figures
Abstract:In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to understand when causal classifiers have an advantage. Finally, we show that the use of causal features can allow alignment of long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.
[LG-13] Mildly Overparameterized ReLU Networks on Orthogonal Data: Incremental Learning and Implicit Bias
链接: https://arxiv.org/abs/2605.27097
作者: James Town,Etienne Boursier,Ben Lewis,Matthias Englert,Ranko Lazic
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 66 pages, 6 figures
Abstract:The successful training of neural networks hinges on the use of first order optimization methods, yet the theoretical characterization of these methods remains incomplete. This is especially true in settings with mild overparameterization. In this work, we study the gradient flow dynamics of two-layer ReLU networks from small initialization with orthogonal training data. We prove the limiting flow converges to a saddle-to-saddle jump process as the initialization scale tends to zero, revealing an incremental learning phenomenon in which a new neuron activates at each saddle. This analysis recovers the known result of Dana et al. (2025, arXiv:2502.16977) that the network interpolates the training data with high probability as soon as m \gtrsim \log(n) , where m is the network width and n is the number of training samples. This incremental process characterization also allows us to derive a novel implicit bias result: the learned interpolator has a squared \ell_2 -norm scaling as \sqrtn , which is within a constant factor of the minimal \ell_2 -norm interpolator. More broadly, our work provides the first rigorous proof of an incremental learning process for ReLU networks, whilst suggesting mildly overparameterized networks can converge to interpolating solutions whose complexity is of the same order as that of the optimal interpolator.
[LG-14] Adversarial Dual On-Policy Distillation from Expressive Flow-based Teacher
链接: https://arxiv.org/abs/2605.27095
作者: Zhenglin Wan,Jingxuan Wu,Xingrui Yu,Chubin Zhang,Mingcong Lei,Bo An,Ivor W. Tsang,Yang You
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose \textbfFA-OPD, an \emphadversarial dual on-policy distillation method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations.
[LG-15] Learning to Orchestrate Agents under Uncertainty
链接: https://arxiv.org/abs/2605.27073
作者: Mary Chriselda Antony Oliver,Lan Jiang,Aaron Bundi Anampiu,Elaf Almahmoud,Francesco Quinzan,Umang Bhatt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adaptive orchestration of heterogeneous agents requires making sequential delegation decisions under uncertain and evolving agent behaviour, e.g., coordinating specialised AI models with varying reliability, cost, and response quality. While prior work on agent orchestration focuses on performance or cost, uncertainty in agent reliability and output distributions is typically not modelled explicitly at the orchestration level. In this work, we study the problem of adaptive orchestration of heterogeneous agents under uncertainty, where a meta-controller must decide when to delegate to an agent, accounting for reliability, cost, and uncertainty. We propose BOT-Orch, a lightweight framework that recasts orchestration as a bandit problem over agents, regularized by OT distances between agent output distributions and task-specific reference distributions. We show that the regularised orchestration enjoys \mathcalO(\sqrtT) regret under standard assumptions, and provably induces preference ordering among agents with identical mean rewards but differing distributional alignment. Empirically, we demonstrate that BOT-Orch outperforms standard bandit and heuristic baselines in synthetic but adversarial task allocation settings with heterogeneous, non-i.i.d. agent behaviour.
[LG-16] Learning Dynamic Graph Representations through Timespan View Contrasts
链接: https://arxiv.org/abs/2605.27063
作者: Yiming Xu,Zhen Peng,Bin Shi,Xu Hua,Bo Dong
类目: Machine Learning (cs.LG)
*备注: Accepted by Neural Networks
Abstract:The rich information underlying graphs has inspired further investigation of unsupervised graph representation. Existing studies mainly depend on node features and topological properties within static graphs to create self-supervised signals, neglecting the temporal components carried by real-world graph data, such as timestamps of edges. To overcome this limitation, this paper explores how to model temporal evolution on dynamic graphs elegantly. Specifically, we introduce a new inductive bias, namely temporal translation invariance, which illustrates the tendency of the identical node to keep similar labels across different timespans. Based on this assumption, we develop a dynamic graph representation framework CLDG that encourages the node to maintain locally consistent temporal translation invariance through contrastive learning on different timespans. Except for standard CLDG which only considers explicit topological links, our further proposed CLDG++ additionally employs graph diffusion to uncover global contextual correlations between nodes, and designs a multi-scale contrastive learning objective composed of local-local, local-global, and global-global contrasts to enhance representation capabilities. Interestingly, by measuring the consistency between different timespans to shape anomaly indicators, CLDG and CLDG++ are seamlessly integrated with the task of spotting anomalies on dynamic graphs, which has broad applications in many high-impact domains, such as finance, cybersecurity, and healthcare. Experiments demonstrate that CLDG and CLDG++ both exhibit desirable performance in downstream tasks including node classification and dynamic graph anomaly detection. Moreover, CLDG significantly reduces time and space complexity by implicitly exploiting temporal cues instead of complicated sequence models.
[LG-17] SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures
链接: https://arxiv.org/abs/2605.27027
作者: Víctor Carballo,Júlia López-Closa,Mario Martin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios. Subjects: Machine Learning (cs.LG) ACMclasses: I.2.6; I.2.1 Cite as: arXiv:2605.27027 [cs.LG] (or arXiv:2605.27027v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.27027 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-18] SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception
链接: https://arxiv.org/abs/2605.27009
作者: Ziqi Zhang,Eunyeong Jin,Miguel Vasco,Farzaneh Taleb,Nona Rajabi,Alexandra Gutmann,Jonathan Williams,Antônio H. Ribeiro,Danica Kragic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predicting human olfactory perception from molecular structure has seen remarkable progress, yet these approaches require explicit chemical structure at inference, which is not available in practical sensing settings. We address this gap by exploring direct electron ionization mass spectrometry (EI-MS), a sensing technique that acquires chemically informative fragmentation fingerprints in seconds, as an alternative input modality for olfactory prediction. We contribute Spectrum-to-Chemical Embedding alignmeNT (SCENT), a multi-modal contrastive learning framework that aligns EI-MS representations with pretrained chemical structure embeddings, while requiring only mass spectra at inference. On the multi-label odor descriptor prediction task, SCENT significantly outperforms MS-only baselines and achieves performance comparable to structure-based models, despite requiring no explicit molecular structure at test time. The learned representations also better approximate continuous human perceptual ratings and generalize to real-world lab-measured spectra, suggesting that cross-modal alignment is an effective strategy for grounding analytical spectra in chemical semantics.
[LG-19] Sampling Data with Chains of Forward-Backward Diffusion Steps
链接: https://arxiv.org/abs/2605.27006
作者: Hyunmo Kang,Noam Itzhak Levi,Corinna Elena Wegner,Daniel J. Korchinski,Matthieu Wyart
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
*备注:
Abstract:Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient – signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.
[LG-20] Probabilistic Recurrent Intention Switching Model
链接: https://arxiv.org/abs/2605.26998
作者: Wenyuan Sheng,Hao Zhu,Joschka Boedecker
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an \mathcalO(nK) E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData~V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.
[LG-21] ED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph
链接: https://arxiv.org/abs/2605.26984
作者: Yiming Xu,Bin Shi,Bo Dong,Jiaxiang Wang,Hua Wei,Qinghua Zheng
类目: Machine Learning (cs.LG)
*备注: Accepted by Data Mining and Knowledge Discovery (DMKD25)
Abstract:Tax evasion causes severe losses of government revenues and disturbs the economic order of fair competition. To help alleviate this problem, the latest tax evasion detection solutions utilize expert knowledge to extract features and then train classifiers to determine whether a company is suspected of tax evasion. However, existing solutions mainly focus on the statistical features of the company, but fail to exploit the rich interactive information in tax scenarios, which affect the detection performance. In this paper, we first model the tax scenario as a heterogeneous graph and study the tax evasion detection problem under the heterogeneous graph model. To improve the performance of tax evasion detection, a novel graph neural network model is proposed to extract the comprehensive information of heterogeneous graphs. Specifically, we use heterogeneous and complex related party transaction groups to filter low-level noise information. Moreover, a hierarchical attention mechanism is designed to capture the deeper structure and semantic information hidden in the related party transaction group. We apply our method to the real risk management system of the tax bureau, and evaluate it on two human-labeled real-world tax datasets. The results demonstrate that our method significantly outperforms the state-of-the-art in the tax evasion detection task.
[LG-22] Convergence of Spectral Descent for Non-smooth Optimization
链接: https://arxiv.org/abs/2605.26977
作者: Yixuan Yang,Yuqing He,Song Li
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
[LG-23] RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data
链接: https://arxiv.org/abs/2605.26971
作者: Hsiu-Yuan Huang,Weijie Liu,Chenming Tang,Sanwoo Lee,Kai Yang,Yangkun Chen,Saiyong Yang,Yunfang Wu
类目: Machine Learning (cs.LG)
*备注: 7 figures, 12 tables
Abstract:The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample’s marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data is available at this https URL.
[LG-24] When Muon Optimizer Meets Adversarial Training: A Theoretical and Empirical Study
链接: https://arxiv.org/abs/2605.26929
作者: Jun Yan,Weiquan Huang,Jiankai Zuo,Yujian Mo,Xi Fang,Chengliang Wu,Zeming Wei
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adversarial training (AT) remains one of the most reliable empirical defenses against adversarial attacks. Its robustness critically depends on how the underlying min-max objective is optimized. In practice, Stochastic Gradient Descent (SGD) optimizer remains the default optimization choice for AT, whereas adaptive optimizers often improve standard training but may yield inferior robustness. Recently, the Muon optimizer, which orthogonalizes matrix-valued updates via an approximate polar decomposition, has achieved notable success in large-scale training at a memory cost comparable to SGD. This raises a security-relevant question: \textitcan orthogonalized optimization improve AT under strong and heterogeneous threat models? Focusing on this problem, we conduct a comprehensive theoretical and empirical study. Theoretically, we show that Muon imposes a spectral-norm stability ceiling on matrix updates, limiting uncontrolled spectral growth in the training dynamics without explicitly shrinking the learned weights. Empirically, across five architectures and three \ell_p threat models ( \ell_\infty , \ell_1 , \ell_2 ) and their union, Muon is competitive with SGD on CNNs and substantially outperforms AdamW on both CNNs and ViTs. These results identify optimizer geometry as a security-relevant factor in adversarial training, while clarifying the empirical regimes in which orthogonalized updates are beneficial. Overall, our findings highlight optimizer design as a security-critical component of AT.
[LG-25] Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates KDD2026
链接: https://arxiv.org/abs/2605.26919
作者: Kei Takemura,Ryuta Matsuno,Keita Sakuma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to KDD 2026
Abstract:Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., O(1) ). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to \Theta(T) , where T is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains O(\log T) , allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.
[LG-26] SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings
链接: https://arxiv.org/abs/2605.26900
作者: Léo Nicollier(CB, ATT),Max Dunitz(CB, ATT),Marc Pic(ATT),Pablo Musé(CB, IFUMI),Enric Meinhardt-Llopis(CMLA, CB),Gabriele Facciolo(CB)
类目: Machine Learning (cs.LG)
*备注:
Abstract:A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA’s Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks-including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).
[LG-27] Parsimonious Learning-Augmented Online Metric Matching ICML2026
链接: https://arxiv.org/abs/2605.26886
作者: Yongho Shin,Phanu Vajanopath
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: To appear in ICML 2026
Abstract:Learning-augmented algorithms have received significant attention in recent years, particularly in the context of online optimization. Motivated by the high computational cost of generating predictions, a growing line of work studies the tradeoff between performance guarantees and the number of predictions used in learning-augmented algorithms for problems such as caching and metrical task systems. In this paper, we extend this line of research to online metric matching by developing parsimonious learning-augmented algorithms and establishing lower bounds on their performance. Our approach extends the Follow-the-Prediction framework to the parsimonious setting by filling in a virtual prediction in the absence of an actual prediction, using an online metric matching algorithm that maintains good intermediate matchings throughout its execution. We complement our theoretical results with an empirical evaluation, demonstrating the practical effectiveness of our approach.
[LG-28] Generalist Graph Anomaly Detection via Prototype-Based Distillation ICML2026
链接: https://arxiv.org/abs/2605.26857
作者: Yiming Xu,Zihan Chen,Zhen Peng,Song Wang,Bin Shi,Bo Dong,Chao Shen
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Driven by the pressing demand for graph anomaly detection (GAD) in high-stakes domains, the generalist GAD paradigm, which trains a single detector transferable across new graphs, has recently gained growing attention. However, existing methods often rely on scarce and costly annotations for training and sometimes even require few-shot support at inference, which limits their robustness to diverse and unseen anomaly patterns. To address this limitation, we introduce ProMoS, the first unsupervised generalist GAD framework, which detects anomalies by modeling the abundant normality in unlabeled data. ProMoS adopts a knowledge-distillation paradigm to distill normality priors from a frozen self-supervised graph neural network (GNN) teacher to a mixture-of-students model with shared global and lightweight personalized branches, enabling efficient and expressive normality modeling without learning from scratch. We further propose prototype-guided soft-label distillation to align teacher and student in a shared prototype space, enhancing cross-graph generalizability. During inference, ProMoS performs zero-shot anomaly detection on unseen graphs via distillation bias and prototype geometric deviation. Extensive experiments show the effectiveness and efficiency of ProMoS, charting a practical path toward label-free, zero-shot generalist GAD.
[LG-29] RAPNet: Accelerating Algebraic Multigrid with Learned Sparse Corrections
链接: https://arxiv.org/abs/2605.26854
作者: Yali Fink,Ido Ben-Yair,Lars Ruthotto,Eran Treister
类目: Machine Learning (cs.LG)
*备注: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea Code available at this https URL
Abstract:The scalable solution of large sparse linear systems is a bottleneck in scientific computing and graph analysis. While algebraic multigrid (AMG) offers optimal linear scaling, its performance is severely constrained by the trade-off between the sparsity and convergence quality of coarse-grid operators. Classical AMG heuristics struggle to balance these objectives, often sacrificing stability or performance for sparsity. We propose RAPNet, a graph neural network (GNN) framework that resolves this trade-off by learning to generate sparse, robust coarse operators directly from the sparse algebraic system. Key to our approach is a level-wise training strategy that enables learning from small subgraphs and generalization to million-node domains, bypassing the bottlenecks of prior neural AMG attempts. RAPNet executes exclusively during the solver setup phase, ensuring that the solve phase retains its favorable computational properties. We show that our method outperforms classical non-Galerkin baselines on diverse PDE discretizations and graph Laplacians, making it particularly effective for multi-query tasks such as eigenproblems, time-dependent simulations, and inverse or design problems.
[LG-30] Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences
链接: https://arxiv.org/abs/2605.26850
作者: Hanlin Yu,RuiKang OuYang,Partha Kaushik,Arto Klami,Michael U. Gutmann,Omar Chehab
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning an energy-based model from data samples is a central problem in machine learning. Many recent and popular methods, such as denoising score matching for training energy-based diffusion models, use stochastic interpolants to corrupt data samples at different noise levels indexed by a time variable. This defines a joint density over both the data space and time, and most methods learn its energy through either spatial or temporal differences. We identify distinct failure modes for both of these approaches. To solve them, we propose Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework for learning the energy through joint spatiotemporal differences. stNCE unifies many existing methods and leads to new training objectives. Experiments on images and molecules demonstrate performance competitive with state-of-the-art density estimation methods.
[LG-31] Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation
链接: https://arxiv.org/abs/2605.26844
作者: Yuanyi Wang,Su Lu,Yanggan Gu,Pengkai Wang,Yifan Yang,Zhaoyi Yan,Congkai Xie,Jianmin Wu,Hongxia Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student’s top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student’s current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
[LG-32] PATE-TabTransGAN: Differentially Private Synthetic Tabular Data Generation via Transformer-Based Student Discrimination
链接: https://arxiv.org/abs/2605.26802
作者: M. Youssef,M. Woźniak
类目: Machine Learning (cs.LG)
*备注: 16 pages, 3 figures, 4 tables. Submitted for publication
Abstract:Generating high-fidelity synthetic tabular data under formal differential privacy guarantees remains an open challenge. Methods that provide strong theoretical protection typically sacrifice the modeling of inter-feature dependencies required for realistic synthesis, while architectures that excel at capturing complex column relationships offer only empirical privacy guarantees. We present PATE-TabTransGAN, a generative framework that integrates the Private Aggregation of Teacher Ensembles (PATE) mechanism with a Transformer-based student discriminator to jointly address both requirements, and employs a GNMax RDP accountant for numerically stable privacy accounting. An ensemble of Logistic Regression teachers trained on disjoint partitions supervise the student via noisy-aggregated labels, and a residual generator is optimized against this differentially private student, inheriting formal (\epsilon, \delta)-DP guarantees by post-processing. PATE-TabTransGAN was compared with PATE-GAN, DP-GAN, and DP-CTGAN, considered state-of-the-art in differentially private tabular synthesis. Experiments conducted on four tabular benchmarks (Adult, Breast, Cardio, Cervical) confirmed the high quality of the proposed method: PATE-TabTransGAN attains the best or tied-best AUROC on all four datasets. On AUCPR it matches the strongest baseline on Cardio, leads on Cervical, and trails on Breast; on Adult, we demonstrate that AUCPR is highly sensitive to positive-class convention, and that the observed gap is consistent with a convention difference between evaluation pipelines rather than a synthesis deficit.
[LG-33] Pretrained Approximators for Low-Thrust Trajectory Cost and Reachability
链接: https://arxiv.org/abs/2605.26790
作者: Zhong Zhang,Giacomo Acciarini,Dario Izzo,Hexi Baoyin,Francesco Topputo
类目: Machine Learning (cs.LG); Space Physics (physics.space-ph)
*备注: Submitted to the Journal of Guidance, Navigation and Control
Abstract:Low-thrust trajectory design relies heavily on repeated evaluations of fuel consumption and transfer feasibility, which require expensive optimal control solutions. In this work, we show these quantities can be accurately approximated by machine learning surrogates, enabling fast and scalable evaluation across a wide range of scenarios. By increasing both dataset size and model capacity, we observe that low-thrust trajectory optimization follows a scaling law, with performance improving linearly with the logarithm of training data and network parameters, and no evidence of saturation within the explored regime. Guided by this observation, we construct a large-scale dataset using the proposed homotopy-ray strategy tailored to mission design requirements. A key is the introduction of a self-similar transformation, which allows generalization across semi-major axes, inclinations, and central bodies avoiding retraining. As a result, the same neural approximator can be applied to diverse orbital environments and mission classes. The proposed models accurately predict optimal fuel consumption and minimum transfer time for single- and multi-revolution transfers. Their performance and generalization are demonstrated on a public dataset, a multi-asteroid flyby problem from the Global Trajectory Optimization Competition, and an asteroid rendezvous mission design. The models and datasets are released as open-source to support the space community.
[LG-34] me Series Causal Discovery via Context-Conditioned and Causality-Augmented Pretraining NEURIPS2026
链接: https://arxiv.org/abs/2605.26759
作者: Biao Ouyang,Tengxue Zhang,Zhihao Zhuang,Yang Shu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG)
*备注: Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026). 27 pages
Abstract:Causal discovery from time series is critical for many real-world applications, such as tracing the root causes of anomalies. Existing approaches typically rely on dataset-specific optimization, making it difficult to transfer their causal discovery capabilities to new time series governed by diverse causal mechanisms. In this paper, we propose \textbfPTCD, a novel \textbfPretraining framework for \textbfTime-series \textbfCausal \textbfDiscovery, which improves cross-task generalization through context-conditioned modeling and transferable causal augmentation. To model complex temporal causal dependencies, PTCD employs a dual-scale iterative attention mechanism to capture window-level causal relationships, and a Gaussian mixture with a context-level routing mechanism to handle heterogeneous exogenous distributions. To further address distribution shifts across causal graphs, PTCD adopts a pretraining paradigm on synthetic datasets that integrates intervention-based learning and a causal mixup strategy, promoting stable causal discovery and stronger generalization. Extensive experiments on multiple real-world out-of-distribution (OOD) datasets demonstrate that PTCD excels in both causal discovery and root cause identification.
[LG-35] Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences ICML2026
链接: https://arxiv.org/abs/2605.26756
作者: Gwangho Kim,Sungyoon Lee
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Diffusion models can unintentionally memorize training samples, raising concerns about privacy and copyright. While recent methods can detect memorization, they often rely on global or model-specific signals and provide limited insight into where memorization appears within a generated image. We provide a geometric characterization of local memorization as a coordinate-wise variance collapse. However, such collapse can also arise from intrinsic data constraints rather than overfitting. To isolate overfitting-driven memorization, we propose curvature-difference methods that subtract the curvature of an underfitted baseline, either the unconditional model or a less-trained version of itself. We further derive a score-difference proxy that provides a geometric explanation for the widely used score-difference-based detection metric. Experiments on Stable Diffusion, evaluated against ground-truth memorization masks, show that our method outperforms the prior attention-based localization method. Code is available at this https URL.
[LG-36] APEX: Amplitude Anchors and Phase Priors for Target-Scarce Higher-Frequency Wave Prediction
链接: https://arxiv.org/abs/2605.26732
作者: Yifan Sun,Lei Cheng,Sijie Chen,Ting Zhang,Jianlong Li,Shikai Fang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning-based surrogates have become increasingly effective for wave-field prediction, and neural operators in particular have shown strong performance within observed frequency regimes. However, higher-frequency prediction under scarce target supervision remains comparatively underexplored, especially in wave problems where higher-frequency data are substantially more expensive to simulate or measure than lower-frequency data. A central difficulty is that cross-frequency transfer is inherently asymmetric: coarse amplitude structure remains relatively stable across frequencies, whereas phase-sensitive oscillatory structure deteriorates much more rapidly as frequency increases. Motivated by this asymmetry, we propose APEX, Amplitude-anchored and Phase-prior-guided Enhancement from eXtrapolated coarse predictions, a framework for target-scarce higher-frequency wave-field prediction. A lower-frequency neural operator first provides a coarse prediction in the target-frequency regime, from which we retain only the amplitude as a transferable structural anchor. A conditional flow-matching enhancer then reconstructs the target higher-frequency field under the guidance of a Green’s-function-inspired phase prior. Experiments on SimpleWave, Helmholtz, and Maxwell benchmarks show that APEX consistently outperforms direct lower-to-higher extrapolation, target-adapted operator, and joint generative baselines under limited target-frequency supervision. Our results suggest that reliable higher-frequency prediction of oscillatory wave fields should not rely on direct end-to-end transfer of the full complex field, but instead on explicitly reusing transferable coarse structure while separately recovering the missing oscillatory detail.
[LG-37] MTL-FNO: A Lightweight Multi-Task Fourier Neural Operator for Sparse Field Reconstruction
链接: https://arxiv.org/abs/2605.26718
作者: Siyu Ye,Shihang Li,Zhiqiang Gong,Benrong Zhang,Weien Zhou,Yiyong Huang,Wen Yao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Efficient onboard multi-field sparse reconstruction is essential for the autonomous operation of aerospace vehicles. While existing deep learning models exhibit promise for single-field reconstruction, deploying multiple independent models leads to prohibitive model size growth and fails to exploit cross-field correlations, particularly under few-shot conditions. To address these challenges, we first propose a lightweight multi-task Fourier neural operator (MTL-FNO), an end-to-end joint training framework based on hard parameter sharing. In each layer, the parameters are divided into shared and task-specific components to capture common features across fields while preserving task-specific characteristics. Moreover, the task-specific fine-tuning parameters are implemented as low-rank terms, achieving substantial model compression. Second, to address the difficulty of co-optimizing shared and task-specific parameters along with their real and imaginary parts, we revisit the FNO’s spectral weight from a polar-form perspective and devise a physically meaningful decoupled optimization scheme. Specifically, we apply polar decomposition to slice-wise disentangle the spectral weight into a unitary tensor encoding phase information and a positive semi-definite tensor characterizing amplitude. By decoupling the optimization of phase and amplitude, our method can effectively mitigate tasks conflict. Meanwhile, to preserve unitary geometric fidelity during training, the Cayley transform is introduced to reparameterize the unitary tensor, converting the constrained optimization problem to an unconstrained one. Finally, the effectiveness of the proposed method under few-shot conditions is validated on two representative engineering cases. Results show that MTL-FNO achieves accuracy comparable to or even surpassing that of standard FNO, while reducing total model size by 76% and 60%, respectively.
[LG-38] Image Feature Fusion-based Federated Client Unlearning (FCU)
链接: https://arxiv.org/abs/2605.26715
作者: Hangyi Shen,Yizhi Pan,Tiansuo Li,Weiqi Jiang,Guanqun Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Major data protection regulations all mention the “right to be forgotten,” and that’s what pushed federated unlearning (FU) techniques forward. But one stubborn issue remains: catastrophic forgetting–you erase the target knowledge, yet somehow you also end up throwing out essential retained knowledge, which then hurts the model’s global generalization. To get a better balance between unlearning effectiveness and generalization ability, we propose something called Image Feature Fusion-based Federated Client Unlearning (IFF-FCU). The idea is to bring in a linear Image Feature Fusion mechanism (Mixup) that dynamically creates mixed samples, bridging the gap between forget-distribution and retain-distribution. What this strategy does isn’t just deleting a few discrete data points–it theoretically widens and regularizes the forgetting boundary. We ran extensive experiments on medical imaging benchmarks (RSNA-ICH and ISIC2018), and the results show that our approach achieves reasonably good unlearning. For instance, on the ICH dataset, IFF-FCU achieves a highly competitive Error deviation from the retrained gold standard, demonstrating robust improvements over existing baselines. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.26715 [cs.LG] (or arXiv:2605.26715v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.26715 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-39] WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization
链接: https://arxiv.org/abs/2605.26660
作者: Phong Nam Huu Nguyen,Khoi M. Le,Cong-Duy T Nguyen,Anh Tuan Luu,Thong Thanh Nguyen,Tho Quan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often suffer from severe accuracy degradation, while quantization-aware training requires costly retraining and additional resources. Moreover, most mixed-precision strategies rely on coarse-grained or heuristic sensitivity analysis that overlooks fine-grained variations within weight matrices. We propose WINDQuant, a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. Rather than introducing another low-level quantization operator, WINDQuant learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. By operating at the column-chunk level, WINDQuant enables flexible and fine-grained precision assignment within layers under a global target bit-width. The implementation combines PPO with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit effective-bit accounting of the learned mixed-precision plan. Experiments on LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while reducing optimization overhead relative to retraining-based approaches, highlighting reinforcement learning as a practical controller for adaptive mixed-precision quantization.
[LG-40] Sample Complexity of Policy Gradient for Log-Growth Control
链接: https://arxiv.org/abs/2605.26640
作者: Qiuhua Pan,Yukai Shen,Liwei Zhang,Cailian Chen,Xinping Guan
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 43 pages, 4 figures, 2 tables; includes supplementary material
Abstract:We study the sample complexity of policy gradient for log-growth control – the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective J(K) = \mathbbE[\log|1+BK|] is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain K^* always places the noise singularity b_\rm sing(K) = -1/K in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity \tildeO(1/\eta) when the noise density is known and \tildeO(\eta^-(2s+1)/(2s)) when it must be estimated, for C^s noise densities with s \geq 2 .
[LG-41] RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models ICML2026
链接: https://arxiv.org/abs/2605.26632
作者: Xing Cong,Hanlin Tang,Kan Liu,Lan Tao,Lin Qu,Chenhao Xie
类目: Machine Learning (cs.LG)
*备注: 33 pages, 18 figures, Accepted by ICML 2026
Abstract:Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55x speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
[LG-42] PIDM-DP: Physics-Informed Diffusion with Dormand-Prince Integration for Chaotic System Identification and State Reconstruction across Multiple Dynamical Regimes
链接: https://arxiv.org/abs/2605.26619
作者: Shailendra Dabral
类目: Machine Learning (cs.LG)
*备注: extended work of my journal paper submission
Abstract:Reconstructing continuous state trajectories of chaotic dynamical systems from sparse, noisy observations remains a fundamental open problem in nonlinear science. We introduce the Physics-Informed Diffusion Model with Dormand-Prince Integration (PIDM-DP), which embeds a fully differentiable 5th-order Dormand-Prince (DP-RK45) ODE integrator directly into the reverse sampling loop of a Denoising Diffusion Probabilistic Model (DDPM). At each denoising step, physics residuals are back-propagated via automatic differentiation, constraining every generated trajectory to satisfy the system’s governing equations to 5th-order accuracy. A linear-scheduled guidance mechanism that ramps the physics weight from zero at high noise levels to its full value near the clean-data limit prevents the gradient explosions that cause naive physics-informed approaches to fail on stiff systems with Jacobian eigenvalues of order O(10^3) . Evaluated across five benchmark systems of increasing complexity 3D Lorenz, 3D Rössler, 5D Hyperchaotic, 20D Lorenz-96, and the stiff 3D Rabinovich-Fabrikant at 10% observation density with additive Gaussian noise ( \sigma=0.05 ), PIDM-DP achieves reconstruction RMSE improvements of up to 15.4\times over an unconstrained diffusion baseline and decisively outperforms the Ensemble Kalman Filter on stiff systems where ensemble covariance collapses. On the Rabinovich-Fabrikant out-of-distribution benchmark, PIDM-DP attains RMSE 0.1097 \pm 0.0269 versus 0.9443 \pm 0.5288 (unconstrained diffusion, 8.6\times worse) and 0.3561 \pm 0.3040 (EnKF, 3.2\times worse), with p0.001 in paired Wilcoxon tests ( N = 30 ). Topological validation via the Rosenstein Lyapunov estimator confirms that PIDM-DP preserves the chaotic invariant measure.
[LG-43] Near-Optimal Regret in Adversarial Kernel Bandits
链接: https://arxiv.org/abs/2605.26585
作者: Yu-Jie Zhang,Hao Qiu,Jonathan Scarlett,Kevin Jamieson
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the adversarial kernel bandit problem, in which the loss at each round is induced by an arbitrary bounded element of a reproducing kernel Hilbert space (RKHS). We propose an exponential-weights algorithm built on a regularized importance-weighted loss estimator, together with an explicit correction term that cancels the bias introduced by the regularization. Our main result bounds the regret by \widetildeO\big(\sqrtT, d_(\lambda),\log|X|\big) , where d_(\lambda) is a widely-adopted notion of effective dimension that captures the complexity of the kernel. Up to logarithmic factors, this matches the known rate achieved in the related stochastic kernel bandit problem. A notable application is the Matérn (\nu,d) kernel with smoothness parameter \nu on \mathbbR^d , for which our bound specializes to \widetildeO\big(T^(\nu+d)/(2\nu+d)\big) , improving over the best-known prior rate of Chatterji et al. [2019] while simultaneously removing the rank-one adversary assumption required by their analysis. Moreover, this rate is the same as the known optimal rate for stochastic kernel bandits, and also matches a lower bound from concurrent work up to a \log T factor.
[LG-44] Focal Reward: Balanced Reinforcement Learning under Rubric-Based Rewards
链接: https://arxiv.org/abs/2605.26579
作者: Yu Huang,Zihua Zhao,Zhaoxin Huan,Wanli Gu,Feng Hong,Xinmu Ge,Lin Yuan,Weichang Wu,Qiang Hu,Xiaolu Zhang,Jun Zhou,Jiangchao Yao
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:The open-ended generation in LLMs usually requires multi-dimensional rubrics to adequately assess quality and guide the improvement of reinforcement learning. However, a critical dilemma inherent in this training paradigm is the imbalanced reward polarization along different rubric dimensions. Under this bottleneck, even if LLMs achieve relatively high rewards after training, they may still exhibit severe deficiencies in certain dimensions, leading to a direct deterioration in user experience. To address this problem, we propose Focal Reward, a novel objective to automatically balance the training of reinforcement learning under rubric-based rewards. Specifically, we first leverage an inverse reward projection mechanism to estimate the saturation degree of each criterion in the rubric, which forms the basis to calibrate the reward direction. Then, the final objective is designed with an automatically reweighting coefficient for each criterion to achieve the fine-grained balancing. Extensive experiments across three model scales and six benchmarks demonstrate that our Focal Reward method outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons. Rollout, mechanism, and ablation analyses further show that these gains arise from online, saturation-aware reallocation toward rubrics that still have room for improvement.
[LG-45] Separate Aggregation of Split Network for Personalized Federated Learning
链接: https://arxiv.org/abs/2605.26571
作者: Yunseok Kang,Jaeyoung Song
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning enables collaborative model training without sharing raw data, but its performance can degrade substantially under heterogeneous client data distributions. A single global model often cannot satisfy diverse client requirements, so personalized federated learning has therefore been explored to improve client specific performance while preserving global generalization. Existing PFL methods often face a fundamental tradeoff in which stronger global sharing can undermine local specialization, whereas stronger local adaptation can lead to overfitting under limited data, label imbalance, and missing class scenarios. In this work, we propose PGFedSplit, a personalized federated learning framework that improves both personalization and global generalization under severe client heterogeneity. PGFedSplit adopts a split architecture and performs adaptive aggregation scheduling tailored to the roles of different model components, enabling stable knowledge sharing while maintaining client specific adaptation. Each client further leverages a mixture of locally extracted representations and synthetic representations generated from server side Gaussian statistics, improving robustness under label imbalance and missing class conditions. Extensive experiments on Fashion MNIST, CIFAR 10, CIFAR 100, and Tiny ImageNet demonstrate consistent improvements over state of the art PFL methods, with stable convergence and superior personalization in highly heterogeneous settings.
[LG-46] Distribution-Aware Conformal Prediction: A Framework for generating efficient prediction intervals for time series
链接: https://arxiv.org/abs/2605.26569
作者: Daniel Schweizer,Peter Kuhn,Jayant Sharma,Shivali Dubey,Malte von Ramin,Christoph Brockt-Haßauer
类目: Machine Learning (cs.LG)
*备注: submitted to Journal of Machine Learning Research (JMLR)
Abstract:We present Distribution-aware Conformal Prediction (DCP), a unified framework integrating probabilistic predictors like Monte Carlo dropout, deep ensembles, and quantile regression with score-agnostic conformal calibration to produce valid and efficient prediction intervals. Leveraging a numerical inversion approach to construct interval bounds, DCP accommodates arbitrary combinations of distribution generating predictors and nonconformity scores. Benchmark analysis on synthetic and real-world time series data demonstrate DCP’s ability to adaptively calibrate prediction intervals under varying uncertainty regimes. Crucially, DCP’s modular design facilitates plug-and-play experimentation with different predictor-score pairings, quantitatively supported by a newly introduced modified Winkler score that balances validity and efficiency by explicitly penalizing undercoverage. While DCP generalizes and extends existing approaches like Conformalized Quantile Regression and Conformalized Monte Carlo, its modular design allows further extensions, setting a foundation for advancing uncertainty quantification in dynamic environments and high-risk applications.
[LG-47] Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting KDD2026
链接: https://arxiv.org/abs/2605.26562
作者: Shuang Liang,Chaochuan Hou,Xu Yao,Shiping Wang,Hailiang Huang,Songqiao Han,Minqi Jiang
类目: Machine Learning (cs.LG)
*备注: accepted by KDD 2026 Datasets and Benchmarks Track
Abstract:While previous research in multivariate time series forecasting has focused on developing complex holistic models, this work advocates for a shift toward a granular, component-level understanding of their impacts. We propose TSCOMP, the first large-scale benchmark that systematically deconstructs deep forecasting methods into their core, fine-grained components–spanning series preprocessing, encoding strategies, network architectures including specific and large time-series models, and optimization methods. Using constrained orthogonal experimental design and extensive evaluations, we conduct multi-view analyses that reveal component effectiveness across different backbones, data characteristics, and their interactions. Beyond providing insights, this benchmark establishes a fine-grained performance corpus comprising over 20,000 model-dataset evaluations, which supports the learning of automated component selection, enabling zero-shot model construction on new datasets. Our experiments demonstrate that the corpus-driven approach, despite its simplicity, consistently outperforms state-of-the-art methods, validating the soundness of our evaluation design and confirming that systematic component selection surpasses manually designed complex architectures. All code and the performance corpus are publicly available at this https URL.
[LG-48] SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
链接: https://arxiv.org/abs/2605.26548
作者: Hwiwon Lee,Jiawei Liu,Dongjun Kim,Ziqi Zhang,Chunqiu Steven Xia,Lingming Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific descriptions, or vulnerability-reproduction tasks. We present SEC-bench Pro, a benchmark for measuring agent bug hunting on critical, high-complexity software systems. This work discloses reports with concrete PoC inputs and links fixes into reproducible tasks through a three-phase pipeline for vulnerability collection, environment reconstruction, and oracle-based validation. We instantiate SEC-bench Pro with 183 validated vulnerabilities across V8 and SpiderMonkey, including a V8 subset with more than 1.5 million in cumulative Google Vulnerability Reward Program awards. These instances span memory-safety, sandbox, JIT, and race-condition bugs under browser-grade and runtime-grade execution conditions. Our evaluation shows that coding agents with frontier models remain below 40% success on both evaluated engines. The open-weight Kimi-K2.6 baseline reaches 11.7% on V8, while the strongest frontier configuration reaches 32.0% on V8 and 38.8% on SpiderMonkey. ClaudeCode and Codex solve complementary instance sets, and their two-agent union reaches 37.9% on V8 and 48.8% on SpiderMonkey. SEC-bench Pro provides robust environments for assessing LLM-based security agents and exposes limitations in long-horizon bug hunting tasks.
[LG-49] Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
链接: https://arxiv.org/abs/2605.26526
作者: Kevin Kuo,Chhavi Yadav,Virginia Smith
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: main body: 9 pages, 3 figures
Abstract:Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet, pretrained LLMs already encode substantial harmful knowledge across many domains, which raises an important question: can an adversary jailbreak safeguarded models, to achieve harmful usage without fine-tuning at all? In this paper, we show that open-weight safeguards are susceptible to simpler strategies that, despite being well known, have not been systematically evaluated against these safeguards. Specifically, we evaluate two low-cost attacks–abliteration and prefilling–that do not rely on gradient-based optimization. Across three harmfulness evaluation benchmarks (BeaverTails, HarmBench, and AdvBench), these attacks increase attack success rates against safeguarded open-weight models from below 10% to a range of 16%-96%. To mitigate this vulnerability, we introduce abliteration-resistant tuning (ART), which incorporates an abliteration-based objective into training. ART can be layered onto existing defenses and reduces the success rates of abliteration, prefilling, and their combination by 10%-20%. These findings indicate that the attack surface for open-weight models is broader than previously characterized, and that evaluations of safeguarding defenses should incorporate a more diverse set of attack strategies beyond adversarial fine-tuning.
[LG-50] SIKA-GP: Accelerating Gaussian Process Inference with Sparse Inducing Kernel Approximations for Bayesian Deep Learning ICML
链接: https://arxiv.org/abs/2605.26509
作者: Wenyuan Zhao,Rui Tuo,Chao Tian
类目: Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
*备注: 20 pages, 8 figures; accepted to International Conference on Machine Learning (ICML) 2026
Abstract:Gaussian processes (GPs) provide a principled Bayesian framework for uncertainty estimation, but their computational complexity severely limits scalability to large datasets. We propose SIKA-GP, which accelerates GP inference using sparse inducing kernel approximations based on a dyadic ordered template basis, incurring only O(\log M) complexity dependence on the number of inducing points. Our approach constructs compact and expressive kernel representations from sparsely activated bases, enabling efficient tensorized GPU computation and seamless integration with modern large-scale models. SIKA-GP can be naturally embedded into Bayesian neural networks (BNNs) with sparse activations, yielding significant speedups in both training and inference without sacrificing predictive performance. The method naturally extends to deep feature learning, addressing the scalability challenges introduced by deep architectures and high-dimensional feature representations. Empirical results on vision and transformer-based language benchmarks demonstrate that our approach consistently delivers fast and accurate GP models, providing a principled path toward scalable kernel learning.
[LG-51] PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design
链接: https://arxiv.org/abs/2605.26502
作者: Runtian Wang,Renhao Xue,Baige Chen,Hao Wu
类目: Machine Learning (cs.LG); Optics (physics.optics)
*备注: 8 pages, 3 figures
Abstract:The inverse problem of multilayer thin-film optical coatings design represents a complex combinatorial-continuous optimization challenge. We present PRISM (Position-encoded Regressive Inverse Spectral Model), a unified decoder-only autoregressive transformer that streamlines this process by jointly predicting discrete material selection and continuous thickness regression within a single backbone. PRISM introduces two primary architectural innovations: (1) spectrum prefix conditioning, which utilizes standard prefix tokens for in-context target injection, and (2) cumulative-depth Rotary Position Embeddings, which encode continuous thickness directly into the positional representation to preserve the physical spatial relationships of the stack. Our benchmarks demonstrate that a PRISM-13M model reduces MAE by over 50% compared to other transformer baselines while utilizing only one-fifth of the parameters. Furthermore, a 44M-parameter variant achieves state-of-the-art performance (MAE = 0.010) on our in-distribution validation benchmark and operates significantly faster than simulated annealing, offering a highly efficient alternative to classical optimization methods.
[LG-52] he Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training
链接: https://arxiv.org/abs/2605.26489
作者: Hongtao Zhang,Wenjie Zhou,Chenxi Jia,Wei Chen,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language model pre-training typically exhibits a two-phase trajectory: a fast initial loss drop followed by a prolonged slow improvement. We identify an underlying spectral phenomenon, Stability of Singular Distribution (SoSD), where the trace-normalized singular value spectrum stabilizes early, even as parameter matrices continue to evolve. We demonstrate that synchronization between SoSD and the slow-descent regime is widely observed across diverse architectures (GPT-2, LLaMA) and settings, including various schedules (Step-wise, WSD, Cosine Decay), weight decays, and optimizers (AdamW, Muon). By analyzing a simplified Transformer, we prove that growing weight norms inevitably precipitate an early SoSD threshold, after which the rate of loss decrease becomes theoretically bounded by the variation in the singular distribution. We further interpret strategies like WSD and Muon through their ability to modulate the SoSD scale, offering a spectral lens for understanding efficient pre-training dynamics.
[LG-53] Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
链接: https://arxiv.org/abs/2605.26484
作者: Wenjie Zhou,Bohan Wang,Hongtao Zhang,Chenxi Jia,Wei Chen,Xueqi Cheng
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a \textbfRank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive \emphmerged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a \emphriver-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose \textbfExtra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer \citepjordan2024muon.
[LG-54] Variational Inference for Evidential Deep Learning
链接: https://arxiv.org/abs/2605.26477
作者: Jiawei Tang,Xinyan Du,Hui Liu,Junhui Hou,Yuheng Jia
类目: Machine Learning (cs.LG)
*备注:
Abstract:While Deep Neural Networks (DNNs) achieve remarkable performance, their tendency to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we found that the conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that only suppresses the evidence of negative classes, producing excessively high evidence therefore decreasing the model’s ability to quantify uncertainty, and an absence in theoretical guarantee of setting Dirichlet parameter \alpha=e+1 . In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting \boldsymbol\alpha = \mathbfe + \mathbf1 can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance, showing excellent performance in out-of-distribution detection, noise detection and autonomous driving scenario. The code is available in this https URL.
[LG-55] MuCon: Clipped Muon Updates for LLM Training
链接: https://arxiv.org/abs/2605.26459
作者: Albert Yi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Muon-style optimizers take a matrix-valued momentum or preconditioned update B = U \operatornamediag(\sigma_1,\ldots,\sigma_r) V^\top and replace it with its canonical partial polar factor \operatornamePol(B) = U V^\top . This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, D^\mathrmMuCon_\tau(B) = \operatornameMClip_\tau(B) = U \operatornamediag\bigl(\min\sigma_i,\tau\bigr) V^\top, \qquad \tau 0 . Thus, \operatornameMClip_\tau denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon’s polar direction. The Muon/MuCon scaling parameterization used in this work is called \textSpectralP : it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map \operatornameMClip_\tau is the Frobenius projection onto the spectral-norm ball \X : |X|_2 \le \tau\ : it leaves singular values at or below \tau unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
[LG-56] Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning
链接: https://arxiv.org/abs/2605.26452
作者: Dhruv S. Kushwaha,Zoleikha A. Biron
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 17 pages, 7 figures
Abstract:Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor–critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective. All code is available at \hrefthis https URLGithub Repository.
[LG-57] FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis MICCAI2026
链接: https://arxiv.org/abs/2605.26423
作者: Peiyu Duan,Jiyao Wang,Nicha C. Dvornek,Junlin Yang,Ziqi Gao,Lawrence H. Staib,James S. Duncan
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: MICCAI 2026 Early Accepted
Abstract:Task-based fMRI provides a direct readout of task-evoked neural dynamics, but it is expensive and difficult to acquire at scale, motivating rest-to-task synthesis from widely available resting-state fMRI (rsfMRI). We propose FM-fMRI, an event-conditioned flow-matching model that learns a continuous-time conditional vector field to generate task ROI time series from a subject’s rsfMRI and the task event information. The formulation enables fast ODE-based sampling and flexible conditioning over heterogeneous event schedules. Rather than optimizing for pointwise reconstruction, we evaluated generated signals using complementary criteria that probe temporal and spectral structure, subject and group-level connectome consistency, and distributional alignment. On the public Human Connectome Project and internal BioPoint autism cohort, FM-fMRI achieves the strongest spectral and connectivity agreement and improved distribution-level matching over conditional diffusion, generative adversarial networks (GANs), and variational autoencoders (VAEs) baselines. Furthermore, we augment the BioPoint cohort by synthesizing task-fMRI ROI time series with our method, improving downstream autism classification and demonstrating practical utility in data-limited clinical settings. The code will be available on GitHub.
[LG-58] Amortized Factor Inference Networks for Posterior Inference
链接: https://arxiv.org/abs/2605.26419
作者: Joohwan Ko,Justin Domke
类目: Machine Learning (cs.LG)
*备注:
Abstract:Amortized inference promises fast test-time Bayesian inference, but existing methods are inherently tied to fixed models. Extending amortization to unseen models typically requires retraining or costly test-time finetuning. In this paper, we ask: is it possible to build a single inference network capable of generalizing across varying priors, likelihoods, and dimensionality? We introduce Amortized Factor Inference Networks (AFINs), a family of encode-merge-decode inference networks built on dimension-independent modules that map a model specification and its observations to the parameters of a variational posterior. Experimentally, a single trained AFIN achieves posterior accuracy comparable to NUTS and several variational inference methods, while requiring 2 to 4 orders of magnitude less test-time compute. Code is available at this https URL.
[LG-59] Function-Valued Causal Influence in Nonlinear Time Series
链接: https://arxiv.org/abs/2605.26408
作者: Valentina V. Kuskova,Dmitry Zaytsev,Michael Coppedge
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 26 pages, 6 tables, 8 figures
Abstract:Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
[LG-60] Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret Geometric Barrier and Bandit Feedback
链接: https://arxiv.org/abs/2605.26373
作者: Anas Barakat,Andreas Kontogiannis,Vasilis Pollatos,Ioannis Panageas,Antonios Varvitsiotis
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 43 pages
Abstract:We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding \mathcalO(T^2/3) regret. They left open whether the optimal \Theta(\sqrtT) regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves \mathcalO(\sqrtT) regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers \Omega(T) regret. Finally, we extend our analysis to one-point bandit feedback and prove a \mathcalO(T^3/4) expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.
[LG-61] Balancing Plasticity and Stability with Fast and Slow Successor Features ICML
链接: https://arxiv.org/abs/2605.26357
作者: Raymond Chua,Doina Precup,Blake Richards
类目: Machine Learning (cs.LG)
*备注: Main Paper: 9 pages, 9 figures. Accepted at The International Conference on Machine Learning (ICML) 2026
Abstract:A hallmark of intelligence is the ability to adapt in non-stationary environments, yet deep Reinforcement Learning (RL) agents often struggle in such settings. Prior studies introduce non-stationarity through abrupt shifts in features or dynamics, whereas real-world environments often evolve gradually through continual drift. This distinction has important implications for the “stability-plasticity dilemma” in RL, as abrupt task changes may demand more plasticity than naturalistic settings. To address this, we modify existing 3D Miniworld and MuJoCo environments to incorporate naturalistic, continual non-stationarity, and use them to examine how stability and adaptation affect performance under continuous environmental change. We find that methods favoring stability, such as synaptic consolidation, outperform approaches focused on plasticity, such as parameters resetting. Motivated by this result, and prior evidence that Successor Features (SFs) reduce interference, we investigate whether SFs are better consolidation targets than Q-values. Across both environments, applying neuro-inspired synaptic consolidation to SFs yields superior performance on continually changing settings. Moreover, consolidation is most effective when SFs are stabilized across multiple timescales, which capture complementary aspects of gradual environmental change. Together, these results suggest that stability is more critical in continual learning when changes are gradual, and that multi-timescale consolidation of predictive representations is an effective approach.
[LG-62] MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability
链接: https://arxiv.org/abs/2605.26343
作者: Barsat Khadka
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem. An agent operates over the 144 attention heads of GPT-2 small as a discrete action space; each action triggers a zero-ablation and a contrastive reward that subtracts the ablation’s damage to general next-token prediction from its damage to the target task. A single PPO policy, trained on two tasks (induction and IOI) in a vectorised multi-task environment, attains the per-episode oracle on both training tasks and on a held-out third task (docstring completion). Its preferred heads coincide with the canonical heads of established literature on precisely the axes those papers identify as causally non-redundant under single-head ablation; the categories they identify as redundant are correctly de-prioritised by the agent. On the held-out task, best-of-five planning recovers 96% of the oracle ceiling with no task signal supplied at evaluation. These results indicate that reinforcement learning over causal interventions is a viable, transferable substrate for identifying the single-head bottlenecks of mechanistic circuits, complementary to existing path-patching approaches.
[LG-63] A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning
链接: https://arxiv.org/abs/2605.26341
作者: Thien V. Nguyen,Amaury Habrard,Benjamin Guedj
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDE), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.
[LG-64] Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storag e
链接: https://arxiv.org/abs/2605.26327
作者: Alan Milligan,Zikun Xu,Simon Lacoste-Julien,Felix Dangel,Wu Lin
类目: Machine Learning (cs.LG)
*备注: Preprint, working in progress
Abstract:Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, using BFloat16 (BFP16) storage to reduce memory usage can degrade the performance of Shampoo-based methods. We propose a reparametrization of the preconditioner that supports BFP16 storage and forms a complete basis by combining updated basis vectors with unchanged ones. By updating only part of the basis through QR decomposition in a subspace, our approach reduces computational overhead while mitigating the performance degradation caused by BFP16 storage. Our approach applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo, SOAP, and KL-SOAP. In particular, it improves the performance of SOAP and KL-SOAP under BFP16 storage, enabling KL-SOAP to match or exceed KL-Shampoo. Overall, our approach makes Shampoo-based methods more memory- and time-efficient.
[LG-65] Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks
链接: https://arxiv.org/abs/2605.26310
作者: Ungvári Gergő,Ferenc Braun,Attila Ámon,Péter Kackstädter,János Volk,Péter Kovács,Tamás Dózsa
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 19 pages, 4 figures
Abstract:The detection of unmanned aerial vehicles (UAVs) is important for the protection of civilian and military infrastructure. In this paper we propose a cost effective UAV detection system using sound signals obtained from microphones. The recorded signals are passed through a signal processing pipeline which employs interpretable adaptive feature extractors using so-called rational Gaussian wavelets. These adaptive wavelet transformations are embedded into and trained together with an underlying small neural network which detects and classifies UAVs based on the obtained features. This leads to a physically interpretable machine learning algorithm that in addition to classifying UAVs is also capable of detecting UAV swarms. We demonstrate our results using data collected in indoor studio and noisy outdoor environments. We conclude that the proposed method outperforms traditional machine learning approaches for detecting and classifying single UAVs as well as drone swarms, while retaining a high degree of interpretability. Our implementation of the proposed methods is made publicly available for reproducibility.
[LG-66] Dynamic Link Prediction with Temporally Enhanced Signed Graph Neural Networks
链接: https://arxiv.org/abs/2605.26290
作者: Derek Regier,Andrew Polyak,Aresh Dadlani,Khosro Salmani
类目: Machine Learning (cs.LG)
*备注: 11 pages
Abstract:Temporal signed networks (TSNs) model the time evolution of cooperative and adversarial relationships that arise in applications such as social media analysis, trust and reputation systems, and financial transaction networks. While graph neural networks (GNNs) perform well for static or unsigned link prediction, effective learning in temporal signed graphs remains challenging due to the interaction of signed relations, evolving structure, and balance-theoretic constraints. To address this gap, we propose a \emphmodular temporal enhancement framework for signed GNNs that integrates historical context into otherwise static architectures. The framework introduces a Historical Context Integration Module (HCIM) that combines learnable recency-aware temporal weighting, LSTM-based embedding trajectory modeling, and multi-head temporal attention to capture both short- and long-term signed interaction dynamics. Historical information is fused with current node representations using either global or node-adaptive weighting, allowing the architecture-agnostic framework to accommodate heterogeneous temporal behaviors. We instantiate the approach on the Self-Explainable Signed Graph Transformer (SE-SGformer), preserving interpretability while extending it with temporal awareness. Experiments on real-world and synthetic TSNs, including Bitcoin OTC, Bitcoin Alpha, Reddit, and small-world network models, demonstrate consistent and statistically significant improvements over the static baseline.
[LG-67] Stateful Inference for Low-Latency Multi-Agent Tool Calling
链接: https://arxiv.org/abs/2605.26289
作者: Victor Norgren
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the O(n_t) per-turn cost of conventional serving into an O(\Delta_t) delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is 2.1\times faster per turn on a 6-turn agentic workflow and 4.2\times on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2605.26289 [cs.LG] (or arXiv:2605.26289v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2605.26289 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-68] wo-Parameter Flows for Learning Population Dynamics of Physical Systems
链接: https://arxiv.org/abs/2605.26285
作者: Paul Schwerdtner,Tobias Blickhan,Benjamin Peherstorfer
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:This work addresses the problem of learning the dynamics of high-dimensional probability densities over time using unlabeled samples, without assuming access to trajectory information. We introduce two-parameter flows that learn only sampling-time transports from a base distribution to each marginal and then extract a physics-time velocity by regressing on coupled synthetic trajectories. We prove that the resulting physics-time dynamics are unique and inherit regularity from the sampling-time transports. Because we can build on standard, well-developed conditional flow matching techniques for learning the base-to-marginal transports, our approach scales to high dimensions and avoids per-step optimal-transport couplings, while allowing admissible non-gradient dynamics that can naturally explain rotational or circulating physics phenomena.
[LG-69] Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization
链接: https://arxiv.org/abs/2605.26282
作者: Xiaoyuan Cheng,Wenxuan Yuan,Zhancun Mu,Yuanzhao Zhang,Yiming Yang,Hai Wang,Zhuo Sun,Che Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.
[LG-70] he Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works ICML2026
链接: https://arxiv.org/abs/2605.26246
作者: Guanghui Wang,Kaiwen Lv Kacuila,Zhiyong Yang,Zitai Wang,Jin-Wen Wu,Longtao Huang,Qianqian Xu,Qingming Huang
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2026
Abstract:Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or the teacher’s full next-token distribution (soft labels). Despite soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at this https URL.
[LG-71] Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks
链接: https://arxiv.org/abs/2605.26243
作者: Zhishuai Guo,Wenhan Wu,Chen Chen,Lei Zhang,Olivera Kotevska,Ravi K Madduri
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN methods either ignore cross-client links, leading to degraded accuracy, or require frequent embedding exchanges, incurring substantial communication and privacy costs. We propose CE-FedGNN, a communication-efficient and privacy-preserving federated GNN framework for learning over such coupled graphs. Our approach avoids sharing raw data or per-round embeddings by infrequently exchanging aggregated node representations. To handle cross-client dependency and staleness, we introduce a moving-average estimator that continuously tracks node representations and enables their stable reuse across rounds. To provide formal privacy guarantees for the released representations, we adopt the metric differential privacy (metric-DP) framework, which measures privacy with respect to distances in the learned embedding space rather than worst-case input perturbations. This yields meaningful guarantees at noise levels where standard differential privacy becomes overly conservative. We establish convergence to a stationary point at a rate of O(1/\sqrtT) with O(T^3/4) communication complexity. In addition, we derive (\varepsilon,\delta) -metric-DP guarantees via Rényi differential privacy composition under a public-cohort threat model. Experiments on synthetic interbank anti-money laundering benchmarks and citation networks demonstrate that CE-FedGNN achieves strong performance while significantly reducing communication and maintaining robustness under privacy-preserving noise.
[LG-72] From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD
链接: https://arxiv.org/abs/2605.26222
作者: Christoph H. Lampert,Hossein Zakerinia
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages
Abstract:Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with (Dwork et al, 2015)'s classic result for \epsilon -differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
[LG-73] On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series
链接: https://arxiv.org/abs/2605.26194
作者: Sharmita Dey,Diego Paez-Granados
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting). These constraints make foundation-model pretraining appealing, but raises an important question of which inductive biases should the pretraining objective impose so that representations transfer across task types and subjects. We study this question in pathological gait analysis for spinal cord injury (SCI) via PathoFM, an encoder-centric transformer pretrained on multivariate gait windows with three complementary objectives: Local Completion (reconstruct contiguous masked spans to enforce local structure), Temporal Continuity (predict a masked mid-horizon continuation from an observed prefix to enforce smoothness and causal consistency), and Unsupervised In-Context Dynamics (support-query reconstruction conditioned on subject exemplar windows via attention). Empirically comparing objective families (grouping/contrastive, dynamics-based, and generative reconstruction), we find that dynamics-centric mixtures produce the most balanced transfer: grouping objectives favor discriminative margins but can degrade magnitude fidelity needed for continuous targets, whereas reconstruction-only objectives preserve waveform structure but may underperform on classification. Overall, combining local reconstruction with temporal continuity, and adding in-context conditioning when exemplar access is realistic, yields robust subject-generalizing representations.
[LG-74] ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
链接: https://arxiv.org/abs/2605.26172
作者: Meng Cai,Lars Kulik,Farhana Choudhury
类目: Machine Learning (cs.LG)
*备注: Preprint. 34 pages, 2 figures
Abstract:When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model’s own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-\Delta adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
[LG-75] When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection
链接: https://arxiv.org/abs/2605.26171
作者: Alejandro Ascarate,Leo Lebrat,Rodrigo Santa Cruz,Clinton Fookes,Olivier Salvado
类目: Machine Learning (cs.LG)
*备注: 9+30 pages, 4+4 figures, under review
Abstract:Many practical anomalies are not merely rare inputs, but violations of semantic constraints: objects co-occur in structured ways, actions imply preconditions, and events satisfy temporal or relational regularities. We study anomaly detection in this setting, where constraints are given as logical rules over learned visual concepts, but real rule violations are rare or absent during training. We propose a neural rule evaluator that compiles each constraint into a directed acyclic graph and learns feature-aware subtree MLP gates for its internal logical operators. Each gate maps child features and edge-level negations to a parent representation and a rule-satisfaction probability, with intermediate supervision obtained from exact Boolean propagation over ground-truth concept labels. The key difficulty is that same-image training data often provide insufficient coverage of informative truth configurations and also allow shortcut solutions. To address this, we introduce chimera training: an operand-level counterfactual construction at the feature level. Instead of mixing input images, we concatenate subtree features from different samples; each operand keeps the hard truth label of the sample it came from, and the chimera target is obtained by applying the node’s logical operator to those inherited labels. This supplies supervised logical counterexamples without requiring real anomalous images. Across CLEVRER, OpenImages, and VidOR, the resulting evaluator improves rule-level anomaly AUROC over independent-events and same-image semantic-training baselines, especially for compositional and relational rules. The method yields both scalar anomaly scores and rule-level attributions.
[LG-76] LearnedCache: An eBPF-Integrated Perceptron-Based Eviction Policy for the Linux Page Cache
链接: https://arxiv.org/abs/2605.26168
作者: Zejia Qi
类目: Operating Systems (cs.OS); Machine Learning (cs.LG)
*备注: 11 pages, 12 figures, 4 listings. Policies and harnesses: this https URL . Model and visualizations: this https URL
Abstract:Linux is the foundation of the digital age, accounting for the majority of the cloud and mobile OS markets. Any device that runs Linux uses the Linux page cache, a central pillar in OS and application performance, serving to reduce extraneous disk access. Many page cache eviction policies have been developed but remain bound by the rigidity of heuristics. The rise of AI-driven tools in recent years, melded with the ever-increasing variety of workloads for Linux devices, sets the stage for machine-learning-driven cache eviction policies. Promising research has been done in this field, but only in the field of user-space applications such as CDNs. We develop LearnedCache, an eBPF-integrated single-layer perceptron-based cache eviction policy for the Linux page cache, trained on real kernel data from diverse workloads. We demonstrate median AUCs of nearly 80% over multiple linear models modeling page reuse time, then take a step further by embedding these models inside the Linux kernel for real-time performance evaluation. Through statistical testing over 50 paired trials against a baseline of FIFO for each workload, LearnedCache reveals that machine-learning-derived cache eviction policies are practical in the Linux kernel under representative empirical workloads and are able to surpass conventional FIFO by statistically significant margins of up to 10% in insertion rate, a frequency-adjusted derivation of cache hit rate, in specific workloads while incurring minimal overhead.
[LG-77] Adversarial Water-Filling: Theory Algorithms and Foundation Model
链接: https://arxiv.org/abs/2605.26163
作者: Xindi Tong,Chee Wei Tan,H. Vincent Poor
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Submitted to IEEE Journal of Selected Topics in Signal Processing
Abstract:Competitive resource allocation problems over frequency and space can be formulated as minimax interaction between transmit power and worst-case interference. This formulation naturally arises in multi-operator low Earth orbit (LEO) satellite spectrum sharing, where transmissions from competing constellations interfere in real-time. Under Gaussian channels, AWF is strongly convex–concave on nondegenerate active channels, whereas discrete constellations yield generally nonconvex mercury/water-filling formulations. In this paper we propose the Adversarial Water-Filling (AWF) problem with corresponding theory and algorithms for these real situations. In addition, we develop a wireless foundation model for AWF to learn the AWF search dynamics. The architecture incorporates permutation-invariant channel representations, a constraint-aware graph neural network (GNN) with sparse message passing, and global latent variables capturing the low-dimensional water level implied by the AWF optimality. Through learned projected extragradient iterations, the model approximates stationary solutions of the constrained minimax problem arising under mercury/water-filling. We further show that, under local regularity and contractivity conditions, the learned AWF dynamics converge locally linearly around regular stationary points. Experiments demonstrate empirical generalization across unseen problem sizes, different constraints, and multiple discrete constellations, while achieving more than one-order-of-magnitude runtime improvements over iterative baselines. The related code can be found at this https URL.
[LG-78] Device Context Protocol: A Compact Safety-First Architecture for LLM -Driven Control of Constrained Devices
链接: https://arxiv.org/abs/2605.26159
作者: Dongxu Yang
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 5 figures. Reference implementation, Python package (pip install pydcp), and reproduction scripts at this https URL
Abstract:Large language models are increasingly used as orchestrators of external tools via the Model Context Protocol (MCP), but MCP is built for software services with megabytes of memory and does not descend to the microcontrollers that dominate the long tail of physical devices. Recent work (IoT-MCP) ports MCP to edge gateways at 74 KB peak memory; this still excludes the smallest commodity MCUs and, critically, does not address the safety problem of giving an unreliable caller (an LLM that may hallucinate or be prompt-injected) direct control of physical hardware. We present the Device Context Protocol (DCP): a sub-50-byte typical frame (6-byte header + CBOR payload + optional 16-byte HMAC), a manifest schema in which capability scoping, range and type checks, dry-run evaluation, and units-as-types are protocol-layer primitives, and a host-side Bridge that rejects malformed or hallucinated calls before any byte reaches the device. Reference firmware measures 27.6 KB flash / 0.6 KB RAM on ESP32; the Python Bridge, ESP32 firmware, and a language-neutral conformance suite are MIT-licensed and public. An empirical study – 675 tool calls produced by five LLMs across four vendors (DeepSeek, Alibaba, Zhipu, MiniMax) against six categories of adversarial prompts, with the injection category instantiating AgentDojo’s attack templates – shows DCP rejects 100% of capability-escalation attempts and 78% of prompt-injection attempts, versus 0–1% for Raw MCP and IoT-MCP, matching the expressiveness of a well-formed OpenAPI 3 schema at three orders of magnitude less firmware footprint. We position DCP as the missing layer between MCP (which is moving toward enterprise SaaS connectivity) and the physical devices it does not reach.
[LG-79] Neural Bayesian Sequential Routing
链接: https://arxiv.org/abs/2605.26147
作者: Yongchao Huang
类目: Machine Learning (cs.LG)
*备注: 71 pages
Abstract:Human decision-making is sequential and uncertainty-aware, yet standard neural networks often rely on static, dense forward computation with limited visibility into evidence acquisition, uncertainty evolution, or when computation should stop. We introduce \textbfNeural Bayesian Sequential Routing (NBSR), a framework that models neural inference as active evidence accumulation over a hierarchical Directed Acyclic Graph (DAG). Within a Dirichlet–Categorical conjugate framework, neural experts query a persistent global knowledge oracle to extract positive evidence vectors, which act as pseudo-counts and update a Dirichlet belief state by exact conjugate addition. Coupled with a Gumbel-Softmax Straight-Through estimator, this update enables hard, path-dependent routing while preserving surrogate gradients for end-to-end training. The resulting Dirichlet precision and entropy provide mechanisms for uncertainty quantification, entropy-based early exiting, OOD abstention, and cost-aware evidence acquisition. We prove that, under strictly positive evidence extraction, total Dirichlet precision increases monotonically along any valid trajectory and marginal predictive variance is bounded, formalizing sequential ``hypothesis sharpening’'; under idealized capacity and optimization assumptions, the terminal Dirichlet expectation recovers the Bayes-optimal conditional distribution. Empirical evaluations across visual categorization, structured medical diagnosis, language modeling, partially observable control, and cost-aware Bayesian experimental design show that NBSR achieves competitive predictive performance while providing transparent routing traces, path-dependent evidence attribution, uncertainty-aware decision control, and resource-rational inference. Overall, NBSR offers a mathematically grounded framework for interpretable, modular, and resource-rational agentic AI.
[LG-80] SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection
链接: https://arxiv.org/abs/2605.26135
作者: Venkatakrishnan Gopalakrishnan
类目: Machine Learning (cs.LG)
*备注: 5 pages, 1 figure, 5 tables. Code: this https URL
Abstract:Unsupervised anomaly detection is widely used in transaction fraud detection where labels are scarce. Isolation Forest (IF) is among the most popular classical methods due to its scalability and ease of deployment. We propose SilIF, an augmentation of Isolation Forest that adds a silhouette-based scoring layer computed in a representation space induced by the trees of the forest. For each point, we extract a vector of per-tree path lengths, cluster these “fingerprints” into structural groups, and compute a silhouette score that measures how well the point fits its assigned group versus the nearest alternative. The silhouette signal is combined with the base IF score via a single hyperparameter alpha. On the IEEE-CIS Fraud Detection benchmark (~590K transactions, 3.5% fraud), SilIF with alpha=1.0 improves over plain Isolation Forest by +0.0080 AUC-PR on average across five seeds, with SilIF winning on all five seeds (paired t-test p=0.046). We also report results on a synthetic credit-card dataset (Sparkov) where the silhouette augmentation does not improve over plain IF, and we characterize the conditions that distinguish the two outcomes. The paper presents SilIF as a tunable, easy-to-deploy enhancement to Isolation Forest with honest reporting of when it helps and when it does not. Code at this https URL.
[LG-81] AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion
链接: https://arxiv.org/abs/2605.26130
作者: Somnath Luitel,Manmeet Singh,Joshua Durkee,Abdullah Al Fahad,Naveen Sudharsan,Prabhjot Singh,Cenlin He,Harsh Kamath,Zong-Liang Yang,Krishnagopal Halder,Sandeep Juneja,Parthasarathi Mukhopadhyay,Saptarishi Dhanuka,Amit Kumar Srivastava
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: Somnath Luitel and Manmeet Singh are equal-contribution co-first authors, with Manmeet Singh ( this http URL @wku.edu) as corresponding author
Abstract:Operational weather prediction at kilometer scales remains computationally prohibitive for traditional numerical weather prediction (NWP) models, limiting forecast access for applications in energy, agriculture, and disaster management that require fine-grained spatiotemporal detail. Here we introduce AirCast-SR, a foundation model for atmospheric super-resolution that downscales global AI weather forecasts from 0.25 degree (~28 km) to 1 km horizontal resolution at hourly temporal resolution, producing 67-hour forecasts of eight coupled surface variables simultaneously. EarthMind-SR employs a three-dimensional U-Net conditioned within a Latent Consistency Model (LCM) diffusion framework, trained on patch-based samples over the contiguous United States (CONUS) using GraphCast forecasts as input and NOAA’s Analysis of Record for Calibration (AORC) as the target. The model achieves near-zero bias across all variables and lead times, and its radial power spectral density analysis demonstrates preservation of fine-scale atmospheric structure at wavelengths of 10 km to 100 km where coarser models lose spectral power. We validate EarthMind-SR across three CONUS case studies spanning winter, summer, and spring seasons, and demonstrate zero-shot global transferability over India and Germany using independent surface station observations without any retraining or fine-tuning. As an open-weights foundation model, EarthMind-SR establishes a new paradigm for kilometer-scale AI weather prediction and provides a platform for regional fine-tuning, distillation, and downstream applications in climate services and hazard forecasting.
[LG-82] he Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models
链接: https://arxiv.org/abs/2605.26128
作者: Jaideep Ray
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while solving tasks. The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce \emphconstraint tax, a measurement protocol for isolating the answer and executable-accuracy loss caused by structured-output constraints at fixed model, fixed task distribution, and fixed problem instances. Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural. We also show that the 3B boundary still pays a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
[LG-83] Gaussian Process-based learning with new MCMC-based implementation of Wishart prior on correlation matrix
链接: https://arxiv.org/abs/2605.27093
作者: Kane Warrior,Dalia Chakrabarty
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In probabilstic supervised learning of an input-output relationship - as a sample function of a Gaussian Process (GP) - priors are typically specified for the hyperparameters of the kernel that parametrises the covariance function of the GP, where the induced covariance matrix of the (resulting multivariate Normal) likelihood, governs the learning and prediction. When the sought function is highly multivariate, multiple lengthscale parameters must be learnt simultaneously, making inference difficult. We develop a ``self-assembled’’ Wishart prior for the covariance matrix, while undertaking Bayesian inference on the kernel hyperparameters using MCMC. The construction uses a look-back window over recent MCMC iterations to define a time-step dependent scale matrix, thereby introducing adaptiveness to the chain. Results suggest that direct prior specification on the covariance matrix can be useful for diagnosing weakly informative inputs within the GP-based learning paradigm. We support our prior development with two distinct empirical illustrations - one on synthetic data, and another on a real-world dataset.
[LG-84] Causal Representation Learning for Generalisable Recommendation
链接: https://arxiv.org/abs/2605.27043
作者: Yorgos Felekis,Michael O’Riordan,Oriol Corcoll,Ciarán M. Gilligan-Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.
[LG-85] Constrained Bayesian Experimental Design via Online Planning ICML2026
链接: https://arxiv.org/abs/2605.26990
作者: Yujia Guo,Daolang Huang,Xinyu Zhang,Sammie Katt,Samuel Kaski,Ayush Bharti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 9 figures. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract:Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.
[LG-86] Signal-to-Noise Ratio and Sample Size Govern Representational Alignment in Neural Networks
链接: https://arxiv.org/abs/2605.26973
作者: Ali Hussaini Umar,Alessandro Laio
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Neural networks are known to develop latent representations that are aligned , namely structurally similar across networks trained with different architectures, training protocols, or training datasets. We study this phenomenon in a controlled setting, where we train an ensemble of networks on regression and classification tasks using training sets perturbed by independent realizations of a noise process. We show that the signal-to-noise ratio (SNR) and the training sample size influence the alignment in qualitatively similar ways in networks trained on real-world datasets and in an extremely simple linear network with a single hidden layer, for which the alignment can be estimated analytically. Across linear and nonlinear networks, regression and classification tasks, and both synthetic and real-world data, we consistently observe that alignment varies monotonically with SNR but non-monotonically with training sample size. In particular, the alignment is minimized near the interpolation threshold, and a stronger alignment does not necessarily correspond to better generalization error. These findings reveal a non-trivial dependence of alignment on data quality and quantity, decoupled from generalization performance.
[LG-87] Adaptive Reinforcement Learning for Robust Open Quantum System Control: A Multi-Task Framework with Temporal Optimization
链接: https://arxiv.org/abs/2605.26925
作者: Haftu W. Fentaw,Steve Campbell,Simon Caton
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:We present a Multi-task Soft Actor-Critic (SAC) Reinforcement Learning framework designed for open-system quantum control across diverse Hamiltonians, which learns optimal pulse sequences while simultaneously discovering problem-specific evolution time T and number of control pulse segments N. Experimental results across 51 Hamiltonian variations demonstrate that the multi-task SAC model is able to generate control pulses that can drive a system, under environment noise, from its initial state to its target state with high fidelities, establishing essential foundations for universal quantum control applicable to realistic noisy quantum devices. Through progressive expansion of the training Hamiltonian set, we investigate if a single multi-task model trained using a given number of sample Hamiltonians can successfully accomplish state-transfer tasks for Hamiltonians drawn from the same Hamiltonian space but not encountered during training. In addition, our Robustness Infidelity Measure (RIM) analysis reveals that SAC trained policies exhibit superior robustness to pulse amplitude perturbations and decoherence rate variations compared to GRAPE-optimized controls.
[LG-88] Particle-Lund Multimodality in Jet Taggers
链接: https://arxiv.org/abs/2605.26821
作者: Loukas Gouskos,Benedikt Maier
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
Abstract:The Lund plane offers a physics-motivated, hierarchical representation of QCD radiation within jets, while transformer-based taggers have reached state-of-the-art performance by learning directly from raw particle constituents and their pairwise relations. We investigate whether transformers implicitly capture hierarchical QCD structure from constituent-level inputs, or whether explicit physics representations remain complementary. To test this, we introduce PLuM, a multimodal architecture that projects particle constituents and Lund plane splittings into a shared latent space, processing both jointly with a unified transformer. Cross-attention allows the model to probe whether structured QCD information provides discriminating power beyond what particles alone encode. We observe systematic gains for top-quark and \mathrmH\to\mathrmb\bar\mathrmb tagging, while finding no comparable improvement for \mathrmH\to\mathrmc\bar\mathrmc or \mathrmH\to 4\mathrmq topologies. This selective enhancement suggests that explicit hierarchical information about b-jet formation remains complementary to raw particle representations even in highly expressive architectures, while other topologies are already well-captured at constituent level. For high-impact LHC analyses such as Lorentz-boosted di-Higgs searches in the four \mathrmb quark final state ( \mathrmH\mathrmH(4\mathrmb) ), the gains are substantial: at a 25% di-Higgs efficiency working point, PLuM achieves 25% higher background rejection than the baseline. Our results indicate that physically structured representations of QCD radiation retain discriminating value in the transformer era, motivating further study into how different aspects of jet dynamics are encoded by deep learning algorithms.
[LG-89] Neural Autoregressive Control Variates for the Quantum Monte Carlo Sign Problem
链接: https://arxiv.org/abs/2605.26814
作者: Bei Qiao,Lei Wang
类目: rongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 18 pages, 9 figures
Abstract:We train a pair of autoregressive models to construct zero-mean control variates to mitigate the sign problem in quantum Monte Carlo simulations. The two autoregressive networks are confined to the positive- and negative-sign sectors with strictly disjoint support, and each is exactly normalized over its sector. Their difference is therefore structurally zero-mean, providing an unbiased auxiliary observable whose correlation with the sign estimator controls the variance reduction. We implement the method within the stochastic series expansion framework, which we extend to frustrated lattices by developing an incremental loop-topology update. Sign-ergodic sampling is achieved through a twist channel, which is the unique sign-changing mechanism on non-bipartite lattices. We implement the control variates as autoregressive transformers with an end-of-sequence parity mask that enforces exact sign-sector resolution, while the incremental loop-count change and cumulative frustration parity are incorporated as topological features. On the triangular-lattice Heisenberg antiferromagnet, we benchmark the method in the small- N limit. The control variate reduces the standard error of the average sign by up to an order of magnitude and that of the energy estimator by a factor of three to five, remaining effective even when the average sign drops below 10^-3 . This work lays out the framework and provides a proof-of-principle demonstration that autoregressive control variates can effectively mitigate the sign problem. Scaling to larger systems with physics-informed architectures is the subject of future work.
[LG-90] ransformers Can Learn Posterior Predictive Distributions In-Context
链接: https://arxiv.org/abs/2605.26713
作者: Gyeonghun Kang,Changwoo J. Lee,Xiang Cheng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Prior-data fitted networks (PFNs) have recently emerged as a powerful approach for Bayesian prediction tasks, approximating the posterior predictive distribution (PPD) through in-context learning. Despite their strong empirical performance and ability to go beyond point predictions, theoretical understandings of the algorithmic capability of transformers to learn distributions in context are still lacking. Focusing on Gaussian process regression problems, we show by construction that transformers can implement a gradient descent algorithm targeting the posterior predictive mean and variance, followed by nonlinear mappings that yield binned probabilities of PPD. We study the error bounds of the approximated PPD in terms of attention depth and bin resolution. Based on these results, we further demonstrate the key role of normalization and the choice of attention depth in enabling the extrapolation abilities of transformers beyond the pretraining sample size range. We conduct simulations that corroborate our findings, providing insight into the expressivity of PFNs targeting PPDs and how architectural choices may influence generalization capabilities.
[LG-91] Proper Calibeating
链接: https://arxiv.org/abs/2605.26703
作者: Dean P. Foster,Sergiu Hart
类目: Theoretical Economics (econ.TH); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The classic concept of “calibrated forecasts” and its more recent refinement, “calibeating,” are defined with respect to the standard quadratic scoring rule. We extend these notions to the class of \textitproper scoring rules (for which the best forecast is the true distribution) and define \textitproper-calibration and \textitproper-calibeating by requiring the errors to converge to zero uniformly over all bounded proper scoring rules. We first establish that calibration always implies proper-calibration, whereas calibeating need not imply proper-calibeating. Second, we show how to guarantee proper-calibeating and proper-multicalibeating. Finally, we demonstrate the equivalence between proper-calibration and universal no regret when best replying to forecasts in decision-making under uncertainty.
[LG-92] CART Random Forests as Sequential Allocation over Random Opportunity Sets: A Stochastic-Control Theory of Ensemble Risk
链接: https://arxiv.org/abs/2605.26675
作者: Tianxing Mei,Yingying Fan,Mingming Leng,Jinchi Lv
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 69 pages, 1 figure
Abstract:CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we develop a stochastic-control perspective on feature-subsampled CART random forests, named CART random opportunity-set allocation (CART-ROSA). At each node, the random subset of features is interpreted as a random feasible action set, and the CART split rule as a masked-action allocation policy. This policy induces a controlled stochastic process over informative split-count states, whose terminal law determines both single-tree error and cross-tree interaction terms in the forest mean squared error (MSE). Such representation opens the black box of CART-forests by separating two design levers: the informative-opportunity rate induced by feature subsampling, and the contraction strength from the within-mask split policy. We establish that the CART policy is locally stabilizing: it contracts imbalances in informative split allocations and concentrates terminal tree geometry. At the system level, however, it can be globally suboptimal for the forest objective. Specializing to the linear model, we derive the MSE risk expansion explicitly. Our results show how an operations-research perspective makes tractable a theoretical gap difficult to access from the standard algorithmic description of CART forests.
[LG-93] Data-driven sparse identification of governing PDEs via knockoff filters and multi-criteria trade-offs
链接: https://arxiv.org/abs/2605.26631
作者: Pongpisit Thanasutives,Naichang Ke,Yoshinobu Kawahara
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 42 pages, 5 figures, 10 tables
Abstract:We propose KO-PDE-IDENT, a data-driven framework for identifying parsimonious partial differential equations (PDEs) with false discovery rate (FDR) control. PDE discovery from noisy observations is often hindered by extreme multicollinearity among candidate terms, which causes typical sparse-regression methods to select spurious terms. To address this problem, KO-PDE-IDENT initially mines a support set of potential candidate terms via model-X knockoff filters with finite-sample FDR control, then refines and ranks the surviving PDE alternatives. The framework integrates three components. First, knockoff feature statistics are constructed by coupling \ell_0 -constrained adaptive best-subset selection with SHapley Additive exPlanations (SHAP), yielding an effective and computationally efficient difference statistic. Second, a recursive feature elimination (RFE) procedure removes terms whose marginal contributions are dispensable and assesses statistical necessity through knockoff-perturbed hypothesis testing. Third, the final model selection is formulated as a multi-criteria decision-making (MCDM) problem, where the optimal governing equation is the alternative that best balances a wide range of criteria such as predictive accuracy, model complexity and coefficient uncertainty. We validate KO-PDE-IDENT on five canonical PDEs under severe noise corruption. Empirical results show that our framework can exactly recover the true PDE structure, eliminating false discoveries while retaining all true underlying terms, with low coefficient estimation error.
[LG-94] When Does LeJEPA Learn a World Model?
链接: https://arxiv.org/abs/2605.26379
作者: David Klindt,Yann LeCun,Randall Balestriero
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world’s latent variables from nonlinear observations, a property known as linear identifiability, in a broad class of worlds where latents evolve under stationary, additive-noise transitions. Our main result is that among all such worlds, the Gaussian is the unique latent distribution for which this guarantee holds. The forward direction rests on a spectral decomposition in which each degree of nonlinearity is strictly penalized by alignment, making the linear map the optimum; the converse rules out every non-Gaussian alternative. We further prove an approximate identifiability result where the guarantee degrades gracefully, and show that linear, orthogonal identifiability enables optimal latent-space planning. We validate the theory with experiments ranging from 2D examples to 1024-dimensional latents, including distributional ablations and pixel-based robotic control. Our theory turns an empirically successful recipe into a mathematical guarantee, providing the foundation for building World Models that provably recover the structure of the world.
[LG-95] Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows
链接: https://arxiv.org/abs/2605.26358
作者: Daniel Dehtyriov,Jonathan F. MacArt,Justin Sirignano
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:
Abstract:Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver’s implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by 2 - 4\times across Reynolds number, geometries, and flow regimes, with peak case-level reductions of 12\times . The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.
[LG-96] Beyond Differences: Doubly Robust Meta-Learners for Ratio-Based Treatment Effects
链接: https://arxiv.org/abs/2605.26288
作者: Michael Fuchs,Dominik Kreiss
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 13+5 pages, 5 figures, 6 tables. Code: this https URL
Abstract:When treatment effects are naturally expressed as ratios – as in medicine, pricing, and marketing – the ratio-based CATE \tau(x) = E[Y|W=1,X=x] / E[Y|W=0,X=x] is the appropriate estimand. Yet existing estimators either impose a log-linear parametric structure or apply generic regression without robustness guarantees for this functional. We introduce the Q-Learner, which decomposes \tau(x) into a product of two odds ratios, reducing ratio-CATE estimation for binary outcomes to two propensity classification tasks. We further derive doubly robust augmentations for both S/T- and Q-style ratio learners and characterize their distinct robustness properties. In benchmarks on seven RCT datasets, the Q-Learner is the most consistently competitive method in low-conversion regimes, where its propensity-only construction sidesteps the imbalanced regression that hurts outcome-based estimators. On four observational datasets, where propensity must be estimated and confounding cannot be ruled out, the DR learners introduced here decisively come out on top, making them practitioners’ natural default for confounded observational data.
[LG-97] Learning Nonlinear Factor Models with Unknown Monotone Links from Incomplete and Noisy Data
链接: https://arxiv.org/abs/2605.26271
作者: Yutong Chao,Resat Gökhan,Jalal Etesami,Ali Habibnia
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:
Abstract:We study a nonlinear factor model in which observed responses depend on low-rank latent factors through an unknown monotone link function. This setting is challenging and largely underexplored due to severe nonconvexity and identifiability issues. The link function is assumed to lie in a reproducing kernel Hilbert space (RKHS), enabling flexible nonparametric modeling while preserving identifiability. We formulate the problem as the joint recovery of the low-rank factors, loadings, and the nonlinear link function from possibly incomplete and noisy observations and propose a projected block coordinate descent (BCD) algorithm with explicit regularization to address scale and rotational ambiguities. Under mild incoherence of factors and standard sampling conditions, we establish convergence guarantees in both noiseless and noisy regimes, along with sublinear regret bounds for the link-function updates. Our results extend classical linear factor models to a broad nonlinear regime and provide a principled framework for learning nonlinear latent structures. We evaluate the proposed approach using controlled synthetic experiments, indicating promising performance.
[LG-98] Minimal surfaces Knots and Neural Networks
链接: https://arxiv.org/abs/2605.26234
作者: Tancredi Schettini Gherardini,Marco Usula
类目: Differential Geometry (math.DG); Machine Learning (cs.LG); Geometric Topology (math.GT)
*备注: 38 pages, 12 figures
Abstract:A recent conjecture by Joel Fine posits a relationship between the coefficients of the HOMFLY polynomial of a knot K in the 3-sphere S^3 , and the signed count of minimal surfaces in hyperbolic 4-space \mathrmH^4 meeting the sphere at infinity at K , with prescribed genus and self-intersection number. In this paper, we develop a novel machine learning framework based on Physics-Informed Neural Networks (PINNs) to solve the minimal surface equation in hyperbolic space. We utilise this framework to test Fine’s Conjecture by constructing near-minimal surfaces bounding various families of knots in S^3 . Furthermore, we develop an algorithmic method to find self-intersections and compute their sign. For every knot analysed, the computationally discovered minimal surfaces and their self-intersection numbers perfectly align with the predictions of Fine’s Conjecture, providing empirical evidence for it.
[LG-99] What Molecular Structure Cannot Tell Us: A Taxonomy of Explainability Gaps in GNN-Based Drug Toxicity Prediction
链接: https://arxiv.org/abs/2605.26183
作者: Juergen Dietrich
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 14 pages
Abstract:Graph Neural Networks (GNNs) have emerged as a structurally natural approach for molecular toxicity prediction, operating directly on atomic connectivity without the information loss inherent to fixed-length fingerprints. However, the fraction of a drug’s known pharmacological profile that is actually encodable in its molecular structure remains systematically underexplored. This study addresses this question through a systematic case study using acetylsalicylic acid (ASA, Aspirin) - one of the most comprehensively characterized drugs in pharmacology - as a model compound. A Message Passing Neural Network (MPNN) is trained on the Tox21 benchmark and GNNExplainer is applied to characterize atom-level attribution. Results indicate that molecular structure explains approximately 45% (5/11) of known ASA adverse effects. A four-category Gap Taxonomy (GAP-1 through GAP-4) is introduced distinguishing between principally non-encodable effects, data gaps arising from Missing Not At Random (MNAR) mechanisms, assay panel mismatches, and representation errors. The MNAR gap is empirically quantified via a systematic ChEMBL query (42 documented assays, 0 retrievable bioactivity entries). An attention pooling experiment localizes the representation error to the MPNN message passing layers rather than the aggregation step. The Gap Taxonomy has direct implications for drug safety signal detection workflows and regulatory frameworks including Good Pharmacovigilance Practice (GVP) guidelines and New Approach Methodologies (NAMs).
[LG-100] Rapid online deep artifact suppression for real-time spiral bSSFP CMR with blipped-CAIPI simultaneous multi-slice imaging at 1.5 T
链接: https://arxiv.org/abs/2605.26127
作者: Julius Åkesson,Iulius Dragonu,Einar Heiberg,Tina Yao,Rebecca Baker,Ruta Virsinskaite,Daniel Knight,Vivek Muthurangu,Jennifer Steeden
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注:
Abstract:Purpose: Real-time (RT) bSSFP MRI enables fast free-breathing cardiovascular imaging but requires 10-16 slices for functional assessment, resulting in prolonged scan times. Simultaneous multi-slice (SMS) imaging can reduce acquisition time but when combined with non-Cartesian trajectories, it relies on iterative reconstructions that preclude online use. This study investigates deep artifact suppression to facilitate rapid, online reconstruction of RT-SMS. Methods: A spiral bSSFP SMS RT sequence with two simultaneously acquired slices was implemented at 1.5 T. Reconstruction used slice separation in k-space, followed by deep artifact suppression in image space using a 3D U-Net. Ten healthy volunteers were imaged. RT-SMS image quality and reconstruction time were compared between deep artifact suppression and compressed sensing (CS) reconstructions. Left (LV) and right (RV) ventricular volumes at end diastole (EDV) and end systole (ESV) and LV mass (LVM) were compared between RT-SMS with deep artifact suppression and reference-standard breath-hold (BH) imaging. Results: The RT-SMS acquisition was ~13x faster than BH imaging (15 s vs 3 min 15 s). RT-SMS reconstruction using deep artifact suppression was ~50x faster than CS (30 s vs 24 min 55 s). Deep artifact suppression consistently outperformed CS in quantitative and qualitative image quality (p0.001). Functional agreement between BH and RT-SMS with deep artifact suppression was good (LVEDV: -7.5 +/- 6.8 ml, LVESV: -0.9 +/- 4.2 ml, RVEDV: -6.4 +/- 8.4 ml, RVESV: 0.2 +/- 10.7 ml, LVM: -10.3 +/- 11.0 g). Conclusion: Online deep artifact suppression reconstruction for RT-SMS bSSFP CMR enables free-breathing short-axis coverage with a substantial reduction in acquisition and reconstruction time while maintaining diagnostic image quality. Subjects: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Image and Video Processing (eess.IV) Cite as: arXiv:2605.26127 [physics.med-ph] (or arXiv:2605.26127v1 [physics.med-ph] for this version) https://doi.org/10.48550/arXiv.2605.26127 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Jennifer Steeden Dr [view email] [v1] Mon, 18 May 2026 16:08:48 UTC (2,429 KB)
[LG-101] Stochastic global optimization of continuous functions via random walks on Grassmannians
链接: https://arxiv.org/abs/2605.14151
作者: Kartik Gupta,Stephen D. Miller,Pradeep Ravikumar,Ramarathnam Venkatesan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 21 pages
Abstract:We introduce a stochastic global optimization method based on random walks on Grassmannian manifolds. To minimize a continuous objective \ell:\mathbbR^d\rightarrow\mathbbR , the method repeatedly samples random k -dimensional linear subspaces (with k\ll d ), solves the resulting low-dimensional restrictions of these problems to these subspaces using an arbitrary black-box optimizer, and updates the iterate (which monotonically improves upon the previous iterate). Unlike classical optimization analyses that rely on convexity, smoothness, Lipschitz bounds, or Polyak-Lojasiewicz-type conditions, our convergence guarantees depend only on the geometric distribution of restricted minima across the k -dimensional subspaces passing through a given point in \mathbbR^d . We identify a gap parameter – an analogue of a spectral gap for random walks – that controls the rate at which the iterates approach the global minimum value. Finally, we argue that the same analysis yields a blind-spot robustness property: sufficiently narrow, deep dips of the loss function (small-measure regions where \ell spikes downward) have limited influence on the algorithm’s trajectory, since they are unlikely to be encountered by random subspace sampling.
附件下载


