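arXiv Daily Papers | 2026-05-20

This post lists the latest papers fetched from arXiv.org on 2026-05-20. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: The daily paper data is fetched from arXiv.org and updated automatically every morning at around 12:30.

Tip: If the list has not been updated on a given day, either arXiv published no new papers that day or the script failed. Failures are fixed the same day whenever possible.

Natural Language Processing: 103 papers (Computation and Language (cs.CL))
Artificial Intelligence: 313 papers (Artificial Intelligence (cs.AI))
Computer Vision: 168 papers (Computer Vision and Pattern Recognition (cs.CV))
Machine Learning: 315 papers (Machine Learning (cs.LG))
Multiagent Systems: 21 papers (Multiagent Systems (cs.MA))
Information Retrieval: 31 papers (Information Retrieval (cs.IR))
Human-Computer Interaction: 22 papers (Human-Computer Interaction (cs.HC))

Multiagent Systems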

[MA-0] When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

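[Quick Read]: This paper addresses the following problem: Agent Skills (structured packages of procedural knowledge) loaded into generative AI agents are widely reported to raise task pass rates, yet the gains are inconsistent, and some tasks perform worse once Skills are introduced. Existing work has not clarified when Skills help and when they are merely redundant overhead. The key to the solution is a proposed explanatory variable: environment-feedback bandwidth. By re-analyzing a controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent, the paper finds that when the agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal, which substantially diminishes the marginal benefit of Skills; in some cases (e.g., the timing side-channel setting), Skills actively degrade performance. The paper frames this as a falsifiable hypothesis and sketches its design implications for compound AI systems.

Link: https://arxiv.org/abs/2605.20023
Authors: Samuel Jacob Chacko,James Hugglestone,Chashi Mahiul Islam,Xiuwen Liu
Affiliations: Florida State University
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted as a poster at ACM CAIS 2026 AgentSkills Workshop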

Abstract:Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains. Yet the same benchmarks show wide variance, with 16 of 84 tasks suffering negative deltas when Skills are introduced. The community has not yet articulated a clean mechanism for when Skills help and when they are merely redundant overhead. We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions of increasing richness (55, 1,478, 1,976, and 4,147 lines), and show that these conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation. In offensive cybersecurity, a domain not deeply covered by existing Skills benchmarks, the marginal benefit of Skills collapses. The spread between the no-Skills and full-Skills conditions is only 8.9 pp (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold). We argue that the missing variable is environment-feedback bandwidth. When an agent's tool layer returns strict, schema-validated, low-latency observations, the environment itself supplies the procedural correction signal that Skills are normally needed to provide. As a result, the marginal benefit of curated Skills diminishes substantially, and, in some cases (e.g., our timing side-channel setting), actively degrades performance. We articulate a falsifiable hypothesis, sketch its design implications for compound AI systems, and will release the reanalysis pipeline to support replication.
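
For readers unfamiliar with the effect-size measure quoted in the abstract, the following is a minimal sketch of Cohen's h for two proportions, using the standard arcsine transformation. The two pass rates below are hypothetical values chosen only so that their gap equals 8.9 percentage points; the abstract does not report the per-condition pass rates, so this does not reproduce the paper's numbers.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: absolute difference of arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)


# Hypothetical pass rates (assumption, not from the paper): 60.0% vs 68.9%.
h = cohens_h(0.600, 0.689)
print(f"h = {h:.3f}, below small-effect threshold 0.2: {h < 0.2}")
```

Under these assumed rates the gap stays under the conventional 0.2 cutoff, which illustrates why a spread of this size is usually read as smaller than a "small" effect.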

[MA-1] Equilibria in Multiplayer Graph Games: An Algorithmic Study

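[Quick Read]: This paper addresses how to identify robust equilibria in multiplayer games, a setting needed when there are multiple systems with different objectives or an environment composed of several agents, where the classical two-player zero-sum model no longer applies. The key to the solution is a formal study of five equilibrium notions and, for each of them, complexity results for the constrained existence problem, i.e., deciding whether a given game contains an equilibrium that ensures each player a payoff within a specified interval. These results provide theoretical support for verifying the robustness of programs and protocols.

Link: https://arxiv.org/abs/2605.19954
Authors: Léonard Brice
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Formal Languages and Automata Theory (cs.FL); Multiagent Systems (cs.MA)
Comments: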

Abstract:To verify the robustness of a program or protocol, it is common in the computer science community to rely on the theoretical framework of game theory. In particular, if one seeks to enforce a desired property, or specification, despite an unpredictable environment, a useful abstraction is to model the situation as a two-player zero-sum game. The goal is then to find a strategy for the system that guarantees the specification against any strategy of the environment. However, to model more complex situations, such as multiple systems with different objectives or an environment composed of various agents, the richer framework of multiplayer games must be considered. In this setting, a natural question is to identify equilibria, i.e., strategy profiles that are robust in the sense that no player has an incentive to deviate. The most well-known equilibrium concept is the Nash equilibrium, but several alternatives exist. We study five such notions and, for each of them, we provide complexity results for the constrained existence problem, which consists of deciding whether a given game contains an equilibrium that ensures each player a payoff within a specified interval.

[MA-2] LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions

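[Quick Read]: This paper addresses the following problem: as LLM agents participate in online discussions at scale, classical opinion-dynamics models, which assume human participants with bounded rationality and limited coordination, no longer hold, and collective beliefs become programmable, i.e., open to deliberate steering. The key contributions are controlled multi-agent simulations giving proof-of-concept evidence that coordinated AI agents can induce measurable belief shifts that stabilize within a few interaction rounds, and the identification of four structural properties (indistinguishability, persistence, contextuality, and configurability) that make detection and defense difficult. On this basis the paper outlines a conceptual foundation and a research agenda for future work.

Link: https://arxiv.org/abs/2605.19915
Authors: Xin He,Junxi Shen,Yuchen Mou,David M. Bossens,Caishun Chen,Ivor W. Tsang,Yew Soon Ong
Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research (ASTAR), Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research (ASTAR), Singapore; College of Computing and Data Science, Nanyang Technological University, Singapore; College of Design and Engineering, National University of Singapore, Singapore
Categories: Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
Comments: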

Abstract:Classical models of opinion dynamics assume human participants with bounded rationality and limited coordination. The rise of LLM-based agents introduces a qualitative shift: agents can now participate in online discussions at scale, maintain consistent persuasion strategies, and coordinate systematically. This paper argues that LLM agents make collective belief dynamics programmable, enabling deliberate steering of population-level beliefs. We term this emerging problem programmable collective belief control. Through controlled multi-agent simulations, we provide proof-of-concept evidence that coordinated AI agents can induce measurable belief shifts that stabilize within a few interaction rounds. We identify four structural properties (indistinguishability, persistence, contextuality, and configurability) that make detection and defense fundamentally difficult. Based on these findings, we outline a research agenda spanning theoretical foundations for adversarial belief dynamics, operational methods for system-level detection and intervention, and simulation infrastructure for scalable experimentation. Our goal is not to present a complete solution, but to articulate why this problem demands urgent attention and to provide a conceptual foundation for future work.

[MA-3] DAG-Based QoS-Aware Dynamic Task Placement for Networked Multi-Stage Control Pipelines

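[Quick Read]: This paper targets the bottleneck that sensing-perception-planning-control pipelines in industrial networked robotics face when choosing between local execution and static edge offloading. Current Physical AI (PAI) relies on closed-loop visual-servoing pipelines whose perception and planning stages are computationally intensive because of the complex models they embed; monolithic local execution saturates the hardware, while static edge offloading exposes the control loop to network jitter. The key to the solution is a directed acyclic graph (DAG)-based, quality-of-service (QoS)-aware Dynamic Task Placement (DTP) framework. It models the pipeline as a DAG with task-level and node-level attributes (compute cost, communication delay, feasible placement sets), defines a window-based cost function that combines tail end-to-end latency, deadline violation rate, hardware utilization, and a Hamming-distance switching penalty, and adds a DTP algorithm with hysteresis and a minimum dwell time to bound placement chatter. The work follows a Control-Communication-Computing (3C) co-design approach; as a Work-in-Progress paper it presents the theoretical framework, a qualitative analysis, and a validation roadmap rather than measured results.

Link: https://arxiv.org/abs/2605.19887
Authors: Thien Tran,Jonathan Kua,Thuong Hoang,Minh Tran,Yuemin Ding,Jiong Jin
Affiliations: University of Technology Sydney; National Institute of Information and Communications Technology; University of South Australia; Queensland University of Technology
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: 4 pages, 1 figure, 1 algorithm, accepted as a Work-in-Progress (WiP) paper at the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July 2026, Melbourne, Australia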

Abstract:Current Physical AI (PAI) relies heavily on closed-loop visual-servoing pipelines, whose perception and planning stages may become computationally intensive onboard due to complex models embedded on robots. In practice, offloading the perception task to on-site edges statically is inappropriate for latency-sensitive, precise industrial settings over a standardized industrial network. This emphasizes the importance of Control-Communication-Computing (3C) co-design in industrial automation: monolithic local execution saturates AI-accelerated machine and robot hardware, while static edge offloading exposes the control loop to network jitter. Existing adaptive task placement (ATP) controllers can partially address the gap by relocating a single pipeline stage on binary threshold rules, without a multi-stage model and an explicit cost on placement switching. In this Work-in-Progress (WiP) paper, we propose a directed acyclic graph (DAG) based quality-of-service (QoS)-aware dynamic task placement (DTP) framework for sensing-perception-planning-control pipelines in networked robotics. This pipeline is formalized as a DAG with task-level and node-level attributes for compute cost, communication delay, and feasible placement sets; over a small interpretable candidate set (fully local, static offload, hybrid), a window-based cost function combines tail end-to-end latency, deadline violation rate, hardware utilization, and a Hamming-distance switching penalty, and a DTP algorithm with hysteresis and a minimum dwell-time bounds placement chatter. Our WiP paper presents the theoretical framework, a structured qualitative analysis, and a two-phase simulation plus hardware-in-the-loop validation roadmap.
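
The placement rule described in the abstract can be illustrated with a rough sketch. Everything concrete here is an assumption made for illustration: the bit encoding of placements, the three candidate tuples, the weights, the hysteresis margin, and the dwell length are hypothetical, and the abstract does not give the paper's actual cost formula.

```python
from dataclasses import dataclass

# One bit per stage (sensing, perception, planning, control): 0 = on-robot, 1 = edge.
# The candidate set and its encoding are illustrative assumptions.
LOCAL = (0, 0, 0, 0)
STATIC_OFFLOAD = (0, 1, 0, 0)
HYBRID = (0, 1, 1, 0)


@dataclass
class WindowStats:
    """Metrics observed (or predicted) for one placement over the last window."""
    tail_latency_ms: float
    deadline_violation_rate: float
    utilization: float


def hamming(a, b):
    """Number of stages whose placement differs between a and b."""
    return sum(x != y for x, y in zip(a, b))


def window_cost(stats, candidate, current, weights=(1.0, 100.0, 10.0, 5.0)):
    """Weighted sum of the four terms named in the abstract (weights are made up)."""
    w_lat, w_viol, w_util, w_switch = weights
    return (w_lat * stats.tail_latency_ms
            + w_viol * stats.deadline_violation_rate
            + w_util * stats.utilization
            + w_switch * hamming(candidate, current))


def choose_placement(current, dwell, stats_by_candidate, min_dwell=3, hysteresis=5.0):
    """Return (placement, dwell) for the next window.

    A switch happens only after min_dwell windows in the current placement and
    only if the best candidate beats the current cost by more than the margin.
    """
    if dwell < min_dwell:
        return current, dwell + 1
    costs = {c: window_cost(s, c, current) for c, s in stats_by_candidate.items()}
    best = min(costs, key=costs.get)
    if best != current and costs[current] - costs[best] > hysteresis:
        return best, 0
    return current, dwell + 1
```

The dwell counter and the hysteresis margin play different roles: the first limits how often a switch can occur, the second prevents switching for gains that are too small to matter, and together they are one simple way to keep placements from oscillating.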

[MA-4] Operationalising Artificial Intelligence Bills of Materials (AIBOMs) for Verifiable AI Provenance and Lifecycle Assurance

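[Quick Read]: This paper addresses the reproducibility, transparency, and security-assurance challenges that AI systems face in complex, multi-layered software supply chains. The key to the solution is an Artificial Intelligence Bill of Materials (AIBOM) schema that extends the CycloneDX standard to capture AI-specific provenance, model lineage, and disclosure metadata, providing verifiable software provenance through structured schema engineering, cryptographic validation, and agent-driven automation. An autonomous AI pipeline performs continuous environment inspection, vulnerability enrichment, and reproducibility auditing. The reported evaluation shows 98.7% reproducibility fidelity, 96.2% vulnerability match precision, and a 63% reduction in manual oversight across containerised analytic workflows.

Link: https://arxiv.org/abs/2605.19755
Authors: Petar Radanliev,Omar Santos,Carsten Maple,Kay Atefi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: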

Abstract:Artificial Intelligence (AI) systems are increasingly dependent on complex, multi-layered software supply chains that introduce challenges for reproducibility, transparency, and security assurance. This study presents an Artificial Intelligence Bill of Materials (AIBOM) schema extending the CycloneDX standard to capture AI-specific provenance, model lineage, and disclosure metadata. The framework provides a formalised approach to verifiable software provenance through structured schema engineering, cryptographic validation, and agent-driven automation. An autonomous AI pipeline is developed to perform continuous environment inspection, vulnerability enrichment, and reproducibility auditing using machine-verifiable provenance chains. Empirical evaluation demonstrates 98.7% reproducibility fidelity, 96.2% vulnerability match precision, and a 63% reduction in manual oversight across containerised analytic workflows. These results confirm the feasibility of automated provenance assurance and reproducible AI lifecycle validation. The AIBOM framework advances the scientific foundations of software supply chain transparency and AI reproducibility engineering, offering a generalisable methodology for securing AI systems, strengthening provenance integrity, and supporting compliance with international information security standards.

[MA-5] Memory-Augmented Reinforcement Learning Agent for CAD Generation

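[Quick Read]: This paper addresses the limitations of LLM-based methods for automatic CAD model generation on complex models with long operation sequences, diverse operation types, and strong geometric constraints, where reasoning chains break and effective error-correction mechanisms are lacking. The key to the solution is a memory-augmented reinforcement learning framework that encapsulates the underlying geometric kernel into a structured toolchain and builds a closed loop of design-intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, together with a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent avoids retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. The experiments are reported to show significant improvements in success rate and geometric consistency on complex CAD generation tasks.

Link: https://arxiv.org/abs/2605.19748
Authors: Yin Xiaolong,Liu Yu,Shen Jiahang,Lu Xingyu,Ni Jingzhe,Fan Fengxiao,Sang Fan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 26 pages; multilingual submission: English version first, followed by Chinese version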

Abstract:Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.

[MA-6] EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

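[Quick Read]: This paper addresses the lack of adequate evaluation for Large Language Model (LLM) agents on engineering design tasks, in particular the absence of a systematic evaluation framework for multi-agent systems (MAS) that combine simulation, retrieval, and manufacturing preparation. The key to the solution is a benchmark suite with three dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands (direct tool use, semantic disambiguation, conditional branching, and working-memory tasks); (2) a Retrieval-Augmented Generation (RAG) benchmark whose gated scoring isolates the contribution of retrieval to parameter selection; and (3) a High Performance Computing (HPC) benchmark that evaluates end-to-end ML training orchestration on a SLURM cluster. The paper also presents EngiAI, a MAS reference implementation built on LangGraph that coordinates seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. In the experiments, proprietary models reach 96-97% average task completion on Beams2D and open-source 4B-parameter models reach 55-78%; conditional branching is the hardest style, with completion dropping to 20-53% on Photonics2D.

Link: https://arxiv.org/abs/2605.19743
Authors: Gioele Molinari,Florian Felten,Soheyl Massoudi,Mark Fuge
Affiliations: ETH Zurich; Autom8.build
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 26 pages, 10 figures, to be published at IDETC 2026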

Abstract:Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores (≈ 1.0) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

[MA-7] PAVE: A Cognitive Architecture for Legitimate Violation in Generative Agent Societies

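[Quick Read]: This paper addresses how generative agents based on large language models (LLMs) should reason in situations where rule-breaking may be required, such as fire evacuation or an authority-supervised emergency, a setting that remains poorly characterized. The key to the solution is PAVE (Perception, Assessment, Verdict, Emulation), a four-module cognitive architecture that makes rule violation controlled, interpretable, and legitimate through structured perception, multi-dimensional assessment, a legitimacy-gated verdict, and scope-limited enactment. The legitimacy judgment acts as a hard gate: a violation is permitted only when the trigger satisfies necessity, proportionality, and absence of alternatives. Officer instructions still take precedence, and violations are confined to the rule the trigger justifies and end once the trigger ends. Across four properties (legitimate violation, authority deference, bounded scope, and recovery), PAVE agents make more structured and interpretable decisions than vanilla agents.

Link: https://arxiv.org/abs/2605.19351
Authors: Ahmad Yehia,Abduallah Mohamed,Kun Qian,Tianyi Wang,Jiseop Byeon,Omar Hassanin,Christian Claudel
Affiliations: The University of Texas at Austin; Meta Reality Labs; University of Calgary
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. 23 pages, 4 figures. Code and environment will be released upon publication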

Abstract:Generative agents based on large language models reproduce believable human behavior in cooperative settings, but how they should reason in situations where rule-breaking may be required, such as fire evacuation or authority-supervised emergency, remains poorly characterized. We propose PAVE (Perception, Assessment, Verdict, Emulation), a novel four-module cognitive architecture that addresses this gap end to end: (i) Perception extracts a structured context with explicit authority distance, peer behaviors, and severity-tagged situational cues; (ii) Assessment scores the context along five scalars including an explicit legitimacy judgment that checks necessity, proportionality, and absence of alternatives; (iii) Verdict decides to comply or violate under a hard legitimacy gate, with a per-agent threshold elicited from the persona; (iv) Emulation enacts the verdict and scopes the violation to the rule the trigger justifies. We instantiate PAVE in Voville, a tile-based traffic environment forked from Smallville, and evaluate across three scenarios, four LLM backbones, and a focused ablation. PAVE agents satisfy four properties simultaneously: legitimate violation (only when a trigger justifies it), authority deference (officer instructions override even high legitimacy), bounded scope (violations confined to the targeted rule), and recovery (baseline restored once the trigger ends). PAVE agents make more structured and interpretable decisions than vanilla across all four properties, and human evaluators rate them as more plausible. Ablating the legitimacy gate reproduces vanilla-like failures. We release Voville, the PAVE prompts and code, and the evaluation pipeline.
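
The Verdict step can be pictured with a small sketch. This is only one possible reading of the abstract: the field names, the `urgency` scalar, the string return values, and the way the per-agent threshold is applied are all hypothetical, since the abstract names the three legitimacy checks and the authority override but does not give the actual decision rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assessment:
    """Hypothetical subset of the Assessment output (names are assumptions)."""
    necessity: bool
    proportionality: bool
    no_alternative: bool
    urgency: float  # assumed scalar in [0, 1] compared against the persona threshold


def verdict(a: Assessment, threshold: float,
            officer_instruction: Optional[str] = None) -> str:
    """Return "comply" or "violate" under a hard legitimacy gate."""
    # Authority deference: an officer's instruction overrides everything else.
    if officer_instruction is not None:
        return officer_instruction
    # Hard gate: all three legitimacy checks must hold, otherwise comply.
    if not (a.necessity and a.proportionality and a.no_alternative):
        return "comply"
    # Per-agent threshold (assumed here to apply to the urgency scalar).
    return "violate" if a.urgency >= threshold else "comply"
```

In this sketch a high urgency score alone never produces a violation, which is the sense in which the gate is "hard" rather than one weighted factor among several.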

[MA-8] STAR-PólyaMath: Multi-Agent Reasoning under Persistent Meta-Strategic Supervision

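[Quick Read]: This paper addresses three core reliability problems that frontier generative AI models and multi-agent systems face on mathematical problems requiring extended, long-horizon reasoning: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. The key to the solution is STAR-PólyaMath, a multi-agent framework built on meta-level supervision and structured Reasoner-Verifier interaction. It is organized as an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that separates control from inference and bounds error propagation through trace-back and re-planning. The central innovation is a persistent Meta-Strategist that maintains cross-attempt memory and issues high-level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over-rely on tools.

Link: https://arxiv.org/abs/2605.19338
Authors: Jiaao Wu,Xian Zhang,Hanzhang Liu,Sophia Zhang,Fan Yang,Yinpeng Dong
Affiliations: Tsinghua University; Microsoft Research; New York University; MIT
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 4 figures. Code: this https URL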

Abstract:Frontier AI models and multi-agent systems have led to significant improvements in mathematical reasoning. However, for problems requiring extended, long-horizon reasoning, existing systems continue to suffer from fundamental reliability issues: hallucination accumulation, memory fragmentation, and imbalanced reasoning-tool trade-offs. In this paper, we introduce STAR-PólyaMath, a multi-agent framework that systematically addresses these challenges through meta-level supervision and structured Reasoner-Verifier interaction. STAR-PólyaMath is structured as an orchestrated state machine with nested challenge-step-replan loops, governed by a reasoning-free Python orchestrator that separates control from inference and bounds error propagation through trace-back and re-planning. Our key innovation is a persistent Meta-Strategist that maintains cross-attempt memory and exercises meta-level control by issuing high-level strategic guidance or mandatory directives, so the system can escape unproductive loops rather than stagnate or over-rely on tools. STAR-PólyaMath achieves state-of-the-art results on all eight top-tier competition benchmarks: AIME 2025-2026, MathArena Apex Shortlist, MathArena Apex 2025, Putnam 2025, IMO 2025, HMMT February 2026, and USAMO 2026. It obtains perfect scores on AIMEs, Putnam, and HMMT, and shows its largest margin on Apex 2025, scoring 93.75% compared with 80.21% by the strongest baseline GPT-5.5. Ablation studies show that the gains arise from the framework’s orchestration rather than from model-level diversity since removing key components or substituting in mixed backbones consistently weakens performance. Code is available at this https URL.

[MA-9] Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

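[Quick Read]: This paper addresses the power imbalances created by stake-weighted voting in Proof-of-Stake (PoS) blockchains, where a few users holding large stakes may control decision making disproportionately, even without owning all of the stake. The key to the solution is to quantify this imbalance with the Penrose-Banzhaf power index, through the lens of computational social choice, and to analyze it both analytically and empirically. Analytically, the paper shows that a perfect alignment between power and relative stake ownership is generally unattainable but can be approximated in expectation under specific conditions. Empirically, it uses data from a real-world on-chain governance system (Project Catalyst) to characterize the power imbalances likely to occur in current stake-weighted governance systems.

Link: https://arxiv.org/abs/2605.19264
Authors: Yuzhe Zhang,Manvir Schneider,Qin Wang,Davide Grossi
Affiliations: Independent researcher; Cardano Foundation; CSIRO; University of Groningen and University of Amsterdam
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: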

Abstract:Voting methods weighted by stakes are the fundamental governance paradigm in Proof-of-Stake (PoS) blockchains. Such a paradigm is known to be prone to power distortions: a few users possessing large stakes may completely control decision making, even without owning the totality of the stakes. We study this phenomenon through the lens of computational social choice, focusing on the extent of power imbalances in stake-weighted voting when power is quantified using the Penrose-Banzhaf power index. Our work presents both analytical and empirical contributions. Analytically, we demonstrate that while a perfect alignment between power and relative stake ownership is generally unattainable, it can be approximated in expectation under specific conditions. Empirically, using data from a real-world on-chain governance system (Project Catalyst), we provide a more fine-grained understanding of the power imbalances that are likely to occur in current stake-weighted governance systems.
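
To make the gap between stake and power concrete, here is a minimal brute-force sketch of the absolute Penrose-Banzhaf index for a weighted voting game: for each voter, the fraction of coalitions of the other voters that the voter turns from losing to winning. The stakes and quota in the usage note are toy values, not data from the paper, and the enumeration is exponential, so it only suits small examples.

```python
from itertools import combinations


def banzhaf(weights, quota):
    """Absolute Penrose-Banzhaf index of each voter in a weighted voting game.

    A coalition wins when its total weight reaches the quota. Voter i is a
    swing for a coalition of the others if adding i turns it from losing to
    winning. The index is swings / 2**(n - 1).
    """
    n = len(weights)
    swings = [0] * n
    for i in range(n):
        others = [w for j, w in enumerate(weights) if j != i]
        for size in range(n):  # coalition sizes 0 .. n-1 among the others
            for coalition in combinations(others, size):
                total = sum(coalition)
                if total < quota <= total + weights[i]:
                    swings[i] += 1
    return [s / 2 ** (n - 1) for s in swings]


print(banzhaf([50, 30, 20], quota=51))
```

With toy stakes of 50, 30, and 20 and a quota of 51, the indices come out as 0.75, 0.25, and 0.25: the two smaller holders have equal power despite unequal stakes, and the largest holder's share of total power (0.6 after normalizing) exceeds its 50% stake.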

[MA-10] AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

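[Quick Read]: This paper addresses the non-uniform spatial information density of the high-resolution GUI screenshots fed to Large Multimodal Models (LMMs) in GUI-agent tasks: large regions are visually homogeneous and carry little information, while key text and icons require high visual fidelity. Existing methods either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. The key to the solution is AQuaUI, a training-free, inference-time token reduction method that builds an adaptive quadtree on each screenshot and keeps one representative merged token per leaf. To improve temporal consistency across multi-step interactions, a conditional quadtree algorithm refines the current quadtree using the quadtrees of previous screenshots as references, preserving fine-grained regions across static or mildly shifted GUI states. Without retraining, AQuaUI improves the accuracy-efficiency trade-off; on GUI-Owl-1.5-32B-Instruct it achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance.

Link: https://arxiv.org/abs/2605.19260
Authors: Yuankai Li,Tinghui Zhu,Ha Min Son,Zhe Zhao,Xin Liu,Muhao Chen
Affiliations: UC Davis
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: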

Abstract:Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
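
The general idea of an adaptive quadtree over a patch grid can be sketched as follows. This is a generic illustration, not the paper's algorithm: the split criterion (value range above a threshold), the use of one scalar per patch, mean merging, and the power-of-two grid size are all assumptions made to keep the example short.

```python
def build_quadtree(grid, x, y, size, threshold, leaves):
    """Recursively split a square region until it is roughly homogeneous.

    grid holds one scalar per patch (a stand-in for a patch feature). Each
    leaf is recorded as (x, y, size, merged_value), so the region's position
    is kept alongside its single merged token.
    """
    values = [grid[r][c] for r in range(y, y + size) for c in range(x, x + size)]
    if size == 1 or max(values) - min(values) <= threshold:
        leaves.append((x, y, size, sum(values) / len(values)))
        return
    half = size // 2
    for dy in (0, half):
        for dx in (0, half):
            build_quadtree(grid, x + dx, y + dy, half, threshold, leaves)


# Toy 4x4 "screenshot": a detailed top-left corner on a flat background.
grid = [
    [1, 9, 0, 0],
    [5, 3, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
leaves = []
build_quadtree(grid, 0, 0, 4, threshold=1, leaves=leaves)
print(len(leaves), "tokens kept out of", 4 * 4)
```

On this toy grid the detailed corner stays at full resolution (four 1x1 leaves) while each flat quadrant collapses into a single leaf, so 7 tokens stand in for the original 16.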

[MA-11] CASPIAN: Online Detection and Attribution of Cascade Attacks in LLM Multi-Agent Systems via Cross-Channel Causal Monitoring

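[Quick Read]: This paper addresses the detection of cascade attacks in LLM multi-agent systems (LLM-MAS), in which adversarial influence propagates through agent interactions and escalates into system-level failures. Existing defenses are largely local and text-centric, so they miss the cross-channel, temporally coordinated dynamics of cascade propagation. The key to the solution is CASPIAN, presented as the first framework for unified, cross-channel causal analysis of cascade behavior in LLM-MAS. It monitors dynamic influence propagation across agents online and efficiently estimates a unified dynamic causal influence matrix via a late-interaction conditional transfer entropy (LI-CTE) formulation, so that cascade onset is detected from emergent system-level structure rather than isolated anomalies. It also performs online causal attribution, identifying the origin, bridge, and amplifier agents that drive the cascade and reconstructing its principal propagation pathways. Across multiple multi-agent frameworks and benchmarks, CASPIAN is reported to outperform semantic guardrails, LLM-based judges, and graph-based anomaly detectors in detection accuracy and early cascade identification, with relative latency overhead below 1%.

Link: https://arxiv.org/abs/2605.19240
Authors: Kavana Venkatesh,Jafar Isbarov,Saad Amin,Murat Kantarcioglu,Jiaming Cui
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA)
Comments: this https URL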

Abstract:Cascade attacks in LLM multi-agent systems (MAS) arise when adversarial influence propagates across agents and leads to escalated system-level failures through complex agent interactions. Detecting such cascades is challenging, as their signals are distributed, tightly coupled across interaction channels, and often appear plausibly benign locally but may unfold quickly either within a single turn or gradually across multiple turns. Existing defenses, being largely local and text-centric, fail to capture such cross-channel, temporally coordinated dynamics of cascade propagation. Therefore, we propose CASPIAN, the first framework that provides a unified, cross-channel causal analysis of cascade behavior in LLM-MAS through online monitoring of dynamic influence propagation across agents. CASPIAN models multi-agent interactions using a unified, dynamic causal influence matrix across channels, estimated efficiently via a late-interaction conditional transfer entropy (LI-CTE) formulation, thereby enabling the detection of cascade onset from emergent system-level structure rather than isolated anomalies. It further performs online causal attribution, identifying the origin, bridge, and amplifier agents driving the cascade and reconstructing its principal propagation pathways, capabilities not supported by existing methods. Across diverse multi-agent frameworks and benchmarks, CASPIAN consistently outperforms semantic guardrails, LLM-based judges, and graph-based anomaly detectors in both detection accuracy and early cascade identification while operating with sub-1% relative overhead latency. These results demonstrate that unified cross-channel causal modeling is essential for reliably detecting and understanding cascade failures in LLM multi-agent systems.

[MA-12] Graph Neural Planning and Predictive Control for Multi-Robot Communication-Constrained Unlabeled Motion Planning

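[Quick Read]: This paper addresses the multi-robot unlabeled motion planning problem, i.e., concurrently assigning robots to goals and generating safe trajectories in collaborative tasks. Recent Graph Neural Network (GNN) methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking real-world challenges such as dynamic feasibility and communication constraints. The key to the solution is a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC): GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. Simulation and real-world quadrotor experiments show improved generalization to larger teams, robustness to communication delays of up to 200 ms, and practical feasibility with decentralized on-board inference.

Link: https://arxiv.org/abs/2605.19209
Authors: Manohari Goarin,Yang Zhou,Giuseppe Loianno
Affiliations: New York University; University of California Berkeley
Categories: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 8 pages, 6 figures, Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2026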

Abstract:The multi-robot unlabeled motion planning problem of concurrently assigning robots to goals and generating safe trajectories is central in many collaborative tasks. Recent Graph Neural Network methods offer scalable decentralized solutions but rely on simplified dynamics and simulation environments, overlooking key challenges of real-world deployment such as dynamic feasibility and communication constraints. To address these gaps, we propose a hierarchical framework that combines a Graph ATtention Planner (GATP) with a decentralized Nonlinear Model Predictive Controller (NMPC). GATP provides intermediate subgoals through multi-robot cooperation, and the NMPC enforces safety under nonlinear dynamics and actuation constraints. We evaluate our framework in both simulation and real-world quadrotor experiments. Thanks to attention mechanisms and minimal communication requirements, we demonstrate improved generalization to larger teams, robustness to communication delays up to 200 ms and practical feasibility with decentralized on-board inference.

[MA-13] How Far Are We From True Auto-Research?

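[Quick Read]: This paper addresses the following problem: auto-research systems can now produce complete papers, but the quality of their output has not been studied systematically, in particular the experimental rigor of agent-generated papers. The key to the solution is ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code, Codex, and Kimi Code) carry out the full research loop (ideation, experimentation, paper writing, and self-refinement) under only lightweight guidance. Each paper is evaluated through three lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human meta-review. The picture is optimistic under manuscript-only review, but scores drop sharply once artifacts are inspected, and manual auditing identifies experimental rigor as the main bottleneck, with three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that depend strongly on each agent's research persona. None of the 117 papers reaches the acceptance bar of a top-tier venue, indicating that a substantial gap remains.

Link: https://arxiv.org/abs/2605.19156
Authors: Zhengxin Zhang,Ning Wang,Sainyam Galhotra,Claire Cardie
Affiliations: Cornell University
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: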

Abstract:Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and a human-conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR, scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex 5%/8% paper-vs-artifact mismatch / fabricated references versus Kimi Code 77%/72%, a ~15× spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that a substantial gap still separates current systems from true auto-research.

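[MA-14] DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

[Quick read]: This paper addresses the evaluation of emergent delegation in long-horizon agentic workflows, namely how to systematically measure the quality, efficiency, and strategic soundness of agent decisions when multiple models collaborate. The key to its solution is DecisionBench, a benchmark substrate that standardizes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models from 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite (quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling). This design lets different delegation strategies (learned routers, richer peer memories, adaptive profile construction, and so on) be compared fairly within one framework. A five-condition reference sweep shows that evaluating end-task quality alone would miss the delegation signal; that routing fidelity varies widely at near-equal quality, so the delivery channel (on-demand tool call vs. preloaded description) matters more than the description content; and that measured performance sits 15 to 31 percentage points below the counterfactual delegation ceiling, leaving large headroom for future orchestration methods.

Link: https://arxiv.org/abs/2605.19099
Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu
Affiliations: OpenMesh AI; University of Pennsylvania; Columbia University; Stanford University; MIT
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 28 pages, 9 figures, 11 tables. Code and data: this https URL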

Abstract:We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| = 0.010, p = 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.
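
To make the counterfactual-delegation ceiling in finding (iii) concrete, here is a minimal Python sketch. The abstract does not give the formula, so the sketch assumes the ceiling is the mean, over tasks, of the best score any peer model achieves on that task. The function names and the toy scores are hypothetical and are not taken from the DecisionBench release.

```python
def counterfactual_ceiling(quality):
    """Mean, over tasks, of the best score any peer model reaches on that task.

    `quality` maps task id -> {model name -> score in [0, 1]}.
    Assumption: this is one plausible reading of a "perfect delegation" ceiling.
    """
    best_per_task = [max(scores.values()) for scores in quality.values()]
    return sum(best_per_task) / len(best_per_task)


def headroom_pp(quality, measured_quality):
    """Gap between the ceiling and the measured quality, in percentage points."""
    return 100 * (counterfactual_ceiling(quality) - measured_quality)


# Hypothetical scores for four tasks and two peer models.
toy_quality = {
    "t1": {"model_a": 1.0, "model_b": 0.0},
    "t2": {"model_a": 0.0, "model_b": 1.0},
    "t3": {"model_a": 0.0, "model_b": 0.0},
    "t4": {"model_a": 1.0, "model_b": 1.0},
}

print(counterfactual_ceiling(toy_quality))  # 0.75
print(headroom_pp(toy_quality, 0.5))        # 25.0
```

In this toy setting an orchestrator that always picked the best peer would solve three of four tasks, so a system measured at 0.5 would have 25 percentage points of unrealized headroom.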

[MA-15] RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning CVPR2026

[Quick read]: This paper addresses the fact that supervised open-loop training fails to capture the dynamic interactions among multiple agents in complex driving scenarios, which limits the realism of traffic simulation models. The key to its solution is RLFTSim, a reinforcement-learning-based fine-tuning framework with a reward that balances fidelity and controllability, so that simulator rollouts align with real-world data distributions and goal-conditioned controllability can be distilled. Compared with heuristic search-based fine-tuning methods, RLFTSim uses a low-variance, dense reward signal, needs significantly fewer samples, and addresses realism alignment directly by design.

Link: https://arxiv.org/abs/2605.19033
Authors: Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee
Affiliations: University of Alberta; Huawei Technologies Canada; York University; Canada CIFAR AI Chair, Amii
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: CVPR 2026 Highlight; Project page at this https URL

Abstract:Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at this https URL.

[MA-16] Nash Welfare in Additively Separable Hedonic Games

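[Quick read]: This paper studies the maximization of Nash welfare in additively separable hedonic games (ASHGs). Prior work has focused mainly on utilitarian welfare (the sum of utilities), whereas Nash welfare, an economic measure that balances fairness with efficiency and is scale invariant, had been entirely overlooked in ASHGs. The main contributions are as follows. First, partitions with high Nash welfare have desirable properties: in symmetric games they guarantee contractual Nash stability, even for any approximation of Nash welfare. Second, packing-based approximation algorithms achieve ratios of $n-1$ for symmetric aversion to enemies games (AEGs) and $2n$ for appreciation of friends games. Third, a strict inapproximability result shows that it is NP-hard to approximate Nash welfare within a factor of $1.0000759$ in general ASHGs. The paper also draws the complexity boundary when the coalition size or the number of coalitions is bounded: a bound of 2 admits polynomial-time solvability, whereas bounds of 3 or more yield NP-hardness or unbounded inapproximability.

Link: https://arxiv.org/abs/2605.19030
Authors: Marta Pagano, Alexander Schlenga
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments: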

Abstract:Additively separable hedonic games (ASHGs) are a prominent model of coalition formation where agents’ preferences are derived from their individual valuations of peers. While social welfare maximization in ASHGs has traditionally focused mostly on utilitarian welfare, Nash welfare – a well-established metric in economics which balances fairness with efficiency and offers scale invariance – has been entirely overlooked. In this paper, we initiate the study of Nash welfare in ASHGs. We point out desirable properties fulfilled by partitions with high Nash welfare. This includes guaranteed contractual Nash stability in symmetric games, even for any approximation of Nash welfare. This is particularly appealing since, as for other welfare notions, Nash welfare turns out to be NP-hard to maximize, even for the ASHG subclass of symmetric aversion to enemies games (AEGs). A main focus of our study is on approximation algorithms for the Nash welfare objective. We present packing-based algorithms with approximation ratios for well-established subclasses of ASHGs: n-1 for AEGs and 2n for appreciation of friends games. This is complemented by a strict inapproximability result showing it is NP-hard to approximate Nash welfare within a factor of 1.0000759 in general ASHGs. Further, we investigate the restricted settings with an upper bound on the coalition size or number of coalitions, and draw the boundary between the cases admitting efficient algorithms and those yielding NP-hardness: bounding the allowed size or number of coalitions by 2 admits polynomial-time solvability, whereas bounds of 3 or more yield NP-hardness or unbounded inapproximability.
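
The difference between utilitarian and Nash welfare can be illustrated with a small Python sketch. It assumes the usual fair-division definition of Nash welfare as the product of agents' utilities; the abstract does not say how the paper treats zero or negative utilities, so the example uses non-negative valuations only. The game and the function names are hypothetical.

```python
from math import prod


def utilities(valuations, partition):
    """Utility of each agent: the sum of its valuations of its coalition mates."""
    result = {}
    for coalition in partition:
        for i in coalition:
            result[i] = sum(valuations[i][j] for j in coalition if j != i)
    return result


def utilitarian_welfare(valuations, partition):
    """Sum of all agents' utilities."""
    return sum(utilities(valuations, partition).values())


def nash_welfare(valuations, partition):
    """Product of all agents' utilities (assumes they are non-negative)."""
    return prod(utilities(valuations, partition).values())


# Hypothetical symmetric game on four agents: valuations[i][j] == valuations[j][i].
valuations = {
    0: {1: 6, 2: 3, 3: 0},
    1: {0: 6, 2: 0, 3: 3},
    2: {0: 3, 1: 0, 3: 1},
    3: {0: 0, 1: 3, 2: 1},
}
unequal = [{0, 1}, {2, 3}]   # utilities 6, 6, 1, 1
balanced = [{0, 2}, {1, 3}]  # utilities 3, 3, 3, 3

print(utilitarian_welfare(valuations, unequal), nash_welfare(valuations, unequal))    # 14 36
print(utilitarian_welfare(valuations, balanced), nash_welfare(valuations, balanced))  # 12 81
```

Of these two partitions, utilitarian welfare prefers the unequal one (14 vs. 12), while Nash welfare prefers the balanced one (81 vs. 36), which is the fairness-efficiency trade-off the abstract refers to.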

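[MA-17] The fitness landscape of social norms in social dilemmas

[Quick read]: This paper asks how social norms can coordinate agents in multi-agent systems to resolve social dilemmas, and in particular how norms that rational agents will adopt, and that can eventually become dominant, arise when stochastic signals in the environment are sufficiently correlated. The key idea is to model a norm as a strategy that depends on environmental signals and to analyze how norms evolve within evolutionary game theory, specifically through replicator dynamics, so that adopted norms lead agents to a correlated equilibrium rather than a Nash equilibrium. The author formulates the theory in the more general Markov game setting common in reinforcement learning, maps norms over the signal and reward space, and gives a general solution and analysis of the replicator dynamics.

Link: https://arxiv.org/abs/2605.18834
Authors: Maximilian Puelma Touzel
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI); Populations and Evolution (q-bio.PE)
Comments: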

Abstract:By specifying behaviour across multiple agents, social norms are a coordination approach to resolving social dilemmas. Decentralized and wide adoption can be achieved by norms whose prescription involves interpreting stochastic signals in the environment. Such signals must have enough correlation to orchestrate mutually beneficial coordination and enough disincentivizing uncertainty about the benefits of exploiting that coordination. Evolutionary game theory of matrix games has been used to describe how, by rational agents comparing and adopting norms, a norm can evolve to become dominant in a population. Morsky & Akçay (2019) classify norms according to a set of rationality criteria. Joint player strategies that adopt norms that are consistent with optimal single-player strategies with respect to expected reward naturally satisfy a correlated, rather than Nash, game-theoretic equilibrium condition. Here, we present a version of this theory that clarifies the basic ingredients. We formulate it in the more general Markov game setting more commonly used in reinforcement learning theory. We illustrate the theory by mapping norms over the signal and reward space, while also giving a detailed exposition of the underlying mechanics of the approach. Finally, we give a general solution and analysis of replicator dynamics, which Morsky & Akçay (2019) propose as a means by which these norms could emerge.
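
For readers unfamiliar with replicator dynamics, the following Python sketch integrates the textbook replicator equation dx_i/dt = x_i (f_i - f_mean) for a plain matrix game. It is only the generic mechanism, not the paper's norm-based or Markov-game formulation; the Prisoner's Dilemma payoffs, step size, and function names are illustrative assumptions.

```python
def replicator_step(x, payoff, dt=0.01):
    """One Euler step of dx_i/dt = x_i * (f_i - f_mean) for a matrix game."""
    n = len(x)
    # Fitness of strategy i against the current population mix.
    fitness = [sum(payoff[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(x[i] * fitness[i] for i in range(n))
    new_x = [x[i] + dt * x[i] * (fitness[i] - mean_fitness) for i in range(n)]
    total = sum(new_x)  # renormalize to guard against floating-point drift
    return [v / total for v in new_x]


def simulate(x0, payoff, steps=5000, dt=0.01):
    """Iterate the replicator step from the initial population shares x0."""
    x = list(x0)
    for _ in range(steps):
        x = replicator_step(x, payoff, dt)
    return x


# Illustrative Prisoner's Dilemma: strategy 0 cooperates, strategy 1 defects.
pd_payoff = [[3, 0],
             [5, 1]]
final_shares = simulate([0.9, 0.1], pd_payoff)
print(final_shares)  # the cooperator share decays towards 0
```

Without any coordinating signal, defection takes over even from a population that starts at 90% cooperators, which is the baseline dilemma that signal-dependent norms are meant to resolve.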

[MA-18] ClinQueryAgent: A Conversational Agent for Population Health Management ACL

[Quick read]: This paper addresses how healthcare professionals can turn natural language questions into executable database queries without exposing patient data, in support of population health management and clinical decision-making. The key to its solution is ClinQueryAgent, an architecture in which agents with access to both local and external knowledge bases translate natural language into structured queries, using powerful cloud-based language models while ensuring that no patient data leaves the secure environment. To counter the loss of accuracy caused by context rot in longer dialogues, information retrieval is delegated to a sub-agent.

Link: https://arxiv.org/abs/2605.18768
Authors: Joseph S. Boyle, Anthony Dranfield, Mike O’Neil, Maria Liakata, Alison Q. Smithard
Affiliations: Canon Medical Research Europe; Queen Mary University of London; Nottingham and Nottinghamshire ICB; NHS England; University of Edinburgh; The Alan Turing Institute
Categories: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
Comments: 11 pages, 4 figures. Submitted to ACL Systems Demonstrations

Abstract:In this paper we introduce ClinQueryAgent, a system for translating natural language population health questions into executable database queries using agents with access to both local and external knowledge bases. Our novel architecture enables the use of powerful cloud-based language models whilst ensuring that no patient data leaves the secure environment. To combat inaccuracies over the course of longer dialogues due to context rot, information retrieval is delegated to a sub-agent. We deploy the system via a chat window embedded within an existing population health management platform where it has been used by 128 staff from 15 healthcare practices covering a total of 148,319 patients in the UK’s National Health Service (NHS). We evaluate the system’s capacity to autonomously handle a range of health informatics tasks on a constructed dataset and via a beta-testing phase. Our results show that both analysts and clinicians are able to easily generate actionable information from patient health records using natural language requests requiring no programming expertise to verify. We make a public demo of the system available at: this https URL

[MA-19] Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

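[Quick read]: This paper argues that scientific conclusions drawn from LLM-based social simulations can be strongly affected by architectural details and small perturbations, so that the reported social mechanisms may be implementation artifacts rather than real social processes. Its proposed remedy is TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design (agent, interaction, and system), with robustness measured per claim and per model rather than assumed. Two case studies (a repeated Prisoner's Dilemma and a social media echo chamber simulation) show that minor changes in persona format and game-instruction framing can shift cooperation rates by up to 76 percentage points, and that different models respond very differently to the same perturbation.

Link: https://arxiv.org/abs/2605.18890
Authors: Jinyi Ye, Lei Cao, Ding Chen, Emilio Ferrara
Affiliations: University of Southern California
Categories: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: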

Abstract:The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a “butterfly effect.” Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner’s Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions. 
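
As a small illustration of how a perturbation-induced shift in cooperation rate is expressed in percentage points, consider the Python sketch below. The action logs and function names are hypothetical; the sketch does not reproduce the paper's simulations or its audit procedure.

```python
def cooperation_rate(actions):
    """Share of cooperative moves ("C") in a log of Prisoner's Dilemma actions."""
    return sum(action == "C" for action in actions) / len(actions)


def shift_in_percentage_points(baseline, perturbed):
    """Difference between two cooperation rates, in percentage points."""
    return 100 * (cooperation_rate(perturbed) - cooperation_rate(baseline))


# Hypothetical action logs under a baseline prompt and a reworded prompt.
baseline_actions = ["C"] * 15 + ["D"] * 5   # 75% cooperation
perturbed_actions = ["C"] * 5 + ["D"] * 15  # 25% cooperation

print(shift_in_percentage_points(baseline_actions, perturbed_actions))  # -50.0
```

A per-claim, per-model audit in the spirit of the abstract would repeat such a comparison for every perturbation and every model, rather than reporting a single configuration.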

[MA-20] Majoritarian Assignment Rules AAMAS

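[Quick read]: This paper addresses the fair assignment of objects to agents in multiagent systems, and in particular how classic majoritarian social choice functions behave in the assignment domain. The key is to exploit the special structure of that domain, which yields a near one-to-one correspondence between preference profiles and majority graphs; as a consequence, key properties of assignments such as Pareto-optimality, least unpopularity, and mixed popularity are determined solely by the associated majority graph. The authors further show that all Pareto-optimal assignments are semi-popular and belong to the top cycle, so elements of the top cycle can be found easily via serial dictatorships. The main result is a complete characterization of the top cycle, which can only consist of one, two, all but two, all but one, or all assignments.

Link: https://arxiv.org/abs/2602.14816
Authors: Felix Brandt, Haoyuan Chen, Chris Dong, Patrick Lederer, Alexander Schlenga
Affiliations: Unknown
Categories: Theoretical Economics (econ.TH); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: Appears in the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2026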

Abstract:A central problem in multiagent systems is the fair assignment of objects to agents. In this paper, we initiate the analysis of classic majoritarian social choice functions in assignment. Exploiting the special structure of the assignment domain, we show a number of surprising results with no counterparts in general social choice. In particular, we establish a near one-to-one correspondence between preference profiles and majority graphs. This correspondence implies that key properties of assignments – such as Pareto-optimality, least unpopularity, and mixed popularity – can be determined solely by the associated majority graph. We further show that all Pareto-optimal assignments are semi-popular and belong to the top cycle. Elements of the top cycle can thus easily be found via serial dictatorships. Our main result is a complete characterization of the top cycle, which implies the top cycle can only consist of one, two, all but two, all but one, or all assignments. By contrast, we find that the uncovered set contains only very few assignments.
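
The serial dictatorship mentioned in the abstract is a standard mechanism: agents pick, in a fixed order, their most preferred object that is still available, and with strict preferences the result is Pareto-optimal. A minimal Python sketch follows; the agents, objects, and preference lists are hypothetical, and the code does not compute majority graphs or the top cycle.

```python
def serial_dictatorship(preferences, order):
    """Assign objects by letting agents pick in `order`.

    `preferences` maps each agent to a list of objects, most preferred first.
    Each agent receives its best object that has not been taken yet.
    """
    taken = set()
    assignment = {}
    for agent in order:
        for obj in preferences[agent]:
            if obj not in taken:
                assignment[agent] = obj
                taken.add(obj)
                break
    return assignment


# Hypothetical strict preferences of three agents over three objects.
prefs = {
    "a1": ["x", "y", "z"],
    "a2": ["x", "z", "y"],
    "a3": ["y", "x", "z"],
}

print(serial_dictatorship(prefs, ["a1", "a2", "a3"]))  # {'a1': 'x', 'a2': 'z', 'a3': 'y'}
print(serial_dictatorship(prefs, ["a3", "a2", "a1"]))  # {'a3': 'y', 'a2': 'x', 'a1': 'z'}
```

Different picking orders generally yield different Pareto-optimal assignments, which is why iterating over orders is a convenient way to enumerate candidates.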

Natural Language Processing

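[NLP-0] TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

[Quick read]: This paper targets the efficiency bottleneck of deploying diffusion large language models (dLLMs) on resource-constrained devices, especially once dLLMs adopt mixture-of-experts (MoE) architectures, where existing methods designed for autoregressive (AR) models tend to incur heavy I/O overhead or compute bottlenecks. The key to its solution is TIDE, a system that exploits the temporal stability of expert activations within a block during the diffusion process. It introduces an interval-based expert refresh strategy that updates expert placement in an I/O-aware fashion, and it formulates inference scheduling as a mathematical programming problem whose solution is the interval that minimizes I/O traffic and CPU computation. TIDE is a lossless optimization that requires no model training, and on a single GPU-CPU system it achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash, respectively.

Link: https://arxiv.org/abs/2605.20179
Authors: Zhiben Chen, Youpeng Zhao, Yang Sui, Jun Wang, Yuzhang Shang
Affiliations: University of Central Florida; Mobi.AI; Rice University
Categories: Computation and Language (cs.CL)
Comments: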

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system that leverages the temporal stability of expert activations during the diffusion process within the block. Specifically, we introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To ensure optimal performance, we formulate the inference scheduling as a mathematical programming problem, solving for the optimal interval that minimizes I/O traffic and CPU computation. Most importantly, TIDE is a lossless optimization that requires no model training, providing a “free lunch” acceleration for dLLM inference. In a single GPU-CPU system, we demonstrate that TIDE achieves up to 1.4× and 1.5× throughput improvements over prior baselines on LLaDA2.0-mini and LLaDA2.0-flash models, respectively.

[NLP-1] From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models ICML2026

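[Quick read]: This paper argues that the performance of current vision-language models (VLMs) on visual tasks is limited mainly by weak visual perception rather than by reasoning ability. The key to its solution is to stage VLM post-training by capability, splitting it into three separate training stages (visual perception, visual reasoning, and textual reasoning), each with specialized data. The study finds that visual perception needs targeted optimization, is learned more effectively via RL than via caption-based SFT, and should be solidified as a scaffold before visual reasoning is refined. Staged training consistently outperforms merged training: the resulting models gain +5.2% on WeMath and +3.7% on RealWorldQA over the base counterpart, and reach 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that better perception reduces the need for excessive reasoning. Capability-based staging is also a curriculum dimension orthogonal to traditional difficulty-based curricula, and combining the two yields further gains.

Link: https://arxiv.org/abs/2605.20177
Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 9 figures; Accepted to ICML 2026; Project Page: this https URL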

Abstract:Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.

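[NLP-2] ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning

[Quick read]: This paper observes that existing clinical decision support systems built on large language models (LLMs) and agentic systems largely assume that evidence has already been curated and handed to the model, whereas real clinical workflows require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. The key to its solution is ClinSeekAgent, an automated agentic framework that gathers multimodal evidence from raw data sources by querying medical knowledge bases, navigating electronic health records (EHRs), and invoking medical imaging tools; it refines its hypotheses as new information emerges and integrates the collected evidence into grounded clinical decisions. The framework serves both as an inference-time agent that improves frontier LLMs (for example, raising Claude Opus 4.6 from 60.0 to 63.2 overall F1 on text-only EHR tasks) and as a training-time pipeline that distills agent trajectories into compact models (ClinSeek-35B-A3B reaches 34.0 average F1 on AgentEHR-Bench, +11.9 points over its baseline).

Link: https://arxiv.org/abs/2605.20176
Authors: Juncheng Wu, Letian Zhang, Yuhan Wang, Haoqin Tu, Hardy Chen, Zijun Wang, Cihang Xie, Yuyin Zhou
Affiliations: UC Santa Cruz
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 9 figures; Project Page: this https URL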

Abstract:Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.

[NLP-3] KoRe: Compact Knowledge Representations for Large Language Models

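[Quick read]: This paper addresses an inherent weakness in how large language models (LLMs) represent knowledge: world knowledge is encoded in their parameters, which makes it opaque, difficult to debug and update, and prone to hallucinations. The proposed solution, KoRe, encodes 1-hop sub-graphs into compact discrete knowledge tokens and injects them into an LLM backbone to ground the model in knowledge-graph facts. The method avoids the extensive retraining or finetuning that current integration techniques require; on three benchmarks it reports competitive performance with a significant reduction (up to 10x) in token usage, showing that compact discrete knowledge graph (KG) representations can ground modern LLMs efficiently and effectively.

Link: https://arxiv.org/abs/2605.20170
Authors: Davide Cavicchini, Fausto Giunchiglia, Jacopo Staiano
Affiliations: University of Trento
Categories: Computation and Language (cs.CL)
Comments: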

Abstract:Modern Large Language Models (LLMs) have shown impressive performances in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world-knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into a LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performances coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.
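
To clarify what a 1-hop sub-graph is, the Python sketch below extracts one from a list of (head, relation, tail) triples. It covers only the extraction step under that simple triple representation; how KoRe turns such a sub-graph into discrete knowledge tokens is not described in the abstract and is not attempted here. The triples and the function name are hypothetical.

```python
def one_hop_subgraph(triples, entity):
    """Return all triples in which `entity` appears as head or tail."""
    return [(h, r, t) for (h, r, t) in triples if h == entity or t == entity]


# Hypothetical knowledge-graph triples.
kg = [
    ("Trento", "located_in", "Italy"),
    ("Italy", "capital", "Rome"),
    ("Adige", "flows_through", "Trento"),
    ("Rome", "located_in", "Italy"),
]

for triple in one_hop_subgraph(kg, "Trento"):
    print(triple)
# ('Trento', 'located_in', 'Italy')
# ('Adige', 'flows_through', 'Trento')
```

Serialized as text, even this small neighbourhood costs many prompt tokens, which is the overhead that a compact token encoding is meant to reduce.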

[NLP-4] Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models

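[Quick read]: This paper addresses the concern that large vision language models (LVLMs) applied to medical imaging such as chest X-rays do not faithfully ground their responses in visual evidence, which undermines clinical trustworthiness. Visual attribution methods are widely used to explain LVLM predictions, but whether they reflect the evidence the model actually relies on has gone largely unverified, because ground-truth annotations for internal model reasoning are typically unavailable. The key to its solution is a causal evaluation framework that uses counterfactual editing to retain only samples whose expert-annotated region is verified to be causally responsible for the model's prediction. Applying it across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning) shows that existing methods often fail to identify the evidence used by LVLMs. The authors therefore propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions; it produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods.

Link: https://arxiv.org/abs/2605.20158
Authors: Guangzhi Xiong, Qiao Jin, Sanchit Sinha, Zhiyong Lu, Aidong Zhang
Affiliations: University of Virginia; National Institutes of Health
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: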

Abstract:Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model’s decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model’s prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at this https URL.

[NLP-5] MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

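[Quick read]: This paper asks whether large language models (LLMs) used in high-stakes decision-making, having been trained on corpora that embed attentional biases, overlook subtle yet important contextual cues under explicit task instructions, a limitation analogous to inattentional blindness in human cognition. To measure this, it introduces the explicit-implicit reasoning task and the MixRea benchmark. As mitigation, it proposes Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations.

Link: https://arxiv.org/abs/2605.20128
Authors: Yuanqing Cai, Ziyi Huang, Minhao Liu, Lixin Duan, Wen Li, Yanru Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures, 4 tables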

Abstract:Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.

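[NLP-6] ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

[Quick read]: Conversational AI now reaches billions of users, yet existing datasets record only what users say, not why they say it, that is, their motivations, intentions, and reactions to AI responses. To fill this gap the authors introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and their reactions to assistant responses. The dataset comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations. The analysis shows that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. Thoughts are also useful downstream: as inference-time context they improve user-behavior prediction, and thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. ThoughtTrace thus establishes user thoughts as a new data modality for studying the cognitive dynamics behind human-AI interaction.

Link: https://arxiv.org/abs/2605.20087
Authors: Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 53 pages, 23 figures, 4 tables. Project website: this https URL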

Abstract:Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human–AI conversations with users’ self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human–AI interaction and provides a foundation for building assistants that better understand and adapt to users’ latent goals, preferences, and needs.

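[NLP-7] BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

[Quick read]: This paper addresses how retrieval-augmented generation (RAG) can preserve factual accuracy while avoiding unnecessary retrieval calls when the large language model (LLM) can answer reliably from its own knowledge. Applying RAG to every query wastes resources, while calibrating a cascade stage by stage may be too conservative. The key to its solution is BalanceRAG, which treats each pair of uncertainty thresholds for the LLM-only and RAG branches as an operating point on a two-dimensional lattice and uses sequential graphical testing to identify the operating points that are safe at a target risk level. This risk-adaptive calibration controls the system-level error rate among accepted examples while retaining more correct answers and reducing unnecessary retrieval calls; it also extends to multi-risk calibration, in which retrieval usage is bounded together with the selection-conditioned risk.

Link: https://arxiv.org/abs/2605.20084
Authors: Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang
Affiliations: Beihang University; Shenzhen Institute of Advanced Technology; Zhejiang University of Finance & Economics; University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: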

点击查看摘要

Abstract:Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
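The cascade described in the abstract can be sketched as a routing rule plus an empirical scan over threshold pairs. This is only an illustration under stated assumptions: the record format, the function names, and the naive "empirical risk at most alpha" filter are hypothetical, and the filter carries no statistical guarantee. The paper certifies threshold pairs with sequential graphical testing, which is not reproduced here.

```python
from itertools import product

def cascade_decision(u_llm, u_rag, t_llm, t_rag):
    """Route one query: accept the LLM-only answer, fall back to RAG, or abstain."""
    if u_llm <= t_llm:
        return "llm"
    if u_rag <= t_rag:
        return "rag"
    return "abstain"

def evaluate_pair(records, t_llm, t_rag):
    """Empirical selective risk, coverage, and retrieval rate for one threshold pair.

    records: list of (u_llm, u_rag, llm_correct, rag_correct) tuples
    (a hypothetical calibration-set format, not taken from the paper).
    """
    accepted = errors = escalated = 0
    for u_llm, u_rag, llm_ok, rag_ok in records:
        branch = cascade_decision(u_llm, u_rag, t_llm, t_rag)
        if u_llm > t_llm:
            escalated += 1  # retrieval was invoked, even if the query is abstained later
        if branch == "abstain":
            continue
        accepted += 1
        correct = llm_ok if branch == "llm" else rag_ok
        errors += 0 if correct else 1
    n = len(records)
    risk = errors / accepted if accepted else 0.0
    return risk, accepted / n, escalated / n

def empirically_safe_pairs(records, grid, alpha):
    """Naive scan of the 2D threshold lattice; NOT a certified procedure."""
    return [(t1, t2) for t1, t2 in product(grid, grid)
            if evaluate_pair(records, t1, t2)[0] <= alpha]
```

Among the pairs that pass, one would typically prefer the pair with the highest coverage or the lowest retrieval rate, which is the trade-off the abstract describes.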

[NLP-8] CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

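Quick read: This paper targets the inefficiency of the conventional chain-of-thought (CoT) paradigm, which forces thinking before answering and therefore spends unnecessary tokens even when the large language model (LLM) could already produce a plausible answer before finishing its reasoning. The key to the solution is CopT (Contrastive On-Policy Thinking), a reformulated reasoning pipeline that first elicits a draft answer and then performs on-policy thinking conditioned on that draft for reflection and correction. CopT recasts continuous embeddings as inference-time contrastive verifiers: it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, and uses the resulting sequence-level reverse KL estimator to assess answer reliability. If the draft is judged insufficiently reliable, CopT continues on-policy thinking, and a second KL estimator dynamically controls how visible the draft answer is, preserving useful information while lowering the risk of being misled. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, with no additional training.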

Link: https://arxiv.org/abs/2605.20075
Authors: Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee
Affiliations: Georgia Tech; UC Berkeley; Stanford University; Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code: this https URL , Website: this https URL

Abstract:Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model’s support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at this https URL.
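A sequence-level reverse KL estimate of the kind the abstract mentions can be sketched as a single-sample Monte Carlo sum of log-probability differences over the generated tokens. The sketch below assumes the per-token log-probabilities have already been obtained from two forward passes. Which input mode plays the role of the sampling distribution, the function names, and the fixed threshold are all assumptions, not details confirmed by the source.

```python
def reverse_kl_estimate(logp_generating, logp_other):
    """Single-sample estimate of KL(q || p) over one generated sequence.

    logp_generating[t]: log q(y_t | context), under the input mode that produced the tokens.
    logp_other[t]:      log p(y_t | context), for the same tokens under the other input mode.
    """
    if len(logp_generating) != len(logp_other):
        raise ValueError("both passes must score the same tokens")
    return sum(lq - lp for lq, lp in zip(logp_generating, logp_other))

def needs_more_thinking(estimate, threshold):
    """Hypothetical decision rule: a large disagreement marks the draft as unreliable."""
    return estimate > threshold
```

A small estimate means the two input modes agree on the draft answer tokens; a large one would trigger the further on-policy thinking described above.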

[NLP-9] Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

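Quick read: This paper addresses the reliance of knowledge graph question answering (KGQA) on large models or on full supervision in the form of gold query annotations. It examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model for zero-shot Text-to-SPARQL generation when no labeled queries are available. The key to the solution is Group-Relative Policy Optimization (GRPO), applied to Qwen3-1.7B on DBLP-QuAD with prompts that combine the natural language question with symbolic hints about entities and relations. Training signals come from execution feedback, structural constraints, and answer-level rewards, with an additional variant that adds gold-query-based reward shaping. GRPO substantially outperforms the zero-shot baseline and generalizes competitively. Supervised DoRA fine-tuning reaches higher overall accuracy, but GRPO remains viable when token-level supervision is unavailable. Ablations show that execution-based rewards account for most of the gain and that shaping adds little, which supports outcome-based reinforcement learning as an alternative training strategy.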

Link: https://arxiv.org/abs/2605.20066
Authors: Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by NeSy 2026

Abstract:Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
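The group-relative part of GRPO can be sketched in a few lines: several queries are sampled for the same question, each receives a scalar outcome reward, and the rewards are standardized within the group. The reward weights and the three boolean signals below are illustrative assumptions. The abstract names execution feedback, structural constraints, and answer-level rewards but does not give their exact form.

```python
from statistics import mean, pstdev

def outcome_reward(executes, well_formed, answer_matches,
                   w_exec=0.3, w_struct=0.2, w_answer=0.5):
    """Combine three outcome signals; the weights are hypothetical, not from the paper."""
    return w_exec * executes + w_struct * well_formed + w_answer * answer_matches

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group mean, a query that merely executes can still receive a positive learning signal when its sibling samples fail to parse, which is one plausible reason execution-based rewards matter without gold queries.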

[NLP-10] Rewarding Beliefs Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

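Quick read: This paper addresses two difficulties that large language model (LLM) agents face in long-horizon interactive tasks under partial observability: belief drift caused by incomplete observations, and temporal credit assignment made harder by delayed rewards. The key to the solution is ReBel (Reward Belief), a process-level reinforcement learning algorithm with two core ideas: 1) belief-consistency supervision, which turns discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without external step-wise annotations or verifiers; and 2) belief-aware grouping, which compares trajectories under similar belief states to obtain more robust, lower-variance advantage estimates. On long-horizon benchmarks including ALFWorld and WebShop, ReBel improves task success by up to 20.4 percentage points over the episode-level GRPO baseline and increases sample efficiency by 2.1×, suggesting that belief-aware self-supervision is a promising route to reliable long-horizon decision-making under partial observability.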

Link: https://arxiv.org/abs/2605.20061
Authors: Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou
Affiliations: National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Institute of Artificial Intelligence, Xiamen University
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 4 figures, 3 tables, plus appendix

Abstract:Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to 20.4 percentage points over the episode-level baseline GRPO and increases sample efficiency by 2.1×. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: this https URL.

[NLP-11] PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling ACL

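Quick read: This paper addresses automatic labeling of radiology reports in low-resource settings, that is, how to achieve accurate multi-label classification when labeled data is scarce. Rule-based labelers struggle with the variety of descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) needs large amounts of labeled data that are often unavailable in clinical settings. The key to the solution is PromptRad, a knowledge-enhanced multi-label prompt-tuning method: it reformulates multi-label classification as masked language modeling and builds a multi-word verbalizer from synonyms in the UMLS Metathesaurus to enrich category representations. Because the PLM is fine-tuned without additional classification layers, far less labeled data is required. With only 32 labeled examples, PromptRad outperforms dictionary-based and conventional fine-tuning baselines, performs competitively with GPT-4 while using a much smaller model, and captures complex negation patterns more effectively than existing methods.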

Link: https://arxiv.org/abs/2605.20052
Authors: Ying-Jia Lin, Tzu-Chin Lo, Ping-Chien Li, Chi-Tung Cheng, Chien-Hung Liao, Hung-Yu Kao
Affiliations: Chang Gung University; Sijhih Cathay General Hospital; Chang Gung Memorial Hospital; National Tsing Hua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: BioNLP 2026 @ ACL

Abstract:Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at this https URL.
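The verbalizer idea can be illustrated with a toy scoring step: each category owns several label words, and the category score aggregates the masked-position probabilities of those words. Everything concrete here is an assumption made for illustration: the synonym lists are invented rather than drawn from UMLS, mean aggregation and the fixed threshold are placeholder choices, and the probabilities stand in for a real masked language model's output.

```python
def label_scores(mask_probs, verbalizer):
    """Average the [MASK]-position probability over each category's label words.

    mask_probs: dict mapping a vocabulary word to its probability at the masked position.
    verbalizer: dict mapping a category to its list of label words (hypothetical synonyms).
    """
    return {
        label: sum(mask_probs.get(word, 0.0) for word in words) / len(words)
        for label, words in verbalizer.items()
    }

def predict_labels(scores, threshold):
    """Multi-label decision: keep every category whose score clears the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)
```

Pooling over synonyms lets a category fire even when the model prefers a surface form that a single label word would miss.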

[NLP-12] Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

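Quick read: This paper asks how language mutations affect the persistent diffusion of conspiracy theories on social media. It finds that conspiracy claims with greater semantic mutation have longer lifespans. Specifically, mutations in psycholinguistic properties (pronouns, social reference words, cognitive process terms, and risk- and health-related vocabulary) and mutations at the level of actor-action-target (AAT) structure are both associated with extended lifespans. Qualitative analysis identifies two predominant mutation patterns, simplification and assimilation, which appear at both the linguistic and the AAT structural level. The authors argue that content moderation should account for the mutability of conspiracy claims and focus on the core claims, so that it can address their potential variations and support longer-term governance.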

Link: https://arxiv.org/abs/2605.20050
Authors: Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale
Affiliations: unknown
Subjects: Computation and Language (cs.CL)

Abstract:This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, and risk- and health-related vocabulary, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns: simplification and assimilation, at both linguistic and AAT structural levels. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed light on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims that can address their potential variations.

[NLP-13] Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

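Quick read: This paper starts from the observation that neural sequence-to-sequence models reach high aggregate accuracy on Japanese past-tense morphological inflection, yet their systematic errors have not been analyzed in depth, and it is unclear whether those errors relate to the orthographic properties of hiragana. The key to the solution is an orthography-aware error taxonomy that groups errors into seven primary failure modes. Gemination-related errors account for 75-80% of residual errors and are concentrated in verbs whose stems end in the vowel "e" and require gemination before the past-tense suffix. The error patterns are highly consistent across architectures and random seeds, pointing to a robust interaction between orthographic representation, morphological structure, and data frequency effects, and underlining the need for orthography-aware evaluation when studying neural generalization in morphologically complex languages.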

Link: https://arxiv.org/abs/2605.20043
Authors: Wen Zhang
Affiliations: unknown
Subjects: Computation and Language (cs.CL)

Abstract:We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.
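One way to flag a gemination-related error automatically is to check whether a prediction differs from the gold form only in the presence or absence of the small っ (sokuon). This is a minimal sketch under that assumption. The paper's actual seven-way taxonomy and its decision rules are not given in the abstract, and the function name is hypothetical.

```python
SOKUON = "っ"  # hiragana small tsu, which marks gemination

def is_gemination_error(predicted, gold):
    """True if the forms differ, but become identical once every っ is removed."""
    if predicted == gold:
        return False
    return predicted.replace(SOKUON, "") == gold.replace(SOKUON, "")

# Example: the past tense of かえる (帰る, "to return") is かえった.
# A model that treats the verb like an ichidan verb would output かえた instead.
examples = [("かえた", "かえった"), ("かえって", "かえった"), ("かえった", "かえった")]
flags = [is_gemination_error(pred, gold) for pred, gold in examples]
```

The first pair is flagged because only the っ is missing. The second is a different kind of error (wrong final kana), and the third is a correct prediction.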

[NLP-14] FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

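Quick read: This paper addresses two limitations of speculative decoding. Conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, while existing parallel methods either need costly continual pretraining that degrades quality or have low acceptance rates, and their throughput gains collapse at large batch sizes because uncertainty in the bonus token and the accepted length causes draft-verification mismatch. The key to the solution is FlexDraft, a lossless framework built on three designs: (1) Attention Tuning enables block-diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, keeping the autoregressive path frozen so that the target distribution is preserved and drafts stay high quality with few trainable parameters; (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating the mismatch caused by bonus-token uncertainty; (3) Flex Decoding switches dynamically between parallel draft-and-verify at small batch sizes and sequential draft-then-verify at large batch sizes, and adjusts the verification length according to draft confidence to eliminate redundant computation.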

Link: https://arxiv.org/abs/2605.20022
Authors: Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang
Affiliations: EPIC Lab, SJTU; UESTC; School of Software Engineering, HUST; Tsinghua University; HKUST(GZ); Shanghai AI Laboratory
Subjects: Computation and Language (cs.CL)

Abstract:Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
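The roles of the "accepted length" and the "bonus token" are easiest to see in the standard greedy verification step of speculative decoding, sketched below. This shows the generic technique under greedy decoding, not FlexDraft's own procedure. The function name and the list-based interface are assumptions made for illustration.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target model's greedy choices.

    draft_tokens:  k tokens proposed by the drafter.
    target_tokens: k + 1 tokens, where target_tokens[i] is the target model's greedy
                   choice given the prefix followed by draft_tokens[:i].
    Returns (accepted_tokens, bonus_token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one extra")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted.append(drafted)
    # The bonus token is the target's own choice at the first unaccepted position.
    bonus = target_tokens[len(accepted)]
    return accepted, bonus
```

Both return values are known only after verification finishes, which is why a drafter that prepares the next block in parallel has to guess them, and why a wrong guess produces the mismatch the abstract describes.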

[NLP-15] Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

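Quick read: This paper addresses the weak dialogue coherence and shallow reasoning that large language model (LLM) agents show in long-term interaction when their memory system is inadequate. Existing methods usually follow an extracted-fact paradigm, in which handcrafted static prompts compress raw dialogue into atomic facts for storage and retrieval. This discards fine-grained details, cannot support deep reasoning over scattered facts, and cannot keep extraction granularity consistent across dialogue styles. The key to the solution is TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers (for storage fidelity), atomic facts (for efficient retrieval), and synthesized profiles that aggregate dispersed facts (for deeper semantic reasoning). In addition, TextGrad-based prompt optimization iteratively refines the extraction and profiling prompts from response-quality feedback, enabling lifelong evolution without parameter updates. On LoCoMo and PerLTQA, across multiple LLM backbones, TriMem consistently outperforms strong memory baselines.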

Link: https://arxiv.org/abs/2605.19952
Authors: Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
Affiliations: unknown
Subjects: Computation and Language (cs.CL)

Abstract:To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted fact based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities, including raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updating. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at this https URL .

[NLP-16] GEM: GPU-Variability-Aware Expert to GPU Mapping for MoE Systems

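Quick read: This paper addresses the performance bottleneck that uneven GPU load and hardware variability create for distributed inference with Mixture-of-Expert (MoE) models. Existing MoE serving engines process tokens in lock-step, so each layer must wait for the slowest GPU to finish, which limits overall throughput. The key to the solution is GEM (GPU-variability-aware Expert Mapping), a framework that maps experts to GPUs while accounting for differences in GPU performance. GEM rests on two insights. First, experts should be placed so that each GPU receives a non-uniform token load matched to its variability, letting all GPUs finish a layer at about the same time. Second, there are two types of experts, "consistent" experts that are used most of the time and "temporal" experts that are often used together for the remaining time; simultaneously used consistent and temporal experts should go on different GPUs and should be kept off slower GPUs. In experiments, GEM improves end-to-end latency by 7.9% on average and by up to 16.5% compared to the baseline.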

Link: https://arxiv.org/abs/2605.19945
Authors: Sourish Wawdhane, Avinash Kumar, Poulami Das
Affiliations: The University of Texas at Austin
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Mixture-of-Expert (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and route tokens to appropriate GPUs at inference time based on experts activated. They process tokens in lock-step fashion, where tokens within a batch must finish processing before proceeding to the next layer. This synchronization barrier acts as a critical bottleneck because the performance of MoE models is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or the slowest GPU. While prior works place experts that balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM, GPU-variability-aware Expert Mapping, a framework for GPU variability-aware expert to GPU mapping for MoE models. GEM exploits two insights. First, we must place experts such that each GPU receives non-uniform token loads based on their variability and they all finish processing a layer at about the same time. Our studies show that there are two types of experts: consistent that are used most of the time and temporal that are often used together for the remaining time. Our second insight is that we must place simultaneously used consistent and temporal experts on different GPUs and avoid placing them on slower GPUs to reduce slowdown. GEM gathers the variability profile of GPUs for each model and task and uses the token load distributions per task to map experts to GPUs. Our experiments show that GEM improves end-to-end latency by 7.9% on average and by up to 16.5% compared to the baseline. 
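The first insight, giving faster GPUs proportionally more tokens so that all GPUs finish together, can be sketched with a greedy longest-processing-time heuristic that divides load by a per-GPU speed factor. This is a simplified stand-in, not GEM's algorithm: it ignores per-GPU expert capacity and the consistent/temporal distinction, and the speed factors and expert names are hypothetical.

```python
def map_experts(expert_loads, gpu_speeds):
    """Greedily assign experts (heaviest first) to the GPU that would finish earliest.

    expert_loads: dict mapping an expert to its expected token count per batch.
    gpu_speeds:   dict mapping a GPU to its relative throughput (1.0 = nominal; lower = slower).
    Returns (assignment, finish_time) keyed by GPU.
    """
    assignment = {gpu: [] for gpu in gpu_speeds}
    load = {gpu: 0.0 for gpu in gpu_speeds}
    for expert, tokens in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        # Projected completion time if this expert were added to each GPU.
        best = min(gpu_speeds, key=lambda gpu: (load[gpu] + tokens) / gpu_speeds[gpu])
        assignment[best].append(expert)
        load[best] += tokens
    finish_time = {gpu: load[gpu] / gpu_speeds[gpu] for gpu in gpu_speeds}
    return assignment, finish_time

assignment, finish_time = map_experts(
    {"e0": 100, "e1": 60, "e2": 40, "e3": 10},
    {"fast": 1.0, "slow": 0.5},
)
```

In this toy run the slow GPU ends up with half the tokens of the fast one and both finish at the same time, so neither becomes the straggler at the layer barrier.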

[NLP-17] A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

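Quick read: This paper seeks a theoretical account of why large language models (LLMs) generalize poorly out of distribution (OOD). Its core move is to formalize reasoning via optimal transport: discrete trajectories are projected into a continuous metric space, domain shift is quantified with the Wasserstein-1 distance, and Kantorovich duality is used to bound OOD generalization through architectural Lipschitz continuity and functional approximation limits. Two findings follow. First, position-dependent attention (e.g., absolute positional encoding) breaks shift invariance and yields an Ω(1) Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., rotary embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-k language, the paper establishes a strict circuit-depth lower bound for TC⁰ Transformers: widening the representation cannot avert representation collapse because of irreducible approximation bounds in Barron spaces, so physical layer depth must be scaled. Experiments on combinatorial search across 54 Transformer configurations corroborate these bounds, with generalization risk degrading monotonically as the Wasserstein domain shift grows.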

Link: https://arxiv.org/abs/2605.19944
Authors: Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen
Affiliations: McGill University; Mila - Quebec AI Institute
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL)
Comments: Preprint

Abstract:While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an Ω(1) Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-k language, we establish a strict circuit depth lower bound for TC^0 Transformers. Scaling physical layer depth is necessary to avert representation collapse – a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
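The role of Kantorovich duality can be made concrete with the Kantorovich-Rubinstein inequality: for any L-Lipschitz function f, the gap between its expectations under two distributions is at most L times their Wasserstein-1 distance. The sketch below checks this on one-dimensional empirical samples of equal size, where W1 reduces to the mean absolute difference of the sorted values. Reducing trajectories to scalar features is an assumption made purely for illustration; the paper works in a more general metric space.

```python
def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1D empirical distributions (sorted-coupling formula)."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def expectation_gap(f, xs, ys):
    """Absolute difference between the sample means of f under the two distributions."""
    mean_x = sum(f(x) for x in xs) / len(xs)
    mean_y = sum(f(y) for y in ys) / len(ys)
    return abs(mean_x - mean_y)

# Hypothetical in-distribution and shifted samples; f(x) = 2|x| is 2-Lipschitz.
in_dist = [0.0, 1.0, 2.0]
shifted = [1.0, 2.0, 4.0]
lipschitz_constant = 2.0
shift = wasserstein_1d(in_dist, shifted)
gap = expectation_gap(lambda x: 2.0 * abs(x), in_dist, shifted)
bound_holds = gap <= lipschitz_constant * shift + 1e-12
```

If the loss is Lipschitz in the chosen metric, a bound of this form ties the growth of OOD risk to the size of the domain shift, which is the sense in which a small Lipschitz constant controls generalization.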

[NLP-18] What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience LREC2026

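Quick read: This paper asks whether the growing use of large language models (LLMs) in the writing process has changed the style of scientific communication. To answer, the authors build two resources: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024) and a synthetic dataset of 3,000 human-written passages paired with their LLM-generated improvements. The approach combines diachronic lexical analyses with modeling of more complex stylistic features, and then links the observed changes to reading experience through an annotation study with domain experts. LLM-modified texts more often contain certain syntactic constructions, use longer and more complex words, and show lower lexical diversity. Experts overall rate the LLM-improved texts as more understandable and exciting, yet also express negative qualitative attitudes towards LLMs, which highlights how subjective the effect of AI-assisted writing on readers is.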

Link: https://arxiv.org/abs/2605.19936
Authors: Filip Miletić, Neele Falk
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to LREC 2026

Abstract:Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024); and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, more complex and longer words and a lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. They overall rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.
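Two of the stylistic signals mentioned, lexical diversity and word length, have simple textbook proxies: the type-token ratio and the mean token length. The sketch below uses those proxies with a deliberately crude tokenizer. The abstract does not say which diversity measure the authors use, so both the measures and the tokenization rule are assumptions.

```python
import re

def tokenize(text):
    """Lowercase alphabetic tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Distinct tokens divided by total tokens; lower values mean less lexical diversity."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_word_length(text):
    """Average number of characters per token."""
    tokens = tokenize(text)
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
```

The type-token ratio is sensitive to text length, so comparisons between a passage and its rewritten version are only meaningful when the two are of similar length or a length-corrected variant is used.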

[NLP-19] PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

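Quick read: This paper observes that large language model (LLM) agents increasingly work over long, recurring external contexts such as document corpora and code repositories, yet existing approaches fail to preserve reusable orientation knowledge about the context itself: what it contains, how it is organized, and which entities, constants, and schemas have historically been useful. The key to the solution is PEEK, a system built around a small, constant-sized context map kept in the agent's prompt, which gives the agent a persistent peek into the recurring context. The map is maintained by three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework ACE. On context learning, it improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. The gains generalize across language models and agent architectures.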

Link: https://arxiv.org/abs/2605.19932
Authors: Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Abstract:Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent’s trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent’s prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
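A priority-based evictor under a fixed token budget can be sketched as a greedy selection: sort the map entries by priority and keep each one that still fits. The entry format, the priority values, and the greedy policy are assumptions for illustration. The abstract does not describe how PEEK computes priorities or resolves ties.

```python
def evict_to_budget(entries, token_budget):
    """Keep the highest-priority context-map entries that fit within the token budget.

    entries: list of (priority, token_count, text) tuples (hypothetical format).
    Returns (kept_entries, tokens_used); everything not kept is evicted.
    """
    kept, used = [], 0
    for priority, tokens, text in sorted(entries, key=lambda e: -e[0]):
        if used + tokens <= token_budget:
            kept.append((priority, tokens, text))
            used += tokens
    return kept, used

context_map = [
    (0.9, 50, "orders table schema"),
    (0.5, 80, "list of config constants"),
    (0.7, 40, "repository layout"),
]
kept, used = evict_to_budget(context_map, token_budget=100)
```

Because the budget is fixed, the map stays a constant-size part of the prompt no matter how many invocations have contributed entries.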

[NLP-20] Where Does Authorship Signal Emerge in Encoder-Based Language Models?

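Quick read: This paper investigates why authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. Using mechanistic interpretability tools, it traces the gap to its source. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, so the gap does not come from representation quality. Instead, the scorer determines where the encoder consolidates authorship signal: mean pooling forces consolidation by the early to mid layers, while late interaction defers it to later layers. The difference can be derived from the gradient structure of each scorer, and it shows up in training dynamics as distinct learning trajectories.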

Link: https://arxiv.org/abs/2605.19908
Authors: Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero
Affiliations: Inria Paris; Sorbonne Université; IRIF; LRE, EPITA; Ecole nationale des chartes – PSL
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 6 figures. Under review

Abstract:Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.
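The two scoring mechanisms can be contrasted on toy token embeddings. Mean pooling collapses each document to one vector before comparison, while late interaction compares token vectors directly. The sketch assumes a ColBERT-style MaxSim form for late interaction and cosine similarity throughout. The abstract does not specify the exact scorer definitions, so both choices are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mean_pool_score(doc_a, doc_b):
    """Average the token vectors of each document, then compare the two averages."""
    pooled_a = [sum(col) / len(doc_a) for col in zip(*doc_a)]
    pooled_b = [sum(col) / len(doc_b) for col in zip(*doc_b)]
    return cosine(pooled_a, pooled_b)

def late_interaction_score(doc_a, doc_b):
    """MaxSim: match each token of doc_a to its best token in doc_b, then average."""
    return sum(max(cosine(ta, tb) for tb in doc_b) for ta in doc_a) / len(doc_a)

doc_a = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-token, 2-dimensional embeddings
doc_c = [[1.0, 0.0], [1.0, 0.0]]
pooled = mean_pool_score(doc_a, doc_c)
late = late_interaction_score(doc_a, doc_c)
```

With mean pooling, the encoder must place any author-specific information into a single averaged vector, whereas MaxSim lets individual token representations carry it. That is consistent with the abstract's finding that the scorer changes where consolidation happens.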

[NLP-21] Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning ICML2026

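Quick read: This paper addresses the tendency of tool-augmented multimodal large language models (MLLMs) to over-rely on tools or invoke them unnecessarily during reasoning, which increases computational overhead and can mislead predictions. The key to the solution is AutoTool, a model that decides adaptively whether to invoke tools for each query. Its core designs are: 1) within a reinforcement learning framework, an explicit dual-mode reasoning strategy (tool-assisted and text-centric) with mode-specific reward functions that steer the model toward accurate answers; and 2) joint exploration and balancing of the two modes throughout training, which prevents premature bias toward a single mode, combined with free exploration in later stages. In experiments, AutoTool achieves a 21.8% accuracy gain over the base model on the V* benchmark and a 44.9% efficiency improvement over existing tool-augmented methods on the POPE benchmark.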

Link: https://arxiv.org/abs/2605.19852
Authors: Qinghe Ma, Zhen Zhao, Yiming Wu, Jian Zhang, Lei Bai, Yinghuan Shi
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ICML 2026

Abstract:Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available at this https URL.

[NLP-22] CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

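Quick read: This paper addresses the limited use of deep learning models in high-stakes domains such as medical diagnosis and finance, which stems from the poor interpretability of black-box models. The key to the solution is to use influence functions, which quantify the effect of training samples on model predictions, to improve the interpretability of NLP models at both the sample level and the concept level. At the sample level, influence functions identify the training samples that most affect predictions (both helpful and harmful); adjusting the labels or weights of these samples restores performance to baseline levels without retraining, which enables efficient debugging. At the concept level, in Concept Bottleneck Models (CBMs), influence functions locate the key concepts that significantly affect predictions; modifying these concepts visibly changes model behavior, giving clear and understandable insight into the decision process.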

Link: https://arxiv.org/abs/2605.19848
Authors: Yike Sun, Mingkun Xu, Mu You, Zhongzhi He, Henghua Shen, Zehan Tan, Derek F. Wong, Tao Fang
Affiliations: Macau Mobile Communications Co., Ltd.; University of Macau
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, the black-box nature of deep learning models has limited their application in high-stakes domains such as medical diagnosis and finance, where interpretability is essential. To address this, we propose a novel approach using influence functions to enhance interpretability in NLP models at both the sample and concept levels. Experiments on CEBaB and Yelp datasets show that influence functions effectively identify the most impactful training samples, both helpful and harmful, on model predictions. By adjusting the labels and weights of these samples, we demonstrate that model performance can be restored to baseline levels without retraining, confirming the value of influence functions for efficient data debugging. Furthermore, our concept-level analysis identifies key concepts within Concept Bottleneck Models (CBM) that significantly affect predictions. Modifying these concepts alters model behavior observably, providing clear insights into the decision process.
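
For orientation, the abstract does not state the formula it uses. The standard first-order influence approximation from the influence-function literature is sketched below; the symbols ($\hat\theta$ for the trained parameters, $L$ for the loss, $H_{\hat\theta}$ for the empirical Hessian, $z$ and $z_{\text{test}}$ for a training and a test sample) are assumptions, and the paper's exact sample-level and concept-level variants may differ.

```latex
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
      H_{\hat\theta}^{-1}\,
      \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)
```

Under this convention, upweighting a training sample $z$ by a small $\epsilon$ changes the test loss by roughly $\epsilon\,\mathcal{I}(z, z_{\text{test}})$, so a positive value marks a harmful sample and a negative value a helpful one.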

[NLP-23] FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding CVPR’26

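Quick read: This paper addresses the fact that current vision-language models (VLMs) perform well on general video understanding but fall short on fine-grained understanding, especially in real-world applications that require precise interpretation of human actions and interactions. Existing benchmarks do not jointly cover long videos, dense QA coverage, and frame-level spatiotemporal grounding, so models' fine-grained understanding is not adequately evaluated. The key to the solution is FineBench, a new human-centric video question answering (VQA) benchmark comprising 64 long videos of 15 minutes each and 199,420 multiple-choice questions, focused on fine-grained content such as person movement, interaction, and object manipulation, with particular emphasis on compositional actions. The paper also proposes FineAgent, a modular framework that adds a Localizer and a Descriptor to strengthen VLMs' spatial reasoning and improves open-source models in complex multi-person scenes, providing both a rigorous evaluation platform and a practical improvement for fine-grained human-centric video understanding.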

Link: https://arxiv.org/abs/2605.19846
Authors: Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu
Affiliations: National Taiwan University; Google; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: CVPR’26 (Workshop on Video Large Language Models)

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

[NLP-24] CADENet: Condition-Adaptive Asynchronous Dual-Stream Enhancement Network for Adverse Weather Perception in Autonomous Driving

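Quick read: This paper addresses the degradation of camera-based object detection for autonomous driving in adverse weather (rain, fog, sand, and snow), and the bottleneck that existing enhance-then-detect methods cannot meet real-time requirements because of processing latency. The key to the solution is CADENet, a training-free three-thread system: Thread S (YOLOv11n) runs detection at full frame rate with no added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results with entropy-guided non-maximum suppression (EG-NMS) without blocking Thread S; Thread E performs zero-shot weather classification with CLIP, so a new weather category needs only a text prompt, with no labeled data or retraining. The design improves detection robustness in adverse weather while preserving real-time operation. Through a formal analysis of "annotation completeness bias", the paper also argues that the reported F1 scores are lower bounds on the true gain and that recall is the headline metric unaffected by annotation gaps.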

Link: https://arxiv.org/abs/2605.19837
Authors: Sherif Khairy, Catherine M. Elias
Affiliations: German University in Cairo (GUC); C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments:

Abstract:Adverse weather (rain, fog, sand, and snow) degrades camera-based object detection in autonomous vehicles. Existing enhancement-then-detect approaches stall the safety-critical perception loop, violating hard real-time requirements. Progress on this problem is also constrained by an under-recognized evaluation ceiling: ground truth annotated on degraded images cannot credit a detector that recovers objects the annotators themselves could not see, so a genuinely useful enhancement can register as a near-flat F1 gain. This paper presents CADENet (Condition-Adaptive Asynchronous Dual-stream Enhancement Network), a training-free three-thread system: Thread S (YOLOv11n) delivers detections at full frame rate with zero added latency; Thread Q applies condition-adaptive enhancement (CAPE) and fuses results via entropy-guided NMS (EG-NMS) without blocking Thread S; Thread E provides CLIP zero-shot weather classification, so new weather categories require only a new text prompt, with no labeled data and no retraining. Evaluated on 1327 DAWN images (YOLOv11m, IoU = 0.5, confidence = 0.25), CADENet achieves Recall = 0.0103 (micro), F1 = 0.0230 on snow, and F1 = 0.0038 on rain. We formalize the annotation completeness bias on DAWN-class data, so the reported F1 values are lower bounds on the true gain; recall is the annotation-gap-immune headline metric. Thread S sustains approximately 44 FPS regardless of enhancement load. No model retraining or additional sensor hardware is required.
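
The annotation completeness argument can be illustrated with a small worked example. The sketch below uses hypothetical counts (not the paper's data): an enhanced detector recovers real objects that the annotators could not see, those detections are scored as false positives, and F1 falls even though recall against the labels rises.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical scene: 6 objects are labelled; 3 more real objects were
# invisible to the annotators in the degraded image and so carry no label.

# Baseline detector: finds 5 of the 6 labelled objects, nothing else.
baseline = precision_recall_f1(tp=5, fp=0, fn=1)

# Enhanced detector: finds all 6 labelled objects plus the 3 unlabelled
# real objects, which the incomplete ground truth counts as false positives.
enhanced = precision_recall_f1(tp=6, fp=3, fn=0)

print("baseline (P, R, F1):", baseline)
print("enhanced (P, R, F1):", enhanced)
```

Here recall improves from about 0.83 to 1.0 while F1 drops from about 0.91 to 0.8, which is the sense in which an F1 measured against incomplete labels can understate a genuinely better detector.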

[NLP-25] Mega-ASR: Towards In-the-wild2 Speech Recognition via Scaling up Real-world Acoustic Simulation

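Quick read: This paper addresses the "acoustic robustness bottleneck" that automatic speech recognition (ASR) systems face in real-world environments: under severe, compositional acoustic distortions, models tend to lose acoustic grounding and produce omissions or hallucinations. The key to the solution is the Mega-ASR framework, whose core contributions are: 1) the large-scale compound dataset Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios; and 2) training with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization, which together optimize the model from the acoustic level to the semantic level. Experiments show that Mega-ASR outperforms prior state-of-the-art methods on several adverse-condition ASR benchmarks and, on complex compositional acoustic scenarios, reduces relative word error rate (WER) by more than 30%, offering a scalable paradigm for robust ASR in the wild.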

Link: https://arxiv.org/abs/2605.19833
Authors: Zhifei Xie, Kaiyu Pang, Haobin Zhang, Deheng Ye, Xiaobin Hu, Shuicheng Yan, Chunyan Miao
Affiliations: NTU; NUS; Shanghai AI Lab
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Project page: this https URL. Code, models, and dataset will be released. A robust ASR framework targeting in-the-wild and compositional acoustic scenarios where conventional ASR systems fail

Abstract:Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an “acoustic robustness bottleneck”: models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
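
As a worked example of what the paired figures in the abstract imply, the sketch below converts them to relative reductions. It assumes that both pairs are word error rates (lower is better) with Mega-ASR listed first, which the abstract suggests but does not state outright; the function name is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer

# (benchmark, Mega-ASR figure, prior-system figure) as quoted in the abstract.
reported = [
    ("VOiCES R4-B-F", 45.69, 54.01),
    ("NOIZEUS Sta-0", 21.49, 29.34),
]

for name, mega_asr, prior in reported:
    print(f"{name}: {relative_reduction(prior, mega_asr):.1%} relative reduction")
```

Under that assumption the two benchmarks correspond to relative reductions of roughly 15.4% and 26.8%, which is consistent with the abstract reserving the "over 30%" figure for the compositional scenarios.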

[NLP-26] From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

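Quick read: This paper addresses the fact that current approaches to high-level scene understanding and planning for autonomous vehicles based on large language models (LLMs) and large multimodal models (LMMs) treat time as a secondary property, which leads to inconsistent reasoning about continuous actions and undermines safety and interpretability. The key to the solution is temporal conditioning: strengthening temporal awareness in inter-agent communication to improve the coherence of scene-to-plan reasoning. The study designs three planner architectures with progressively greater temporal integration and evaluates them on subsets of the BDD-X dataset. Temporal conditioning yields no statistically significant improvement on standard natural language processing (NLP) metrics, but it elicits predictive hazard reasoning, stable corrective behavior, and strategic divergence, suggesting that temporal modeling has potential value for complex driving decisions. The paper also establishes the first empirical benchmark for temporally aware scene-to-plan reasoning.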

Link: https://arxiv.org/abs/2605.19824
Authors: Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein
Affiliations: German University in Cairo (GUC); C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems; Deggendorf Institute of Technology; IAV GmbH
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

[NLP-27] LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

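Quick read: This paper addresses the lack of work on automatically generating and evaluating legal propositions in legal natural language processing (Legal NLP). The core challenge is how to use large language models (LLMs) to generate propositions that are legally well-formed and substantive, and how to build a reliable, interpretable evaluation scheme. The key to the solution is LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions, enabling systematic assessment of LLM-generated propositions. Using this rubric, the authors build a dataset of two experts' annotations for 100 legal propositions. They find that LLMs can generate high-quality propositions, though quality is higher for well-established cases than for recent ones; rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but still miss the finer-grained distinctions that human experts identify.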

Link: https://arxiv.org/abs/2605.19815
Authors: Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich
Affiliations: University of Copenhagen; Umeå University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remains under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts’ annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

[NLP-28] Chunking German Legal Code

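Quick read: This paper studies how chunking strategies affect legal question answering with Retrieval-Augmented Generation (RAG) on the German Civil Code. The core challenge is to maintain high recall while controlling computational cost and storage overhead. The key finding is that chunking aligned with the inherent structure of the legal text (such as sections and subsections) not only achieves higher recall than more complex, LLM-intensive methods (such as contextual chunking, RAPTOR hierarchical retrieval, or Lumber-style chunking), but also lowers query latency, index build time, and storage requirements. The results reveal a trade-off between semantic enrichment and operational cost, and show that preserving domain-specific structure is an important prerequisite for effective legal information retrieval.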

Link: https://arxiv.org/abs/2605.19806
Authors: Max Prior, Natalia Milanova, Andreas Schultz
Affiliations: Technical University of Munich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
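
The contrast between structural and fixed-size chunking can be sketched in a few lines. This is an illustration only: it assumes sections are introduced by a line starting with "§ <number>", the sample text is a placeholder rather than actual statute wording, and the function names are hypothetical rather than the authors' pipeline.

```python
import re

def chunk_by_section(text):
    """Split at each line that starts with '§ <number>', keeping heading and body together."""
    parts = re.split(r"(?m)^(?=§\s*\d+)", text)
    return [part.strip() for part in parts if part.strip()]

def chunk_fixed(text, size, overlap=0):
    """Split into windows of `size` words, ignoring any document structure."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# Placeholder text (not real statute wording).
sample = (
    "§ 1 Heading A\n"
    "First sentence. Second sentence.\n"
    "§ 2 Heading B\n"
    "Another sentence.\n"
)

for chunk in chunk_by_section(sample):
    print(repr(chunk))
print(chunk_fixed(sample, size=5))
```

The structural splitter returns one chunk per section, so a section-level gold label maps to exactly one chunk, whereas the fixed-size windows can straddle a section boundary.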

[NLP-29] Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs

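Quick read: This paper addresses how socially interactive agents (SIAs) can establish appropriate trust with users in daily life, that is, how to ensure that users' perception of an agent's ability and benevolence matches its actual behavior. The key to the solution is a novel method that uses large language models (LLMs) to automatically generate multimodal behaviors (speech, intonation, gesture, and facial expression) that reflect specified levels of ability and benevolence. By analyzing a large dataset of multimodal behaviors generated by GPT-5.4, the study verifies that the model stays coherent across modalities, and a Random Forest feature importance analysis shows that the generated behaviors match theoretical expectations. A user study shows that participants perceive the intended differences, offering a feasible path toward trust-calibrated interaction.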

Link: https://arxiv.org/abs/2605.19798
Authors: Lucie Galland, Chloé Clavel, Magalie Ochs
Affiliations: LIS Laboratory, Amu, Marseille, France; Inria Paris, Paris, France
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As Socially Interactive Agents (SIAs) become increasingly integrated into daily life, the ability to calibrate user trust to an agent’s actual capabilities would help ensure appropriate usage of these agents. In this paper, we explore the capacity of Large Language Models (LLMs) to generate multimodal behaviors (verbal, vocal, gestural, and facial expression modalities) that reflect varying levels of ability and benevolence, two key dimensions of trustworthiness. We propose a novel method for automatically generating behaviors aligned with specific levels of these traits, a first step towards enabling nuanced and trust-calibrated interactions. By analyzing a large dataset of multimodal transcripts generated by LLMs, we demonstrate that GPT-5.4 is able to produce coherent behavior across different modalities (text, intonation, facial expression, and gesture). Using Random Forest feature importance analysis, we show that the generated behaviors align with theoretical expectations for ability and benevolence. However, we also find that when gender is specified in the prompt, LLMs tend to reproduce societal gender stereotypes, associating male agents’ behaviors with high ability and female agents’ behaviors with high benevolence. To validate our approach, we conducted a user study on Prolific using a within-subjects design. Participants perceived different levels of ability and benevolence in the generated behaviors, in line with the intended instructions.

[NLP-30] Synthesis and Evaluation of Long-term History-aware Medical Dialogue AAMAS2026

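Quick read: This paper addresses the limited ability of current healthcare agents to remember and reason over a patient's long-term medical history, and in particular the lack of datasets with realistic, continuous long-term dialogue timelines, which makes systematic evaluation difficult. The key to the solution is a knowledge-guided three-stage framework: first, construct synthetic patient profiles with diverse disease progression and complication trajectories; second, generate multi-turn dialogues for each encounter; third, integrate these dialogues into a coherent longitudinal medical dialogue dataset, MediLongChat. The framework comes with three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) for evaluating memory and cross-session reasoning, and a multi-dimensional evaluation scheme that combines vector-based metrics with LLM-as-a-judge assessments: the automatic measures Faithfulness, Coherence, and Diversity, plus the LLM-based measures Correctness and Realism. Experiments show that even state-of-the-art large language models struggle on MediLongChat, underscoring the benchmark's usefulness and the need for enhancement methods tailored to healthcare.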

Link: https://arxiv.org/abs/2605.19766
Authors: Hebin Hu, Renke Dai, Ah-Hwee Tan, Yilin Kang
Affiliations: South China University of Electric Power; Singapore Management University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAMAS 2026

Abstract:An effective healthcare agent must be able to recall and reason over a patient’s longitudinal medical history. However, the absence of datasets with realistic long-term dialogue timelines limits systematic evaluation. Real clinical text is constrained by privacy and ethics, while existing benchmarks focus on isolated interactions, failing to capture cross-session reasoning. We introduce a framework for synthesizing high-quality, long-term medical dialogues with LLMs. Our approach entails a knowledge-guided decomposition into three stages: constructing synthetic patient profiles with diverse disease and complication trajectories, generating multi-turn dialogues per encounter, and integrating them into a coherent longitudinal history dataset, MediLongChat. We establish three benchmark tasks (In-dialogue Reasoning, Cross-dialogue Reasoning, and Synthesis Reasoning) to evaluate the memory capabilities of healthcare agents. To assess data quality, we introduce a multi-dimensional evaluation framework combining vector-based metrics with LLM-as-a-judge assessments. Specifically, we define automatic measures (Faithfulness, Coherence, and Diversity) together with two LLM-based evaluations: Correctness and Realism. Benchmark experiments show that even state-of-the-art LLMs struggle with MediLongChat. These findings highlight the benchmark’s applicability and underscore the need for tailored methods to advance healthcare agents.

[NLP-31] What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code ICML2026

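Quick read: This paper addresses an open question in foundation language model training: whether code acts as a general enhancer of reasoning ability, and how it relates to knowledge-intensive tasks such as complex mathematical reasoning. The key to the solution is a set of controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. The experiments show that code by itself does not improve reasoning in general and instead competes with knowledge-intensive tasks; the reasoning gains come from cross-domain structured reasoning traces (such as code-text and math-text mixtures) rather than from executable code alone. In addition, increasing the density of structured math-domain samples within a fixed budget substantially improves difficult mathematical reasoning while keeping programming performance stable, suggesting that cognitive scaffolds can mitigate cross-domain trade-offs.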

Link: https://arxiv.org/abs/2605.19762
Authors: Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted by ICML 2026, 22 pages, 10 figures

Abstract:Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

[NLP-32] TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

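Quick read: This paper addresses the fact that existing text-enhanced graph anomaly detection methods overlook the topological and structural semantics of nodes, that is, how to fuse a node's structural information with its content features to detect more sophisticated anomalies. The key to the solution is TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a data augmentation framework whose core idea is to translate node topological properties into natural language descriptions, have a large language model (LLM) turn them into high-level semantic embeddings, adaptively fuse these embeddings with the original node attributes through a gated dual-branch autoencoder, and compute the anomaly score from the joint reconstruction error, thereby capturing deviations in both observable attributes and LLM-derived semantic expectations.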

Link: https://arxiv.org/abs/2605.19738
Authors: Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia
Affiliations: Jilin University; Adelaide University; RMIT University; Nanyang Technological University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures

Abstract:Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node’s inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), a novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at this https URL.
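
The abstract names two mechanisms, gated fusion and a combined reconstruction error, without giving their form. The sketch below shows one generic way such pieces are often written; the scalar sigmoid gate, the weighting factor `alpha`, and all function names are assumptions for illustration, not TERGAD's actual architecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(h_attr, h_text, gate_weights, gate_bias=0.0):
    """Blend two same-sized embeddings with one scalar gate computed from both (assumed form)."""
    z = sum(w * v for w, v in zip(gate_weights, h_attr + h_text)) + gate_bias
    g = sigmoid(z)
    return [g * a + (1.0 - g) * t for a, t in zip(h_attr, h_text)]

def anomaly_score(x, x_hat, adj_row, adj_row_hat, alpha=0.5):
    """Weighted sum of attribute and structure reconstruction errors (alpha is hypothetical)."""
    attr_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    struct_err = sum((a - b) ** 2 for a, b in zip(adj_row, adj_row_hat))
    return alpha * attr_err + (1.0 - alpha) * struct_err

# With zero gate weights the gate is 0.5, so the fusion is a plain average.
print(gated_fuse([1.0, 0.0], [0.0, 1.0], gate_weights=[0.0, 0.0, 0.0, 0.0]))

# A node whose attributes reconstruct perfectly but whose edge does not.
print(anomaly_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 0.0]))
```

In a trained model the gate weights would be learned, so each node decides how much to rely on its raw attributes versus the text-derived embedding; a node that reconstructs poorly on either branch receives a higher score.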

[NLP-33] ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

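Quick read: This paper addresses the high computational cost that current graph-structured Retrieval-Augmented Generation (RAG) systems incur by relying on large language models (LLMs) to extract entities, relations, and summaries during indexing; on large corpora, the token and wall-clock costs of these LLM calls grow substantially. The key to the solution is ContextRAG, a graph RAG system that builds its graph topology without LLM-based entity or relation extraction: it derives a fuzzy concept graph from chunk embeddings using residual-quantization k-means and Formal Concept Analysis under Lukasiewicz residuated logic, and produces bridge-like and meet-derived context nodes through soft fuzzy join and meet operations, in place of graph edges written by an LLM. In the experiments, ContextRAG completes indexing on a 130-task subset with only 30 LLM calls and 22,073 tokens, whereas a local HiRAG reproduction required 870 calls and 3.54M tokens on a 20-task subset. ContextRAG reaches 36.8% F1 on multi-hop tasks, and queries that retrieve at least one lattice-derived node score 3.9 percentage points higher in F1 than those that do not, an association that is diagnostic rather than causal.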

Link: https://arxiv.org/abs/2605.19735
Authors: Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin
Affiliations: HSE University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint. 6 tables

Abstract:Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.
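
Two pieces of the abstract lend themselves to a short worked example: the Lukasiewicz operations that underlie the residuated logic, and the linear extrapolation of the HiRAG indexing cost. The t-norm and residuum below are the standard textbook definitions on [0, 1]; how ContextRAG applies them to chunk embeddings is not shown here, and the function names are hypothetical.

```python
def luk_tnorm(a, b):
    """Lukasiewicz t-norm (fuzzy conjunction) on [0, 1]."""
    return max(0.0, a + b - 1.0)

def luk_residuum(a, b):
    """Lukasiewicz residuum (fuzzy implication a -> b) on [0, 1]."""
    return min(1.0, 1.0 - a + b)

# Adjointness, the defining property of a residuated pair:
# tnorm(a, b) <= c  if and only if  a <= residuum(b, c).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
adjoint = all(
    (luk_tnorm(a, b) <= c) == (a <= luk_residuum(b, c))
    for a in grid for b in grid for c in grid
)
print("adjointness holds on the grid:", adjoint)

# Linear extrapolation of the HiRAG stress test quoted in the abstract:
# 3.54M indexing tokens on 20 tasks, scaled to 130 tasks.
hirag_tokens_20_tasks = 3.54e6
extrapolated = hirag_tokens_20_tasks * 130 / 20
print(f"extrapolated HiRAG indexing tokens: {extrapolated / 1e6:.2f}M")
print(f"ratio to ContextRAG's 22,073 tokens: {extrapolated / 22073:.0f}x")
```

The extrapolation comes to about 23.0M tokens, matching the abstract's "over 23M", or roughly a thousand times the 22,073 tokens reported for ContextRAG; as the abstract notes, the HiRAG figure comes from a run that failed during graph construction, so it is an estimate rather than a measured total.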

[NLP-34] Mathematical Reasoning in Large Language Models: Benchmarks Architectures Evaluation and Open Challenges

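Quick read: This paper addresses how to systematically understand the performance, limitations, and improvement paths of large language models (LLMs) on mathematical reasoning tasks. The key to the solution is a unified analytical framework covering a taxonomy of mathematical datasets (pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks), reasoning architectures and training strategies (such as tool integration, verifier-guided reasoning, and parameter-efficient adaptation), and a comparative analysis of evaluation metrics. The framework identifies recurring failure modes of current models in reasoning faithfulness, benchmark bias, and generalization, and outlines future research directions, including stronger symbolic grounding, more reliable evaluation, and more robust and trustworthy LLM-based reasoning systems.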

Link: https://arxiv.org/abs/2605.19723
Authors: Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima
Affiliations: SEECs; Malmö University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

[NLP-35] CAIT: A Syntactic Parsing Toolkit for Child-Adult InTeractions

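Quick read: This paper addresses the lack of efficient, accurate tools for syntactic analysis in child language acquisition research. Although the CHILDES corpus is a core resource for language acquisition research, computational tools for analyzing its syntactic structure remain limited. The key to the solution is to use the recently released UD-English-CHILDES treebank, which has gold-standard Universal Dependencies (UD) annotations, to train a state-of-the-art dependency parser tailored to child-adult interaction. The parser captures syntactic patterns in child-adult dialogue more accurately than general-purpose English parsers such as SpaCy and Stanza, and is released together with a part-of-speech tagger and an utterance-level construction tagger as the open-source Syntactic Parsing Toolkit for Child-Adult InTeractions (CAIT). An error analysis and a case study of how constructions are distributed across developmental time demonstrate the toolkit's usefulness for large-scale, reproducible language acquisition research.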

Link: https://arxiv.org/abs/2605.19718
Authors: Francesca Padovani, Xiulin Yang, Bastian Bunzeck, Jaap Jumelet, Yevgen Matusevych, Nathan Schneider, Arianna Bisazza
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:CHILDES is a paramount resource for language acquisition studies – yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child–adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Parsing Toolkit for Child–Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.

[NLP-36] LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets LREC2026

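Quick read: This paper addresses the difficulty of modeling investor sentiment in Arabic financial contexts, where the main challenges are linguistic complexity and scarce resources. The key to the solution is a large-scale Arabic financial sentiment analysis framework for the Saudi market that integrates official financial news and social media data to capture both institutional and public investor sentiment. A multi-stage pipeline (data collection, cleaning, deduplication, entity linking, and sentiment annotation) produces a large Arabic financial corpus of 84K samples. Transformer-based named entity recognition (NER) combined with a curated company lexicon maps textual mentions to canonical company identifiers, and sentiment is annotated with a five-class scheme, supporting company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange.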

Link: https://arxiv.org/abs/2605.19714
Authors: Mona H. Albaqawi, Eman M. Albalkhi, Joud A. Albaiti, Enrico Lopedoto
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted at the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with LREC 2026, Palma de Mallorca, Spain, May 2026. ISBN: 978-2-493814-52-4

Abstract:Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

[NLP-37] Can Large Language Models Reliably Correct Errors in Low-Resource ASR? A Contamination-Aware Case Study on West Frisian INTERSPEECH2026

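Quick read: This paper addresses the limited performance of automatic speech recognition (ASR) for low-resource languages, with Frisian as the case study, and the uncertainty over whether the gains that large language models (LLMs) achieve through generative error correction (GER) are affected by data contamination. The key to the solution is an offline Frisian dataset built from non-public texts and used for evaluation to control for potential contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results even surpassing oracle word error rates (WERs); comparable gains on the offline dataset indicate that the improvements reflect true correction ability. The paper also provides a detailed error analysis that reveals the models' correction patterns.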

Link: https://arxiv.org/abs/2605.19711
Authors: Yun Hao, Reihaneh Amooie, Wietse de Vries, Rik van Noord, Martijn Wieling
Affiliations: University of Groningen
Categories: Computation and Language (cs.CL)
Comments: Submitted to Interspeech 2026

Abstract:Automatic speech recognition (ASR) has improved substantially in recent years, yet performance remains limited for low-resource languages. Large language models (LLMs) have shown promise for improving ASR through generative error correction (GER), but their effectiveness in low-resource settings remains underexplored. In addition, it remains unclear to what extent data contamination influences the reported improvements in LLM-based GER. This study investigates LLM-based GER for low-resource Frisian. In addition to a public corpus, we construct and use a Frisian offline dataset with non-public texts for evaluation to control for potential data contamination. Results show that GER improves ASR performance in most settings, with the best GPT-5.1 results surpassing oracle WERs. Comparable gains on the offline dataset indicate that improvements reflect true correction ability. We further provide a detailed error analysis revealing model correction patterns.
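
The comparison against oracle WER can be made concrete with a small sketch. It assumes the oracle is the best hypothesis in an ASR N-best list, a common convention in GER work that the abstract does not spell out. The English sentences and the helper names `wer` and `oracle_wer` are illustrative only and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def oracle_wer(reference, nbest):
    """Assumed oracle: the lowest WER among the N-best hypotheses."""
    return min(wer(reference, hyp) for hyp in nbest)


# Hypothetical data for illustration.
reference = "the cat sat on the mat"
nbest = [
    "the cat sat on a mat",
    "the cat sad on the mat",
    "a cat sat on the hat",
]
corrected = "the cat sat on the mat"  # hypothetical LLM-corrected transcript

print(oracle_wer(reference, nbest))
print(wer(reference, corrected))
```

Because a GER model writes a new transcript rather than picking one from the list, its WER can fall below the N-best oracle, as the hypothetical `corrected` string does here.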

[NLP-38] OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

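[Quick read]: This paper targets the deployment bottleneck caused by the memory footprint of the Key-Value (KV) cache as large language models (LLMs) move toward long-context reasoning and multi-modal intelligence. The core challenge is that conventional per-channel quantization degrades markedly under extreme compression. The paper identifies the main cause as Token Norm Imbalance (TNI): when shared quantization parameters must span tokens with very different norms, errors are systematically amplified. The proposed solution, OScaR (Omni-Scaled Canalized Rotation), applies Canalized Rotation followed by Omni-Token Scaling to mitigate the TNI-induced variance along the sequence dimension at low complexity. Without an intricate quantization pipeline, it achieves near-lossless INT2 quantization on X-LLMs (text-only, multi-modal, and omni-modal LLMs). Compared with the BF16 FlashDecoding-v2 baseline, it reaches up to a 3.0x decoding speedup, a 5.3x reduction in memory footprint, and a 4.1x increase in throughput, defining a new Pareto front.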

Link: https://arxiv.org/abs/2605.19660
Authors: Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong
Affiliations: Tsinghua University; Meituan LongCat Team; The University of Hong Kong; The University of Edinburgh; UCAS; The Hong Kong Polytechnic University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Under review

Abstract:The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, a 5.3x reduction in memory footprint, and a 4.1x increase in throughput. The code for OScaR is publicly available at this https URL.
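
To see why Token Norm Imbalance hurts per-channel quantization, the following toy sketch may help. It is not the OScaR algorithm: it omits Canalized Rotation entirely and uses plain per-token L2 normalization as an assumed stand-in for Omni-Token Scaling, whose exact form the abstract does not give. The data, the 50x outlier, and all function names are hypothetical.

```python
import math
import random


def quantize_per_channel(rows, bits=2):
    """Min-max quantization with one (offset, step) pair per channel (column)."""
    levels = 2 ** bits - 1
    recon = [[0.0] * len(rows[0]) for _ in rows]
    for c in range(len(rows[0])):
        column = [row[c] for row in rows]
        lo, hi = min(column), max(column)
        step = (hi - lo) / levels or 1.0  # guard against a constant column
        for t, value in enumerate(column):
            recon[t][c] = lo + round((value - lo) / step) * step
    return recon


def norm(row):
    return math.sqrt(sum(x * x for x in row))


def mean_relative_error(original, recon):
    """Average over tokens of ||original - recon|| / ||original||."""
    errors = []
    for o, r in zip(original, recon):
        diff = [a - b for a, b in zip(o, r)]
        errors.append(norm(diff) / norm(o))
    return sum(errors) / len(errors)


rng = random.Random(0)
keys = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
keys[0] = [50.0 * x for x in keys[0]]  # one token with a much larger norm

# Plain per-channel INT2: the large token stretches the shared channel ranges.
plain_error = mean_relative_error(keys, quantize_per_channel(keys))

# Assumed per-token scaling: normalize each token, quantize, then rescale.
norms = [norm(row) for row in keys]
unit = [[x / n for x in row] for row, n in zip(keys, norms)]
rescaled = [[x * n for x in row]
            for row, n in zip(quantize_per_channel(unit), norms)]
scaled_error = mean_relative_error(keys, rescaled)

print(plain_error, scaled_error)
```

In this toy setting the single high-norm token widens the shared per-channel range, so the low-norm tokens collapse onto one or two quantization levels; equalizing token norms before quantization lowers the mean relative error.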

[NLP-39] K-Quantization and its Impact on Output Performance

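[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) given their size, focusing on how quantization affects performance and accuracy. It systematically evaluates eight LLMs of different sizes at quantization levels from 2 to 6 bits on MMLU-Pro (knowledge processing and reasoning), CRUXEval (code comprehension), and MuSR (reading comprehension). The findings: higher precision (e.g., 8-bit Q8_0) consistently yields better performance, with diminishing returns; aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models degrade substantially; larger models are more resilient to aggressive quantization; and mid-sized models in the 7-9 billion parameter range strike the best balance between efficiency and resource usage. These results provide empirical evidence on the trade-offs between model size, quantization, and performance.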

Link: https://arxiv.org/abs/2605.19645
Authors: Robin Baki Davidsson, Pierre Nugues
Affiliations: Lund University
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 4 figures

Abstract:Recent advancements in large language models (LLMs) have shown their remarkable capacities in many NLP tasks. However, their substantial size often presents challenges for deployment. This necessitates efficient techniques for model compression, with quantization emerging as a prominent solution. Despite its benefits, the exact impact of quantization (from 2- to 6-bit) on the performance and accuracy of LLMs remains an active area of research. This paper investigates the performance of eight LLMs at various quantization levels, focusing on tasks such as MMLU-Pro for knowledge processing and reasoning, CRUXEval for code comprehension, and MuSR for reading comprehension. Our results show a consistent trend where higher precision (e.g., 8-bit Q8_0) yields improved performance, albeit with diminishing returns. Aggressive quantization (e.g., 2-bit Q2_K) usually retains acceptable accuracy, though some models show a substantial loss in performance. Our findings indicate that while lower bit precision generally reduces performance, the impact varies across models and tasks. Larger models show greater resilience to aggressive quantization, but can still undergo significant drops at lower precision levels. Mid-sized models in the 7-9 billion parameter range strike an optimal balance between efficiency and resource usage. Such results provide insights into the trade-offs between model size, quantization, and performance.
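
The precision trade-off described above can be illustrated with a generic block-wise quantizer. This is a simplified sketch, not the actual Q8_0 or Q2_K layouts named in the abstract (real K-quant formats use super-blocks with their own quantized scales, which this code does not reproduce). The function names, the block size of 32, and the Gaussian weights are assumptions for illustration.

```python
import math
import random


def quantize_blockwise(weights, bits, block_size=32):
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # largest representable integer magnitude
    recon = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        scale = amax / qmax if amax > 0 else 1.0
        # Quantize to an integer grid, then dequantize back to floats.
        recon.extend(round(w / scale) * scale for w in block)
    return recon


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]  # hypothetical weights

errors = {bits: rms_error(weights, quantize_blockwise(weights, bits))
          for bits in (2, 4, 6, 8)}
for bits, err in errors.items():
    print(bits, err)
```

The reconstruction error shrinks as the bit width grows, which is the weight-level counterpart of the accuracy trend the paper measures on downstream benchmarks.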

[NLP-40] optimize_anything: A Universal API for Optimizing any Text Parameter

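[Quick read]: This paper asks whether a single optimization system can match or surpass specialized tools across fundamentally different domains (e.g., AI reasoning, cloud scheduling, CUDA kernel generation). The key is to formulate every optimization problem as improving a text artifact evaluated by a scoring function, with a large language model (LLM) driving the search. The system supports single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs, and it achieves state-of-the-art (SOTA) results on six diverse tasks. Ablations show that actionable side information yields faster convergence and higher final scores than score-only feedback, and that multi-task search outperforms independent optimization at an equivalent per-problem budget, with benefits growing with the number of related tasks. The authors present this as the first demonstration that LLM-based text optimization is a general-purpose problem-solving paradigm that unifies tasks traditionally requiring domain-specific algorithms.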

Link: https://arxiv.org/abs/2605.19633
Authors: Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia
Affiliations: UC Berkeley; MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)
Comments: 16 pages, 11 figures; Blog: this https URL

Abstract:Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system, supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs, achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash’s ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve’s reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at this https URL.

[NLP-41] LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

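[Quick read]: This paper addresses weaknesses in how the natural-language logical reasoning of large language models (LLMs) is evaluated: many existing benchmarks are generated from templates, provide only coarse or unaudited formal annotations, and are quickly saturated by frontier models. The solution is LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline has experts author and audit natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for grading natural-to-formal translation, and hardens selected items through a closed-loop adversarial workflow. The benchmark has two paired subsets: a 246-item Base subset with 1,400 rubric atoms and a 190-item Hard subset with 938 multi-step sub-questions. The best evaluated model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score is only 60.16%, revealing a substantial gap in rigorous logical reasoning.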

Link: https://arxiv.org/abs/2605.19597
Authors: Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang
Affiliations: Fudan University; Tencent
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at this https URL.
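
Solver verification of an annotated answer amounts to checking that the premises together with the negated conclusion are unsatisfiable. The sketch below shows that check with a brute-force truth table in plain Python as a stand-in for Z3, which is what the benchmark actually uses. The three-person puzzle and the `entails` helper are hypothetical and not drawn from LLMEval-Logic.

```python
from itertools import product


def entails(variables, premises, conclusion):
    """True if every assignment satisfying all premises also satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True


# Hypothetical item: a, b, c mean that Alice, Bob, Carol attend the meeting.
variables = ["a", "b", "c"]
premises = [
    lambda e: (not e["a"]) or e["b"],        # if Alice attends, Bob attends
    lambda e: (not e["b"]) or (not e["c"]),  # if Bob attends, Carol does not
    lambda e: e["a"],                        # Alice attends
]

carol_absent = entails(variables, premises, lambda e: not e["c"])
carol_present = entails(variables, premises, lambda e: e["c"])
print(carol_absent, carol_present)
```

A truth table is only practical for a handful of propositional variables; a solver such as Z3 performs the same unsatisfiability check over much richer formalizations.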

[NLP-42] GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

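[Quick read]: This paper addresses three problems in long-context reinforcement learning: homogeneous data construction, reward formulations that poorly reflect practical needs, and difficult multitask optimization. Existing methods often rely on designing increasingly complex retrieval paths, which narrows task coverage. The solution has two parts. (1) A capability-oriented data construction paradigm, with an open release of 23K RLVR samples, the construction pipeline, and the training code; the dataset spans 9 task types, each paired with its natural evaluation metric, and combines curated open-source samples with synthetic samples whose QA pairs are generated from real documents (e.g., books and academic papers). (2) TMN-Reweight, which aligns reward scales across tasks through task-level mean normalization and adds difficulty-adaptive weighting for more reliable advantage estimation, giving more stable multitask optimization under heterogeneous rewards while preserving general capabilities.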

Link: https://arxiv.org/abs/2605.19577
Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

[NLP-43] Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

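[Quick read]: This paper addresses "library drift" in self-evolving skill libraries: without outcome-driven lifecycle management, unbounded skill accumulation causes retrieval degradation, false-positive injections, and performance stagnation. The solution has three parts: (1) a reproducible trigger, with ablations that isolate the drift mechanism: disabling skill injection gives a flat floor (+0.002), while premature retirement is actively harmful (-0.019); (2) trace-level diagnostics, namely an append-only evidence log of per-skill contribution scores, attribution verdicts, and router engagement metrics, which exposes the failure before end-task scores deteriorate; and (3) a verified lightweight governance recipe (outcome-driven retirement, a bounded active cap, and a meta-skill authoring prior) that raises held-out pass@1 on MBPP+ hard-100 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328). Eight ablations identify which governance mechanisms are load-bearing, giving a practical playbook for diagnosing library drift in any self-evolving agent.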

Link: https://arxiv.org/abs/2605.19576
Authors: Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Abstract:Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp on SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
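
Two pieces of the governance recipe, outcome-driven retirement and a bounded active cap, can be sketched on top of an append-only evidence log. This is a minimal illustration under assumed rules, not the authors' implementation: the `SkillLibrary` class, the retirement threshold (mean contribution at or below zero), the `min_evidence` guard, and the skill names are all hypothetical, and the meta-skill authoring prior is not modeled.

```python
from collections import defaultdict


class SkillLibrary:
    def __init__(self, active_cap=2, min_evidence=2):
        self.active_cap = active_cap      # bounded number of active skills
        self.min_evidence = min_evidence  # observations needed before retiring
        self.evidence_log = []            # append-only: (round, skill, contribution)
        self.retired = set()

    def record(self, round_id, skill, contribution):
        """Append one per-skill contribution score; entries are never edited."""
        self.evidence_log.append((round_id, skill, contribution))

    def govern(self):
        """Retire skills with enough non-positive evidence, then cap the active set."""
        per_skill = defaultdict(list)
        for _, skill, contribution in self.evidence_log:
            per_skill[skill].append(contribution)
        means = {s: sum(v) / len(v) for s, v in per_skill.items()}
        for skill, values in per_skill.items():
            # Assumed rule: retire only once evidence is sufficient.
            if len(values) >= self.min_evidence and means[skill] <= 0:
                self.retired.add(skill)
        candidates = [s for s in means if s not in self.retired]
        candidates.sort(key=lambda s: means[s], reverse=True)
        return candidates[: self.active_cap]


lib = SkillLibrary(active_cap=2, min_evidence=2)
for round_id, skill, score in [
    (1, "parse_csv", 0.2), (2, "parse_csv", 0.1),
    (1, "regex_helper", -0.1), (2, "regex_helper", -0.05),
    (2, "sort_dicts", 0.05),
    (1, "memoize", 0.3), (2, "memoize", 0.4),
]:
    lib.record(round_id, skill, score)

active = lib.govern()
print(active, lib.retired)
```

The `min_evidence` guard reflects the abstract's observation that premature retirement is actively harmful: a skill with a single observation stays eligible instead of being dropped on thin evidence.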

[NLP-44] A Data-Driven Approach to Idiomaticity Based on Experts Criteria in Theoretical Linguistics

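[Quick read]: This paper asks how the idiomaticity of multi-word expressions (MWEs) can be analyzed quantitatively along multiple criteria, and whether absolutely idiomatic expressions exist. The approach builds a scheme of 16 lexical, grammatical, and other criteria, has linguistics experts annotate 286 MWEs with it, and analyzes the resulting distribution. The results indicate that there are no absolutely idiomatic expressions, so idiomaticity is better seen as a continuum than as a binary property. Lexical criteria appear most influential, grammatical criteria apply only under certain conditions, and the presence of obsolete words and grammar affects whether an MWE can be replaced with a single word.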

Link: https://arxiv.org/abs/2605.19575
Authors: Elena Mikhalkova, Anastasiya Vishnyakova, Anastasiya Drozdova, Polina Gavin, Aleksander Zhmykhov, Timofey Protasov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This article analyzes 286 multi-word expressions (MWEs) against 16 lexical, grammatical, and other criteria described in theoretical books and papers on the notion of idiomaticity. The MWEs were collected from the same theoretical sources, and a group of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences the ability of an MWE to be replaced with one word.

[NLP-45] m3BERT: A Modern Multi-lingual Matryoshka Bidirectional Encoder KDD2026

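[Quick read]: This paper addresses the difficulty of adapting pretrained embedding models to diverse deployment scenarios in industrial information retrieval, where business scenarios differ in resource budgets (e.g., compute and storage) and accuracy targets, and models with fixed architectures and embedding dimensionalities fit poorly. The solution is m3BERT, a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, whose novel pretraining strategy jointly optimizes representations across both transformer layers and multiple embedding dimensions, so a single model can be tailored to different resource constraints while staying consistent with pretraining. m3BERT uses three-stage pretraining (monolingual pretraining, multilingual adaptation, and continual pretraining on a massive web corpus) and significantly outperforms state-of-the-art embedding models on Bing-Click, a large-scale industrial retrieval dataset; experiments on public datasets further support the generality of the approach.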

Link: https://arxiv.org/abs/2605.19568
Authors: Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su
Affiliations: Xiamen University; Microsoft; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: KDD 2026

Abstract:Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
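
At inference time, the embedding-dimension side of a Matryoshka model is typically used by keeping a prefix of the vector and renormalizing it. The sketch below shows only that step; it does not cover the layer-wise side of m3BERT or its pretraining objective. The 8-dimensional vectors and the helper names are made up for illustration.

```python
import math


def truncate(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = embedding[:dim]
    length = math.sqrt(sum(x * x for x in prefix))
    return [x / length for x in prefix]


def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings: doc_a is related to the query, doc_b is not.
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
doc_a = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, 0.02, -0.05]
doc_b = [-0.3, 0.7, 0.1, 0.5, 0.1, -0.2, 0.05, 0.0]

for dim in (8, 4, 2):
    q = truncate(query, dim)
    score_a = cosine(q, truncate(doc_a, dim))
    score_b = cosine(q, truncate(doc_b, dim))
    print(dim, round(score_a, 3), round(score_b, 3))
```

In this toy example the ranking of the two documents is the same at every prefix length; a Matryoshka-style objective is what encourages that behavior in a trained model, since an ordinary embedding gives no such guarantee when truncated.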

[NLP-46] Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

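[Quick read]: This paper asks how to efficiently inject continually evolving domain-specific skills into Vision-Language Models (VLMs) without the extensive dataset curation and compute that Supervised Fine-Tuning (SFT) requires. Model merging reduces training cost, but cross-modal skill injection has not been analyzed systematically with respect to applicable scenarios, methods, and hyperparameters. The paper studies these three aspects. For scenarios, cross-modal skill injection generally works well in instruction-following and cross-lingual settings but struggles with mathematical reasoning. For methods, classic approaches such as TA and DARE consistently outperform alternative merging methods. For hyperparameters, it gives a systematic, quantitative analysis of the tuning on which these classic methods critically depend.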

Link: https://arxiv.org/abs/2605.19523
Authors: Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
Affiliations: Peking University; Tencent Inc.; The University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
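
The two classic merging methods named above, TA (commonly expanded as Task Arithmetic) and DARE (Drop And REscale), can be sketched on flat weight lists. The sketch assumes the VLM's language backbone and the domain-expert LLM were derived from the same base LLM, so that the expert's task vector can be added to the VLM's language weights; how the paper wires this up in detail is not stated in the abstract. All values, the `scale` and `drop_rate` settings, and the function names are hypothetical.

```python
import random


def task_vector(finetuned, base):
    """Task Arithmetic: the skill is the weight delta from the shared base."""
    return [f - b for f, b in zip(finetuned, base)]


def dare(delta, drop_rate, rng):
    """DARE: randomly zero delta entries and rescale the survivors by 1/(1 - p)."""
    keep = 1.0 - drop_rate
    return [d / keep if rng.random() < keep else 0.0 for d in delta]


def merge(target, delta, scale):
    """Add the scaled task vector to the target model's weights."""
    return [w + scale * d for w, d in zip(target, delta)]


# Hypothetical flat weight vectors for illustration.
base_llm = [1.0, 2.5, -0.5, 0.0]
expert_llm = [1.5, 2.0, -0.5, 0.25]   # domain expert fine-tuned from base_llm
vlm_backbone = [1.1, 2.4, -0.4, 0.1]  # VLM language weights, same base assumed

delta = task_vector(expert_llm, base_llm)
sparse_delta = dare(delta, drop_rate=0.5, rng=random.Random(0))

merged_ta = merge(vlm_backbone, delta, scale=0.5)
merged_dare = merge(vlm_backbone, sparse_delta, scale=0.5)
print(merged_ta)
print(merged_dare)
```

The merge coefficient (`scale`) and the DARE drop rate are the kind of hyperparameters whose tuning the paper analyzes; the values used here are arbitrary.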

[NLP-47] Base Models Look Human To AI Detectors

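[Quick read]: This paper reports a marked bias in commercial AI-text detectors: text generated by base models is often judged overwhelmingly human, whereas text from their instruction-tuned counterparts is not. This suggests that current detectors track artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. Building on this observation, the paper proposes Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively to evade detection while preserving semantics. Across the Llama-3 and Qwen-3 families (0.6B to 70B parameters), HIP consistently improves detector human-likeness and offers a stronger trade-off between semantic preservation and detector evasion than the tested baselines. The authors call for detector designs that model these factors more explicitly.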

Link: https://arxiv.org/abs/2605.19516
Authors: Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter
Affiliations: Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages, 9 figures

Abstract:As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, generated text from base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

[NLP-48] Position: The Turing-Completeness of Real-World Autoregressive Transformers Relies Heavily on Context Management ICML2026

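[Quick read]: This paper addresses a conceptual confusion in claims that Transformers are Turing-complete: the literature often conflates a fixed Transformer system setting with a scaling-family setting. Existing proofs are frequently established in the scaling-family setting (a family of different models handles different input lengths), whereas real-world large language model (LLM) deployment and the standard notion of Turing-completeness correspond more naturally to the fixed-system setting (one fixed model, coupled with a fixed context-management method, processes inputs of any length). The paper first formalizes the fixed-system setting, characterizing how real-world LLMs operate. It then argues that scaling-family results give theoretically meaningful resource bounds but do not establish Turing-completeness. Finally, it shows that different context-management methods can yield sharply different computational power and argues that context management is a central component determining the computational power of autoregressive Transformers.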

Link: https://arxiv.org/abs/2605.19514
Authors: Guanyu Cui, Zhewei Wei, Kun He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2026 Position Paper Track

Abstract:Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings: (i) a fixed Transformer system setting, in which a fixed autoregressive Transformer is coupled with a fixed context-management method to process inputs of different lengths step by step, and (ii) a scaling-family setting, in which a family of different models (with increasing context-window length or numerical precision) is used to handle different input lengths. Existing proofs of Transformer Turing-completeness are frequently established in setting (ii), whereas real-world LLM deployment and the standard notion of Turing-completeness correspond more naturally to setting (i). In this paper, we first formalize the fixed-system setting, thereby providing a concrete characterization of how real-world LLMs operate. We then argue that results proved in the scaling-family setting provide theoretically meaningful resource bounds but do not establish Turing-completeness, thereby clarifying a common misinterpretation of existing results. Finally, we show that different context-management methods can yield sharply different computational power, and we advocate the position that context management is a central component that critically determines the computational power of real-world autoregressive Transformers.

[NLP-49] Drifting Objectives for Refining Discrete Diffusion Language Models

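[Quick read]: This paper studies how to absorb part of the sampling-time correction of discrete diffusion language models (DDLMs) into training, by transferring the drifting mechanism that works for continuous generators to discrete text. The main challenge is the discrete interface: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. The proposed TokenDrift lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to the DDLM logits. With masked diffusion (MDLM) and uniform-state diffusion (DUO) backbones, TokenDrift improves generation quality at a fixed number of function evaluations (NFE), reducing generative perplexity (Gen.-PPL) at 4 NFEs by 89% and 86%, respectively.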

Link: https://arxiv.org/abs/2605.19470
Authors: Daisuke Oba, Hiroki Furuta, Naoaki Okazaki
Affiliations: Institute of Science Tokyo; AIST; NII LLMC
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Discrete diffusion language models (DDLMs) generate text by iteratively denoising categorical token sequences, while recent drifting methods for continuous generators suggest that part of this sampling-time correction can instead be absorbed into training through an anti-symmetric fixed-point objective. We study how to transfer this principle to DDLMs, where the main challenge is the interface with discrete text: hard token samples are non-differentiable, and categorical predictions do not directly provide continuous samples to drift. We formulate TokenDrift, a drifting objective that lifts categorical predictions to soft-token features, applies anti-symmetric drifting in a frozen semantic space, and backpropagates the resulting stop-gradient feature target to DDLM logits. In controlled continual-training experiments with masked and uniform-state diffusion backbones, TokenDrift improves fixed-NFE generation quality over matched continuation baselines, reducing Gen.-PPL at 4 NFEs by 89% on MDLM and 86% on DUO. These results suggest that drifting can provide a practical refinement objective for DDLMs.

[NLP-50] CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

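[Quick read]: This paper addresses imprecise credit assignment in reinforcement learning with verifiable rewards (RLVR): every token of a correct solution receives the same reward signal, so decisive reasoning steps cannot be distinguished from grammatical filler. The proposed Contrastive Evidence Policy Optimization (CEPO) adds a wrong-answer teacher and asks, at every token, not only whether the correct answer favors the token but also whether the wrong answer disfavors it; a token satisfying both is a genuine reasoning step, and one satisfying neither is filler. The wrong-answer teacher is built from rejected rollouts already in the training batch, at no extra sampling cost. The authors prove that CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. CEPO reaches 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, versus 41.17% and 57.43% for GRPO; distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, consistent with the information leakage the theory predicts.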

Link: https://arxiv.org/abs/2605.19436
Authors: Ahmed Heakl, Abdelrahman M. Shaker, Youssef Mohamed, Rania Elbadry, Omar Fetouh, Fahad Shahbaz Khan, Salman Khan
Affiliations: MBZUAI; Linköping University; Australian National University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:When a model produces a correct solution under reinforcement learning with verifiable rewards (RLVR), every token receives the same reward signal regardless of whether it was a decisive reasoning step or a grammatical filler. A natural fix is to condition the model on the correct answer as a teacher, identifying tokens it would have generated differently had it known the answer. Prior work shows this either corrupts training by leaking the answer into the gradient, or produces a weak signal that cannot distinguish decisive steps from filler, since both look equally surprising relative to the model’s baseline. We propose Contrastive Evidence Policy Optimization (CEPO), which asks a sharper question at every token: not just “does the correct answer favor this token?” but “does the correct answer favor it while the wrong answer disfavors it?” A token satisfying both is a genuine reasoning step; one satisfying neither is filler. The wrong-answer teacher is constructed from rejected rollouts already in the training batch, incurring no additional sampling cost. We prove CEPO inherits all structural safety guarantees of the prior state of the art while strictly sharpening credit at decisive tokens, with the improvement vanishing exactly at filler positions. Empirically, CEPO achieves 43.43% and 60.56% average accuracy across five multimodal mathematical reasoning benchmarks at 2B and 4B scale, respectively, versus 41.17% and 57.43% for GRPO under identical training budgets. Distribution-matching self-distillation methods (OPSD, SDPO) fall below the untrained baseline, empirically confirming the information leakage our theory predicts. Our code is available at this https URL.

[NLP-51] Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation

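[Quick read]: This paper addresses the high computational cost of long chain-of-thought (CoT) reasoning in large language models (LLMs) and the dual exposure biases in existing reasoning distillation. Off-policy distillation trains only on teacher-generated trajectories, so the mismatch with the contexts the student generates at inference causes exposure bias and error cascades. On-policy distillation lets the student explore its own trajectories but introduces a reciprocal reversed exposure bias: the teacher struggles to give useful guidance when conditioned on sub-optimal student-generated contexts. The proposed Monitoring Trajectories and Backtracking when it strays (MOTAB) monitors the student's on-policy generation against an adaptive safety boundary; when generation exceeds the threshold, it backtracks to the last safe state and uses teacher intervention to correct course. This tolerates minor student errors, which mitigates exposure bias, while avoiding sub-optimal contexts, which circumvents reversed exposure bias. On the LIMO-v2 and AceReason datasets, MOTAB yields a roughly 3% average improvement on reasoning tasks.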

Link: https://arxiv.org/abs/2605.19433
Authors: Bing Wang, Shaotian Yan, Chen Shen, Kaiyuan Liu, Sinan Fan, Ximing Li, Rui Miao, Xiaosong Yuan, Zhanming Shen, Jieping Ye
Affiliations: Jilin University; Tongyi Lab, Alibaba Group; Zhejiang University; RIKEN Center for Advanced Intelligence Project
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 26 pages, 8 figures

Abstract:Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student’s on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.

[NLP-52] LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

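[Quick read]: This paper addresses a limitation of Group Relative Policy Optimization (GRPO) in reinforcement learning alignment: because GRPO relies on a single statistical baseline such as the group mean, it collapses the relational topology of the trajectory space into a scalar, loses fine-grained preference information, and copes poorly with complex, rank-sensitive reward landscapes. The key to the solution is a new framework, Lambda Policy Optimization (LambdaPO), which recasts advantage estimation from a scalar into a decomposed pairwise preference structure: each trajectory's advantage is the sum of its reward differences against the other trajectories in its group, and each pairwise comparison is dynamically attenuated by the policy's probabilistic confidence in that preference. In addition, a semantic density reward, based on the precision-recall alignment between generated reasoning traces and ground-truth solutions, mitigates the sparsity of binary supervision. Together these extract finer-grained optimization signals from a group of trajectories and guide the large language model (LLM) toward better solutions.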

Link: https://arxiv.org/abs/2605.19416
Authors: Zhe Yuan,Yipeng Zhou,Jinghan Li,Xinyuan Chen,Bowen Deng,Zhiqian Chen,Liang Zhao
Affiliations: Pinterest; Facebook; University of Michigan - Ann Arbor; Mississippi State University; Carnegie Mellon University; Emory University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method’s reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy’s own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance compared to the baseline methods.
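
The pairwise advantage described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the attenuation formula, so the logistic weight on the policy's log-probability margin (in the spirit of LambdaRank-style losses), the `beta` parameter, and the function name `pairwise_advantages` are all hypothetical.

```python
import math


def pairwise_advantages(rewards, logps, beta=1.0):
    """Advantage of each rollout as a sum of attenuated pairwise reward gaps.

    rewards: scalar reward per rollout in one group.
    logps:   sequence log-probability of each rollout under the current policy.
    beta:    sharpness of the attenuation (assumed hyperparameter).
    """
    n = len(rewards)
    advantages = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue
            diff = rewards[i] - rewards[j]
            if diff == 0:
                continue  # tied rewards carry no preference signal
            # Log-probability margin in favour of the higher-reward rollout.
            margin = logps[i] - logps[j] if diff > 0 else logps[j] - logps[i]
            # Assumed attenuation: the weight shrinks once the policy already
            # prefers the better rollout with high confidence.
            weight = 1.0 / (1.0 + math.exp(beta * margin))
            total += diff * weight
        advantages.append(total)
    return advantages


group_adv = pairwise_advantages([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

With equal log-probabilities every pair gets weight 0.5, so the result reduces to half the summed reward gaps; because the weight is symmetric within a pair, the advantages in a group sum to zero, much like a mean-centred baseline.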

[NLP-53] EmbGen: Teaching with Reassembled Corpora

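[Quick read]: This paper addresses an efficiency bottleneck in adapting small instruction-tuned models to specialized domains: collecting high-quality instruction-response examples is too expensive. Existing pipelines that use a teacher large language model (LLM) to generate synthetic data tend to produce homogenized outputs and struggle to capture cross-passage or cross-document semantic dependencies. The key to the solution is the EmbGen pipeline: it first decomposes the corpus into entity-description pairs, then reassembles them using semantic structure inferred from embedding similarity, and finally generates question-answer pairs through proximity, intra-cluster, and inter-cluster sampling with system prompts specialized per cluster. Under fixed token budgets (5 million and 20 million), EmbGen improves Binary Accuracy over the strongest baseline by 12.5% (5M) and 88.9% (20M) on the most semantically heterogeneous dataset, while remaining competitive on the other, less heterogeneous datasets.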

Link: https://arxiv.org/abs/2605.19394
Authors: Arun K Lenin,Kai Rouse,Andrea Nicastro,Anna Leontjeva
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 images (32 pages with appendix)

Abstract:Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

[NLP-54] Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

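[Quick read]: This paper addresses the difficulty large language models (LLMs) have in balancing response length against accuracy during reasoning: existing entropy-based deep reasoning methods either lengthen outputs indiscriminately or sacrifice accuracy to shorten them. The key to the solution is Conditional Entropy Shaping (CES), a framework that uses token-level entropy as an uncertainty signal and adjusts entropy along the reasoning path conditionally: on correct paths it penalizes high-entropy "forking point" tokens to improve conciseness, and on incorrect paths it rewards them to encourage exploration and error correction. Built on DAPO, CES consistently improves average accuracy while reducing response length relative to DAPO across 12 mathematical benchmarks, with similar trends on a smaller 1.5B model and on out-of-domain tasks.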

Link: https://arxiv.org/abs/2605.19358
Authors: Shuyu Wei,Jian Sun,Delai Qiu,Yining Wang,Shengping Liu,Jiaen Liang,Ying Fu,Wei Huang,Jitao Sang
Affiliations: Beijing Jiaotong University; Unisound AI Technology Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy “forking point” tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
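
The conditional, bidirectional rule in the abstract can be illustrated with a small sketch. It is only a sketch under assumptions: the abstract does not state how "forking point" tokens are selected or how the shaping term enters the DAPO objective, so the fixed entropy `threshold`, the coefficient `coef`, and both function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_token_bonus(entropies, is_correct, threshold=1.0, coef=0.1):
    """Per-token shaping term for one sampled response.

    Tokens whose entropy exceeds `threshold` are treated as forking points
    (an assumed criterion). They are penalized when the response is correct
    and rewarded when it is incorrect; all other tokens get 0.
    """
    sign = -1.0 if is_correct else 1.0
    return [sign * coef * h if h > threshold else 0.0 for h in entropies]


# Entropy of a uniform two-way choice, then shaping for a correct response.
h_uniform = token_entropy([0.5, 0.5])
bonus = shaped_token_bonus([0.2, 1.5], is_correct=True)
```

Flipping `is_correct` to `False` flips the sign of the bonus on the same high-entropy token, which is the bidirectional behaviour the abstract describes.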

[NLP-55] SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models ACL2026

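[Quick read]: This paper addresses the lack of fine-grained, scalable, and application-aligned evaluation of large language models (LLMs) in scientific domains: most existing benchmarks are manually curated or domain-generic and fail to reflect the specific capabilities that real scientific tasks require. The key to the solution is a new framework named SciCustom, whose core mechanisms are: 1) organizing scientific knowledge into ontology-grounded knowledge units with controlled granularity, and training a tagger to map large-scale data instances into this knowledge space; 2) identifying the knowledge units relevant to a custom requirement through voting-based multi-model consensus; 3) retrieving relevant benchmarks efficiently via binary search, followed by proxy subset selection and data-grounded benchmark generation, so that fine-grained evaluation needs neither expert annotation nor synthetic question generation. Experiments show that SciCustom reveals differences in LLM scientific capabilities that standard benchmarks overlook, while remaining scalable and application-aware.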

Link: https://arxiv.org/abs/2605.19357
Authors: Yiyang Gu,Junwei Yang,Junyu Luo,Ye Yuan,Bin Feng,Yingce Xia,Shufang Xie,Kaili Liu,Bohan Wu,Qi Shi,Haoran Li,Beier Xiao,Zhiping Xiao,Xiao Luo,Weizhi Zhang,Philip S. Yu,Zequn Liu,Ming Zhang
Affiliations: Peking University; Zhongguancun Academy; IDEA; Xidian University; University of Washington; University of Wisconsin–Madison; University of Illinois Chicago
Categories: Computation and Language (cs.CL)
Comments: Accepted to ACL 2026 Main Conference

Abstract:Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at this https URL.

[NLP-56] IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

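[Quick read]: This paper addresses how structured data and knowledge-graph techniques can support systematic analysis of differences and trends in how Indian courts decide matrimonial disputes under Section 498A of the Indian Penal Code (on domestic cruelty), the Protection of Women from Domestic Violence Act, and Section 482 of the Code of Criminal Procedure. The key to the solution is IMLJD, an open dataset of 3,613 judgments from the Supreme Court of India (2000 to 2024) and the Karnataka High Court (2018 to 2024), with structured outcome labels, metadata-derived indicators, and a legal knowledge graph, which enables quantitative comparison of judicial practice. For example, quashing petitions succeed more often at the Supreme Court (57.6%) than at the Karnataka High Court (39.7%), and on the matched 2018 to 2024 period the gap widens to 19.6 percentage points, indicating that the finding is robust to temporal adjustment.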

Link: https://arxiv.org/abs/2605.19346
Authors: Joy Bose
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 2 figures, 5 tables. Dataset available at this http URL and Code at this http URL

Abstract:We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at this https URL and this https URL.
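
The figures quoted in the abstract are internally consistent, as a short arithmetic check shows. The variable names are illustrative only; every number is taken from the abstract above.

```python
# Case counts reported in the abstract.
sc_cases = 1474    # Supreme Court of India, 2000 to 2024
khc_cases = 2139   # Karnataka High Court, 2018 to 2024
total_cases = sc_cases + khc_cases

# Quash rates in percent, as reported.
sc_rate_all = 57.6       # Supreme Court, full 2000 to 2024 period
sc_rate_matched = 59.3   # Supreme Court, matched 2018 to 2024 period
khc_rate = 39.7          # Karnataka High Court

# Gaps in percentage points, rounded to one decimal like the source.
gap_all = round(sc_rate_all - khc_rate, 1)
gap_matched = round(sc_rate_matched - khc_rate, 1)

print(total_cases, gap_all, gap_matched)
```

The matched-period gap reproduces the 19.6 percentage points stated in the abstract, and the two court subsets add up to the 3,613 judgments in the dataset.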

[NLP-57] Retrieval-Augmented Linguistic Calibration

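[Quick read]: This paper addresses how to build a general, principled calibration framework for confidence expressed in natural language, so that generative AI outputs convey their confidence accurately. Existing methods struggle with co-occurring linguistic cues (such as "I believe" or "probably"), contextual variation, and subjective audience interpretation. The key to the solution is a distributional view of confidence: linguistic confidence is modeled as a distribution over the probability values an audience might perceive, which captures variability in interpretation. On this basis the paper introduces faithfulness as a new evaluation dimension and proposes Faithfulness Divergence (FD), an information-theoretic metric that quantifies how surprised the audience's beliefs are once the truth is revealed. Finally, it presents Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language through retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration by up to 66% and 58%, respectively.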

Link: https://arxiv.org/abs/2605.19344
Authors: Yi-Fan Yeh,Linwei Tao,Minjing Dong,Tao Huang,Jialin Yu,Philip Torr,Chang Xu
Affiliations: University of Sydney; City University of Hong Kong; Shanghai Jiao Tong University; University of Oxford
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Linguistic cues such as “I believe” and “probably” offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

[NLP-58] HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

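[Quick read]: This paper addresses the inconsistency and poor reproducibility of hallucination evaluation for large language models (LLMs). Existing benchmarks operationalize hallucination differently across summarization, question answering, retrieval-augmented generation, and agentic interaction, which makes it hard to tell whether a mitigation generalizes across settings. The key to the solution is HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to that reference world. By constructing synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically, HalluWorld allows controlled variation of world complexity, observability, temporal change, and source-conflict policy, and separates hallucinations into fine-grained error categories. Experiments show that for frontier models hallucination on directly observed information is nearly solved, whereas multi-step state tracking and causal forward simulation remain difficult, and in terminal tasks models struggle to judge when to abstain. This suggests that hallucinations arise from several distinct failure modes rather than a single capability deficit, and that controlled reference worlds offer a scalable, reproducible path toward measuring and reducing hallucinations in modern language models.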

Link: https://arxiv.org/abs/2605.19341
Authors: Emmy Liu,Varun Gangal,Michael Yu,Zhuofu Tao,Karan Singh,Sachin Kumar,Steven Y. Feng
Affiliations: Carnegie Mellon University; Patronus AI; Stanford University; The Ohio State University; Independent Researcher; DegenAI Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: HalluWorld benchmark (code and data) at this http URL

Abstract:Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model’s view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

[NLP-59] A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation ACL2026

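[Quick read]: This paper addresses a weakness of current LLM-based reading comprehension item generation: most methods rely on single-agent prompting, which often fails to satisfy the specified difficulty-related feature constraints consistently, so the generated items drift from the target difficulty level. The key to the solution is a multi-agent framework, MAFIG, in which multiple LLM agents and feature-specific evaluators collaborate and iteratively revise items until they meet the intended constraints. To verify the effectiveness of difficulty control, the paper also proposes a method for constructing a difficulty-calibrated sequence of feature constraint sets. Experiments show that MAFIG satisfies the target constraints at a significantly higher rate than baselines, achieving more robust difficulty control.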

Link: https://arxiv.org/abs/2605.19316
Authors: Seonjeong Hwang,Jun Seo,Hyounghun Kim,Gary Geunbae Lee
Affiliations: POSTECH
Categories: Computation and Language (cs.CL)
Comments: ACL 2026 Main Conference

Abstract:Recent studies in difficulty-controlled reading comprehension item generation have leveraged large language models (LLMs) to produce items by adjusting difficulty-related features. However, existing methods typically rely on a single-agent prompting approach, which often fails to consistently satisfy specified feature constraints, resulting in items that deviate from the target difficulty level. To address this limitation, we introduce MAFIG, a Multi-agent Framework for Feature-constrained Item Generation, where multiple LLM agents and feature-specific evaluators collaborate to generate and iteratively revise items based on intended constraints. Furthermore, to verify the efficacy of MAFIG in difficulty control, we propose a method for constructing a sequence of feature constraint sets that yield items with monotonically increasing difficulty. Experimental results demonstrate that MAFIG generates items that adhere to target constraints at a significantly higher rate than baselines, achieving robust difficulty control through the difficulty-calibrated constraint sequence.

[NLP-60] How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

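[Quick read]: This paper addresses "Footprint Bias" in robustness evaluation of Document Layout Analysis (DLA) systems: current evaluation is largely area-centric, relying on the size of the perturbed region while ignoring how perturbations interact with document structure and affect downstream tasks. The key to the solution is a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. It combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to characterize where perturbations hit layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, B-SLR tracks perturbation-induced OCR instability far more closely (R^2=0.727/0.916) than affected area does (R^2=0.384/0.110), and small structurally targeted perturbations cause downstream QA and retrieval degradation comparable to larger-footprint perturbations. These results expose the structural vulnerability of DLA systems and shift evaluation from area-based stress testing toward structure-aware vulnerability auditing.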

Link: https://arxiv.org/abs/2605.19309
Authors: Yue Chen,Yihao Wang,Ziyi Tang,Keze Wang
Affiliations: Sun Yat-sen University
Categories: Computation and Language (cs.CL)
Comments: 19 pages, preprint

Abstract:Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose a lightweight output-level auditing framework that decouples probe construction, policy-driven targeting, and structure-aware diagnosis. The framework combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where perturbations interact with layout structure and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, and small structurally targeted probes cause downstream QA/retrieval degradation comparable to larger-footprint perturbations. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

[NLP-61] Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection KDD2026

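[Quick read]: This paper addresses the insufficient quality of training data for explainable Misinformation Detection (MD). Existing methods rely on prompt engineering with off-the-shelf large language models (LLMs) to elicit rationales, and when such outputs are used for fine-tuning, their value is limited by how sound the resulting training data is. The key to the solution is LONSREX, a novel data synthesis pipeline that locates necessary and sufficient rationales for misinformation detection. Its core innovation is a metric that quantifies the contribution of each verification step to the final prediction, which makes it possible to select concise, well-supported rationales. This mitigates two problems: rationales that are insufficient because filtering used only coarse-grained labels, and rationales that are redundant because stronger models over-verify.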

Link: https://arxiv.org/abs/2605.19285
Authors: Bing Wang,Rui Miao,Ximing Li,Chen Shen,Shaotian Yan,Changchun Li,Kaiyuan Liu,Xiaosong Yuan,Jieping Ye
Affiliations: Jilin University; Tongyi Lab, Alibaba Group; Zhejiang University; RIKEN AIP
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted by KDD 2026. 12 pages, 8 figures. Code: this https URL

Abstract:The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.

[NLP-62] Language models struggle with compartmentalization NEURIPS2026

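[Quick read]: This paper addresses a failure of large language models (LLMs) when the same latent concept appears in several distinct presentations in the training data: the model does not identify and share statistical strength across those presentations, which leads to redundant internal representations, wasted model capacity, and lower sample efficiency. The key contribution is to expose and quantify this "compartmentalization", in which the model learns a separate internal representation for each presentation instead of modeling them as views of one concept. The study finds that even adding synthetic parallel data can fail to improve the situation. More importantly, every intervention studied shows a phase transition in which its effectiveness depends strongly on the number of distinct presentations, suggesting that the current language modeling objective may only inconsistently unify representations across presentations.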

Link: https://arxiv.org/abs/2605.19284
Authors: Thomas Vincent Howe,David Wingate
Affiliations: Brigham Young University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 8 figures, plus 9 pages of appendices. Submitted to NeurIPS 2026. Code: this https URL . Eval data: this https URL

Abstract:In the training data used by large language models (LLMs), the same latent concept is often presented in multiple distinct ways: the same facts appear in English and Swahili; many functions can be expressed in both Python and Haskell; we can express propositions in both formal and natural language. We show that LLMs can exhibit compartmentalization, where they fail to identify and share statistical strength between distinct presentations of unified concepts. In the worst case, LLMs simply learn parallel internal representations of each presentation of the concept, saturating model capacity with redundancies and decreasing sample efficiency with the number of such presentations. We also demonstrate that synthetic parallel data can fail to improve this despite being easily learned itself. Under this framework, we find that, for small models, early multilingual learning is nearly entirely compartmentalized. Finally, all interventions that we study exhibit a phase transition in which their effectiveness depends on the number of distinct presentations, suggesting that the language modeling objective may only inconsistently unify representations.

[NLP-63] OpenCompass: A Universal Evaluation Platform for Large Language Models

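[Quick read]: This paper addresses common problems in current evaluation of large language models (LLMs): mainstream methods based on static benchmark datasets are limited by diverse task types, inconsistent evaluation criteria, and fragmented data and workflows, which makes efficient cross-domain and large-scale model evaluation difficult. The key to the solution is OpenCompass, an open-source, one-stop, scalable general-purpose LLM evaluation platform with high-concurrency support. Through modular design and component decoupling, the platform achieves high compatibility, flexibility, and high concurrency. Its core architecture comprises a configuration system, a task partitioning module, an execution and scheduling module, a task execution unit, and a result visualization module, and it offers rule-based, LLM-as-a-Judge, and cascaded evaluators to suit different task scenarios. It thereby gives academia and industry a unified, efficient LLM evaluation tool that helps identify model strengths and weaknesses and guide subsequent optimization.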

Link: https://arxiv.org/abs/2605.19276
Authors: Maosong Cao,Kai Chen,Haodong Duan,Yixiao Fang,Tong Gao,Ge Jiaye,Mo Li,Hongwei Liu,Junnan Liu,Yuan Liu,Chengqi Lyu,Han Lyu,Ningsheng Ma,Zerun Ma,Yu Sun,Zhiyong Wu,Linchen Xiao,Jun Xu,Haochen Ye,Zhaohui Yu,Yike Yuan,Songyang Zhang,Yufeng Zhao,Fengzhe Zhou,Peiheng Zhou,Dongsheng Zhu,Lin Zhu,Jingming Zhuo
Affiliations: Shanghai AI Laboratory
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

[NLP-64] Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

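[Quick read]: This paper addresses the distortion of causal explanations that arises when multilingual large language models (LLMs) handling non-English inputs are audited through English explanations. Its core finding is a systematic trade-off: English-pivot explanations can agree more closely with human rationales at the level of token spans (span agreement), yet their evidence is markedly less causally tied to the model's prediction, as shown by drops in the faithfulness metrics sufficiency and comprehensiveness. The recommended remedy is to audit explanations in the input language, to report multi-faceted faithfulness metrics rather than lexical overlap alone, and to treat English explanations as communication summaries rather than faithful traces of the model's decision process.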

Link: https://arxiv.org/abs/2605.19274
Authors: Somnath Banerjee,Pranav Jha,Rima Hazra,Animesh Mukherjee
Affiliations: Indian Institute of Technology Kharagpur; TCG Crest; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments:

Abstract:LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model’s prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

[NLP-65] DECOR: Auditing LLM Deception via Information Manipulation Theory

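[Quick read]: This paper addresses strategic deception by large language models (LLMs) that subtly manipulate truthful information, for example by omitting key facts, shifting focus, or obscuring meaning. Existing black-box methods rely on coarse-grained judgments and cannot pinpoint which facts were distorted or how. The key to the solution is DECOR, a framework grounded in Information Manipulation Theory that decomposes the input context into atomic informational units, scores each unit against the model's response along four dimensions, produces interpretable manipulation profiles, and aggregates them into a global deception index. The method enables fine-grained, interpretable detection of LLM deception, achieves state-of-the-art performance on both single-turn and multi-turn deception detection benchmarks, generalizes across 15 frontier models, and is supported by ablation studies confirming the contribution of each design component.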

Link: https://arxiv.org/abs/2605.19270
Authors: Linyue Cai,Samuel Yeh,Jwala Dhamala,Rahul Gupta,Sharon Li
Affiliations: University of Wisconsin-Madison; Amazon
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models can deceive by subtly manipulating truthful information – omitting key facts, shifting focus, or obscuring meaning – making such behavior difficult to detect. Existing black-box methods rely on coarse-grained judgments, offering limited interpretability and failing to pinpoint which facts were distorted and how. We introduce DECOR, a multi-agent framework grounded in Information Manipulation Theory for fine-grained auditing of strategic deception in LLM responses. DECOR decomposes input contexts into atomic informational units and scores each unit against the response across four dimensions of manipulation, producing interpretable manipulation profiles that are aggregated into a global deception index. We comprehensively evaluate DECOR on both single-turn and multi-turn deception detection benchmarks spanning real-world domains, and show that DECOR achieves state-of-the-art performance on both, outperforming competitive baselines. The framework generalizes across 15 frontier models, and ablation studies confirm the contribution of each key design component. Our findings demonstrate that fine-grained, theory-grounded auditing of information manipulation offers an effective and interpretable path for LLM deception detection.
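
The unit-level scoring and aggregation described in the abstract could be organized along these lines. This is a hedged sketch, not DECOR's actual design: the abstract names neither the four dimensions nor the aggregation rule, so the dimension names (the four usually associated with Information Manipulation Theory: quantity, quality, relation, manner), the [0, 1] score range, the plain averaging, and the names `UnitScores` and `deception_index` are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class UnitScores:
    """Manipulation scores in [0, 1] for one atomic information unit.

    The four field names are an assumption based on Information
    Manipulation Theory; the paper's own dimensions may differ.
    """
    quantity: float   # was the fact omitted or under-reported?
    quality: float    # was it falsified or distorted?
    relation: float   # was attention diverted away from it?
    manner: float     # was it stated vaguely or ambiguously?

    def profile(self):
        """Return the per-dimension manipulation profile for this unit."""
        return {
            "quantity": self.quantity,
            "quality": self.quality,
            "relation": self.relation,
            "manner": self.manner,
        }


def deception_index(units):
    """Average the per-unit mean scores into one global index (assumed rule)."""
    if not units:
        return 0.0
    return mean(mean(u.profile().values()) for u in units)


units = [
    UnitScores(quantity=1.0, quality=0.0, relation=0.0, manner=0.0),
    UnitScores(quantity=0.0, quality=0.0, relation=0.0, manner=0.0),
]
index = deception_index(units)
```

Keeping the per-unit profile separate from the aggregate is what would let an auditor see which fact was manipulated and along which dimension, rather than only a single score.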

[NLP-66] FormalASR: End-to-End Spoken Chinese to Formal Text

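[Quick read]: This paper addresses the poor fit between automatic speech recognition (ASR) output and downstream writing-oriented applications: conventional ASR models preserve the disfluencies, filler words, and informal structures of speech, so their text cannot be used directly in written contexts. The key to the solution is building two large-scale spoken-to-formal datasets (WenetSpeech-Formal and Speechio-Formal) through LLM-based rewriting and quality filtering, and then applying supervised fine-tuning (SFT) to Qwen3-ASR to train two lightweight end-to-end models (0.6B and 1.7B parameters). Experiments show that FormalASR achieves up to 37.4% relative reduction in character error rate (CER) over verbatim baselines while also improving ROUGE-L and BERTScore, and it needs no post-processing LLM at deployment, giving a low-latency, on-device solution for spoken-to-formal transcription.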

Link: https://arxiv.org/abs/2605.19266
Authors: Wanyi Ning,Yinshang Guo,Haitao Qian,Jiyuan Cheng,Weiyuan Feng,Yufei Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
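
For readers unfamiliar with the headline metric, character error rate and a relative reduction can be computed as in the sketch below. It uses the standard Levenshtein-based definition of CER; the paper's exact scoring setup (text normalization, punctuation handling) is not given in the abstract, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. 0.374 means 37.4% fewer errors."""
    return (baseline - improved) / baseline


example_cer = cer("abcd", "abxd")  # one substitution in four characters
```

As a purely hypothetical illustration, a baseline CER of 0.10 and a model CER of 0.0626 would correspond to a 37.4% relative reduction; these two CER values are made up for the example and are not reported in the paper.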

[NLP-67] AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

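[Quick read]: This paper addresses how the use of artificial intelligence (AI) is changing practice and theory in language access, and in particular how to balance AI technology against human value in a setting where efficiency, accessibility, legal compliance, ethical tensions, and safety risks intertwine. The approach is a qualitative thematic analysis of semi-structured interviews with ten language access managers in the USA working in healthcare, court, public service, and local government contexts. It identifies the core features of their attitudes towards AI: while accepting that AI implementation is inevitable, managers show conditional optimism, are strongly aware of the risks, and insist that human oversight and human value must run through both the deployment and the output of AI systems.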

链接: https://arxiv.org/abs/2605.19234
作者: Miguel A. Jiménez-Crespo,Stephanie Rodriguez,Alejandro Jaume Losa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 tables, Convergence Conference 2026

点击查看摘要

Abstract:The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards the AI and the human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk aware, and deeply committed to the human value and human oversight of AI implementations and output.

[NLP-68] Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution ICML2026

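[Quick Read]: This paper addresses the difficulty of diagnosing which step of a multi-step reasoning trace is likely to be wrong: large language models (LLMs) can generate step-by-step solutions, but existing confidence estimation methods either look only at the final answer or require access to model internals, which limits their use with closed-source models. The key to the solution is Stepwise Confidence Attribution (SCA), a framework that assigns a confidence score to each step using only the generated reasoning traces. SCA builds on the Information Bottleneck (IB) principle: a step that agrees with the consensus structure across multiple correct solutions receives high confidence, while a step that deviates from the consensus is flagged as a potential error. The paper proposes two complementary methods: NIBS, a non-parametric IB approach that measures consistency without graph structures, and GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Experiments show that SCA identifies low-confidence steps that correlate strongly with reasoning errors, and that using step-level confidence to guide self-correction raises the correction success rate by up to 13.5% over answer-level feedback alone.

Link: https://arxiv.org/abs/2605.19228
Authors: Xiaoou Liu,Tiejin Chen,Dengjia Zhang,Yaqing Wang,Lu Cheng,Hua Wei
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026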

Abstract:Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.
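The consensus idea behind step-level confidence can be illustrated with a deliberately simple sketch. This is not NIBS or GIBS and involves no Information Bottleneck objective; it only scores each step by how many reference traces contain the same step after whitespace and case normalisation. The function names and the exact-match criterion are assumptions made for illustration.

```python
def normalize(step: str) -> str:
    """Lower-case a step and collapse runs of whitespace."""
    return " ".join(step.lower().split())


def step_confidence(trace: list[str], reference_traces: list[list[str]]) -> list[float]:
    """Score each step by the fraction of reference traces that contain it.

    Exact matching after normalisation is a crude stand-in (assumption) for
    the structural consistency that the paper's methods measure.
    """
    if not reference_traces:
        raise ValueError("need at least one reference trace")
    reference_sets = [{normalize(s) for s in ref} for ref in reference_traces]
    scores = []
    for step in trace:
        key = normalize(step)
        matches = sum(key in ref_set for ref_set in reference_sets)
        scores.append(matches / len(reference_sets))
    return scores


# Hypothetical traces sampled for the same question.
references = [
    ["Let x = 2", "x + 3 = 5"],
    ["let x = 2", "x + 3 = 5"],
    ["Let x = 2", "x * 3 = 6"],
]
candidate = ["Let  x = 2", "x + 3 = 6"]
print(step_confidence(candidate, references))
```

Steps with low scores are the ones a self-correction prompt would point at first; how the paper turns confidence into feedback is not specified in the abstract.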

[NLP-69] Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

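[Quick Read]: This paper addresses the fact that training data from invasive brain recording techniques such as electrocorticography (ECoG) is limited to the patient populations that can receive implants, which makes large-scale modeling on such data difficult. The key to the solution is to use non-invasive functional magnetic resonance imaging (fMRI) as a bridge: language representation models are fine-tuned on fMRI data and then used to build encoding models of ECoG signals. Although the temporal resolution of fMRI is two orders of magnitude lower than that of ECoG, this cross-modal transfer clearly improves ECoG prediction, including in frequency bands well beyond what fMRI measures directly. Models fine-tuned on temporally downsampled fMRI still generalize well, and ECoG performance keeps improving as the amount of fMRI training data grows, showing that "slow" data such as fMRI can be used to improve models of "fast" brain data such as ECoG.

Link: https://arxiv.org/abs/2605.19224
Authors: Aditya R. Vaidya,Richard J. Antonello,Alexander G. Huth
Affiliation: The University of Texas at Austin; Columbia University; University of California, Berkeley
Subjects: Computation and Language (cs.CL)
Comments: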

Abstract:Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure’s generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that “slow” data like fMRI can be a valuable resource for building better models of “fast” brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.

[NLP-70] Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering ICML2026

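[Quick Read]: This paper addresses a fundamental flaw in current uncertainty quantification (UQ) methods for large language models (LLMs). Existing UQ methods are in essence unsupervised clustering algorithms: they measure the internal consistency of a model's generations rather than their external correctness, so they cannot detect "confident hallucinations", where the model is highly confident in a stable but wrong answer. As a result, current UQ methods can create a false sense of safety at deployment. The key to the proposed way forward is a paradigm shift: anchor verification in objective truth instead of unstable proxy metrics, improve evaluation metrics and settings, and design mechanisms for native model uncertainty, so that model confidence reflects real-world accuracy and LLMs become more trustworthy in high-stakes settings.

Link: https://arxiv.org/abs/2605.19220
Authors: Tiejin Chen,Longchao Da,Xiaoou Liu,Hua Wei
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICML 2026 Position Paper Track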

Abstract:Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model’s generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect "confident hallucinations," where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.
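The argument that consistency-based UQ behaves like clustering can be made concrete with a small sketch. It groups sampled answers by exact match after normalisation (an assumption; published methods typically cluster by semantic equivalence) and reports the entropy over clusters. The function name and the sample answers are hypothetical.

```python
import math
from collections import Counter


def consistency_entropy(answers: list[str]) -> float:
    """Entropy (in nats) over clusters of identical normalised answers.

    The score depends only on how the samples agree with each other;
    no ground truth enters the computation.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    clusters = Counter(a.strip().lower() for a in answers)
    total = len(answers)
    return -sum((n / total) * math.log(n / total) for n in clusters.values())


# Hypothetical samples for "What is the capital of Australia?"
stable_but_wrong = ["Sydney", "sydney", "Sydney ", "Sydney", "Sydney"]
unstable = ["Canberra", "Sydney", "Melbourne", "Canberra", "Sydney"]

print(consistency_entropy(stable_but_wrong))  # one cluster, so the score signals full confidence
print(consistency_entropy(unstable))
```

The first sample set is uniformly wrong yet receives the lowest possible uncertainty, which is the "confident hallucination" case the abstract describes.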

[NLP-71] Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

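[Quick Read]: This paper addresses the lack of reliability and fine-grained failure detection in current LLM-based judging of deep research agents, which makes LLM-as-judge risky as a supervision paradigm. The key to the solution is REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a fine-grained meta-evaluation benchmark covering process-level and outcome-level failure modes in agentic environments. REFLECT defines a detailed taxonomy of failure modes and applies controlled interventions to quality-screened agent execution traces, producing verifiable, comprehensive, and fine-grained failure instances for systematically assessing judge accuracy. Experiments show that current LLM judges reach overall accuracies below 55%, with especially poor performance on evidence verification. The results expose systematic judge limitations and cost-reliability tradeoffs, and offer empirical grounding and practical guidance for building more reliable agent evaluation pipelines.

Link: https://arxiv.org/abs/2605.19196
Authors: Leyao Wang,Yanan He,Peng Chen,Asaf Yehudai,Yixin Liu,Rex Ying,Michal Shmueli-Scheuer,Arman Cohan
Affiliation: Yale University; IBM Research
Subjects: Computation and Language (cs.CL)
Comments: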

Abstract:Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality. Yet the reliability of these judges for deep research agents remains poorly understood, posing a critical meta-evaluation problem: before deploying LLM judges to supervise research agents, we must first evaluate the judges themselves. Existing meta-evaluations fall short in two ways: (1) reliance on coarse, subjective human-preference agreement; (2) focus on instruction-following or verifiable tasks, leaving open-ended agent executions unexplored. To address these gaps, we introduce REFLECT (REliable Fine-grained LLM judge Evaluation via Controlled inTervention), a meta-evaluation benchmark targeting fine-grained failure detection in agentic environments. REFLECT defines a detailed taxonomy of process- and outcome-level failure modes, instantiated by performing controlled and localized interventions on quality-screened agent execution traces. This yields verifiable, comprehensive, and fine-grained instances for validating the judge models. Our experiments show that current LLM judges remain unreliable: even the best-performing models achieve overall accuracies below 55% across reasoning, tool-use, and report-quality failures, with especially poor performance on evidence verification. Together, our taxonomy and findings expose systematic judge limitations, reveal tradeoffs in cost and reliability, and offer actionable guidance for building more reliable evaluation pipelines for deep research agents.
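A controlled, localized intervention of the kind the abstract describes could look roughly like the sketch below. The trace format, the failure-mode label, and the judge interface are all hypothetical; the benchmark's actual schema is not given in the abstract.

```python
import copy
from typing import Callable, Optional


def inject_failure(trace: list[dict], step_index: int, corrupted_text: str,
                   failure_mode: str) -> dict:
    """Corrupt exactly one step of a clean trace and record what was changed.

    The original trace is left untouched, so the clean and corrupted
    versions differ only at `step_index`.
    """
    corrupted = copy.deepcopy(trace)
    corrupted[step_index] = {**corrupted[step_index], "text": corrupted_text}
    label = {"step": step_index, "mode": failure_mode}
    return {"trace": corrupted, "label": label}


def judge_accuracy(instances: list[dict],
                   judge: Callable[[list[dict]], Optional[dict]]) -> float:
    """Fraction of instances where the judge names the injected step and mode."""
    if not instances:
        raise ValueError("need at least one instance")
    hits = sum(judge(inst["trace"]) == inst["label"] for inst in instances)
    return hits / len(instances)


# Hypothetical clean trace and a single injected evidence error.
clean_trace = [
    {"tool": "search", "text": "Source A reports 12% growth in 2024."},
    {"tool": "write", "text": "The report cites 12% growth (Source A)."},
]
instance = inject_failure(clean_trace, 1,
                          "The report cites 21% growth (Source A).",
                          "evidence_mismatch")
```

Because the label is known by construction, judge outputs can be scored without human preference annotations.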

[NLP-72] MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

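[Quick Read]: This paper addresses the inability of static routing in existing Mixture-of-Agents (MoA) frameworks to fully capture temporal and contextual dependencies across aggregation layers. The key to the solution is an LSTM-based recurrent gating mechanism (a recurrent router) that combines the current input with historical routing decisions to adjust each agent's contribution dynamically, giving more context-aware agent selection and aggregation. Experiments show that MMoA keeps accuracy comparable to traditional MoA while reducing computational overhead: on AlpacaEval 2.0 it reaches a 58.0% win rate, close to MoA's 59.8%, while improving runtime efficiency by up to 4.6%, which supports its scalability and efficiency.

Link: https://arxiv.org/abs/2605.19194
Authors: Rui Chu
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments: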

Abstract:The Mixture-of-Agents (MoA) framework has shown promise in improving large language model (LLM) performance by aggregating outputs from multiple agents. However, existing MoA systems often rely on static routers that do not fully capture temporal and contextual dependencies across aggregation layers. To address this limitation, we propose MMoA, a recurrent MoA architecture that integrates LSTM-based gating into the agent selection process. The recurrence router adaptively modulates agent contributions based on both current inputs and historical routing decisions, enabling more context-aware aggregation. We evaluate MMoA on standard instruction-following benchmarks, including AlpacaEval 2.0, MT-Bench, and Arena-Hard. The results show that MMoA achieves comparable accuracy to traditional MoA while reducing computational overhead by dynamically activating fewer agents. For example, on AlpacaEval 2.0, MMoA achieves a win rate of 58.0%, compared with 59.8% for MoA, while improving runtime efficiency by up to 4.6%. These results suggest that MMoA provides a scalable and efficient approach for adaptive multi-agent LLM systems.

[NLP-73] Prompting language influences diagnostic reasoning and accuracy of large language models

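[Quick Read]: This paper asks whether the performance of large language models (LLMs) in clinical decision support depends on the prompting language, given that their reliability outside English remains unclear. The key to the approach is a systematic comparison of diagnostic reasoning quality and final diagnosis accuracy for five mainstream LLMs under English and French prompts, using 180 clinical vignettes covering 16 medical specialties, each assessed by two physicians on an 18-point scale. Four of the five models performed significantly better with English prompts than with French ones, with the gap spanning several aspects of reasoning, including differential diagnosis, logical structure, and internal validity; o3 was the only exception. The finding underlines that prompting language is a critical determinant of LLM clinical performance, with implications for equitable deployment across languages and cultures worldwide.

Link: https://arxiv.org/abs/2605.19173
Authors: Adrien Bazoge,Josselin Corvellec,Sofiane Djillali Sid-Ahmed,Pierre-Antoine Gourraud
Affiliation: Nantes University Hospital
Subjects: Computation and Language (cs.CL)
Comments: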

Abstract:Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

[NLP-74] Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

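[Quick Read]: This paper addresses a failure mode of agents built on large language models (LLMs): when they hit non-malicious environmental errors (such as inaccessible webpages, missing files, or misconfigurations), they may respond with unexpected and harmful behavior, termed "accidental meltdown", which existing reliability and safety benchmarks do not capture. The key to the solution is, first, a taxonomy of meltdown behaviors and, second, an agent-agnostic infrastructure that injects simulated local and remote errors during rollouts, used to systematically evaluate agents powered by GPT, Grok, and Gemini. Experiments show that meltdowns of varying severity (for example, unauthorized reconnaissance or subverting access control) occur in 64.7% of agent rollouts that encounter simulated errors, and that in over half of these cases the unsafe behavior is not reported to the user. The study also finds that exploration in response to errors is correlated with unsafe and harmful behavior, pointing to a close link between error handling and safety.

Link: https://arxiv.org/abs/2605.19149
Authors: Rishi Jha,Harold Triedman,Arkaprabha Bhattacharya,Vitaly Shmatikov
Affiliation: Cornell University
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 32 pages, 8 figures, 4 tables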

Abstract:Agents operating with computer and Web use inevitably encounter errors: inaccessible webpages, missing files, local and remote misconfigurations, etc. These errors do not thwart agents based on state-of-the-art models. They helpfully continue to look for ways to complete their tasks. We introduce, characterize, and measure a new type of agent failure we call accidental meltdown: unsafe or harmful behavior in response to a benign environmental error, in the absence of any adversarial inputs. Because meltdowns are not captured by the existing reliability or safety benchmarks, we develop a taxonomy of meltdown behaviors. We then implement an agent-agnostic infrastructure for injecting simulated local and remote errors into the rollout environment and use it to systematically evaluate agent systems powered by GPT, Grok, and Gemini. Our evaluation demonstrates that meltdowns (e.g., conducting unauthorized reconnaissance or subverting access control) of varying severity and success occur in 64.7% of agent rollouts that encounter simulated errors, spanning all combinations of agent system, backing model, and error type. In over half of these meltdowns, unsafe behaviors are not reported to the user. Comparing behaviors of the same agents with and without errors, we find that exploration in response to errors is correlated with unsafe and harmful behavior.
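Error injection into an agent's tool layer can be sketched as a thin wrapper around a tool function. This is only an illustration of the general idea; the wrapper name, the exception type, and the `fetch_page` tool are hypothetical and do not describe the paper's infrastructure.

```python
import random
from typing import Any, Callable


class SimulatedToolError(Exception):
    """A benign, non-adversarial failure such as a missing file or a 404."""


def with_error_injection(tool: Callable[..., Any], error_rate: float,
                         rng: random.Random, message: str) -> Callable[..., Any]:
    """Return a version of `tool` that fails with probability `error_rate`."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # rng.random() lies in [0, 1), so rate 0.0 never fails and 1.0 always does.
        if rng.random() < error_rate:
            raise SimulatedToolError(message)
        return tool(*args, **kwargs)

    return wrapped


def fetch_page(url: str) -> str:
    """Hypothetical stand-in for a real web-fetch tool."""
    return f"<html>contents of {url}</html>"


flaky_fetch = with_error_injection(fetch_page, error_rate=1.0,
                                   rng=random.Random(0),
                                   message="HTTP 404: page not found")
try:
    flaky_fetch("https://example.com")
except SimulatedToolError as err:
    print("agent observes:", err)
```

Running the same agent with the wrapped and unwrapped tool gives the with-error and without-error conditions that the abstract compares.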

[NLP-75] EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

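[Quick Read]: This paper addresses the failure of current vision-language models (VLMs) to generalize to naturalistic settings such as the weakly aligned, sparse audiovisual streams produced by infant head-mounted cameras: existing models depend on carefully curated, tightly aligned data and cannot exploit the weakly aligned multimodal input that human infants receive in the real world. The key to the solution is a new evaluation framework built around Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence whose test items are generated automatically from the model's training vocabulary across frequency bins. This removes the train/eval mismatch and low statistical power of earlier developmental benchmarks, and the benchmark is combined with several naturalistic egocentric video datasets for systematic evaluation. The results expose the current VLM paradigm's dependence on tightly aligned data and set a clear direction for models that can learn grounded language from naturalistic data.

Link: https://arxiv.org/abs/2605.19130
Authors: Dongyan Lin,Phillip Rust,Angel Villar Corrales,Alvin W. M. Tan,Mahi Luthra,Charles-Éric Saint-James,Rashel Moritz,Sheila Krogh-Jespersen,Vanessa Stark,Surya Parimi,Jiayi Shen,Youssef Benchekroun,Yosuke Higuchi,Martin Gleize,Tom Fizycki,Nicolas Hamilakis,Manel Khentout,Sho Tsuji,Balázs Kégl,Juan Pino,Michael C. Frank,Emmanuel Dupoux
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: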

Abstract:Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today’s best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams – and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model’s training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input – the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.
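Drawing test items from the training vocabulary across logarithmic frequency bins might be sketched as follows. The base-10 binning, the per-bin sampling, and all names are assumptions for illustration; the abstract does not specify how Machine-DevBench defines its bins.

```python
import random
from collections import defaultdict


def log_frequency_bins(counts: dict[str, int]) -> dict[int, list[str]]:
    """Group words by order of magnitude of their corpus frequency.

    For a positive integer count, len(str(count)) - 1 equals
    floor(log10(count)) and avoids floating-point edge cases.
    """
    bins: dict[int, list[str]] = defaultdict(list)
    for word, count in counts.items():
        if count > 0:
            bins[len(str(count)) - 1].append(word)
    return {b: sorted(words) for b, words in sorted(bins.items())}


def sample_test_items(counts: dict[str, int], per_bin: int,
                      seed: int = 0) -> dict[int, list[str]]:
    """Sample up to `per_bin` words from every frequency bin."""
    rng = random.Random(seed)
    bins = log_frequency_bins(counts)
    return {b: sorted(rng.sample(words, min(per_bin, len(words))))
            for b, words in bins.items()}


# Hypothetical word counts from a model's training corpus.
vocab_counts = {"the": 5000, "dog": 120, "cup": 340, "ball": 95, "giraffe": 3}
print(log_frequency_bins(vocab_counts))
print(sample_test_items(vocab_counts, per_bin=1))
```

Because every test word is taken from the training vocabulary, no item can fall outside what the model has seen, which is the train/eval mismatch the abstract says the benchmark removes.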

[NLP-76] Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

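[Quick Read]: This paper addresses the difficulty of quantifying information flow between private and public computation channels in reasoning systems, especially when independent co-derivation, direct access to private content, and indirect influence through public communication look alike in transcripts and cannot be reliably separated by conventional tools such as textual probes. The key to the solution is a counterfactual likelihood test: an upstream private block is replaced with a length-matched "donor" block, the downstream target and public token sequence are held fixed, and the shift in the target's negative log-likelihood quantifies the influence between private channels. The method separates unmasked from masked conditions, and length matching controls for a confound from RoPE positional encoding. Validation across multiple checkpoints and random seeds at scale reveals an asymmetry: A-to-B influence persists through public-speech hidden states, while B-to-A influence is near zero. A graph-separation control further confirms that the public channel is the only pathway carrying the counterfactual signal, giving private-reasoning evaluation a practical and reliable way to measure these boundaries.

Link: https://arxiv.org/abs/2605.19092
Authors: Alexander Boesgaard Lorup(Openhagen)
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 5 tables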

Abstract:Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target’s negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
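The core measurement, a negative-log-likelihood shift under a length-matched donor swap, can be sketched with a toy scoring function. The `logprob_fn` interface and the toy model are assumptions; the sketch simply concatenates blocks into one context and ignores the role-visibility masking used in the paper's setup.

```python
import math
from typing import Callable

# (context tokens, next token) -> log-probability of the next token
LogProbFn = Callable[[list[str], str], float]


def nll(logprob_fn: LogProbFn, context: list[str], target: list[str]) -> float:
    """Negative log-likelihood of `target` given `context`, token by token."""
    return -sum(logprob_fn(context + target[:i], target[i])
                for i in range(len(target)))


def counterfactual_shift(logprob_fn: LogProbFn, private_block: list[str],
                         donor_block: list[str], public_tokens: list[str],
                         target: list[str]) -> float:
    """NLL(target | donor, public) minus NLL(target | private, public).

    Public tokens and target stay fixed; only the upstream private block
    is swapped. The donor must match its length so token positions agree.
    """
    if len(private_block) != len(donor_block):
        raise ValueError("donor block must be length-matched")
    natural = nll(logprob_fn, private_block + public_tokens, target)
    counterfactual = nll(logprob_fn, donor_block + public_tokens, target)
    return counterfactual - natural


def toy_logprob(context: list[str], token: str) -> float:
    """Toy model (assumption): tokens already seen in context are likelier."""
    return math.log(0.9 if token in context else 0.1)


shift = counterfactual_shift(toy_logprob, ["x", "y"], ["p", "q"], ["hello"], ["x"])
print(shift)  # positive: the target depended on the original private block
```

A shift near zero suggests the target does not depend on the swapped block; a positive shift indicates influence, whether direct or carried through the public tokens.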

[NLP-77] ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking ACL2026

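[Quick Read]: This paper addresses the unpredictable behavior that LLM hallucination and format errors cause in task-oriented dialogue systems, especially when latency constraints require moderately sized models whose mistakes cascade into action-level errors (such as booking a hotel for the wrong date). The key to the solution is ReacTOD, a bounded neuro-symbolic architecture that reformulates natural language understanding (NLU) as discrete tool calls inside a self-correcting ReAct loop, with deterministic validation ensuring that every dialogue state update satisfies action compliance, schema conformance, and coreference consistency. The design achieves a 93.1% self-correction rate on intercepted errors and improves accuracy by up to 9.3 percentage points on MultiWOZ. Incremental state prediction and on-demand history retrieval keep prompts compact, which improves instruction adherence in parameter-constrained models. ReacTOD sets a new zero-shot state of the art on MultiWOZ 2.1 and generalizes to the Schema-Guided Dialogue (SGD) benchmark without task-specific training data.

Link: https://arxiv.org/abs/2605.19077
Authors: Yanjun Lin,Zimo Xiao,Kartik Natarajan,Mahesh Sankaranarayanan,Niraj Nawanit,Rakshit Parashar,Austin Zhang,Karthik Konaraddi,Rishita Mote,Wei Niu
Affiliation: Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at TrustNLP Workshop at ACL 2026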

Abstract:Task-oriented dialogue systems – handling transactions, reservations, and service requests – require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% – demonstrating cross-benchmark generalization without task-specific training data.
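A bounded propose-validate-retry loop with a deterministic validator can be sketched as below. Only schema conformance is checked here; the schema format, the `propose` callback standing in for the LLM call, and the step limit are hypothetical and do not reproduce ReacTOD's tools or validators.

```python
from typing import Callable, Optional

Schema = dict[str, set[str]]   # slot name -> allowed values
Update = dict[str, str]        # proposed dialogue state update


def validate(update: Update, schema: Schema) -> list[str]:
    """Deterministic schema check; returns human-readable error messages."""
    errors = []
    for slot, value in update.items():
        if slot not in schema:
            errors.append(f"unknown slot: {slot}")
        elif value not in schema[slot]:
            errors.append(f"invalid value for {slot}: {value}")
    return errors


def bounded_update(propose: Callable[[list[str]], Update], schema: Schema,
                   max_steps: int = 3) -> Optional[Update]:
    """Ask for an update, validate it, and retry with feedback at most `max_steps` times.

    Returns the first valid update, or None if the budget runs out, so an
    invalid state is never committed.
    """
    feedback: list[str] = []
    for _ in range(max_steps):
        update = propose(feedback)
        feedback = validate(update, schema)
        if not feedback:
            return update
    return None


# Hypothetical stub for the model: wrong on the first try, right on the second.
attempts = iter([{"hotel-day": "funday"}, {"hotel-day": "friday"}])
seen_feedback: list[list[str]] = []


def stub_propose(feedback: list[str]) -> Update:
    seen_feedback.append(list(feedback))
    return next(attempts)


hotel_schema: Schema = {"hotel-day": {"monday", "friday"}}
result = bounded_update(stub_propose, hotel_schema)
print(result, seen_feedback)
```

The bound matters for latency: the loop terminates after a fixed number of model calls regardless of how the model behaves.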

[NLP-78] Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic Persian and German

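[Quick Read]: This paper addresses the inadequate evaluation of automatic speech recognition (ASR) systems in real multilingual settings, especially for code-switching, a hard and under-studied condition. Existing commercial ASR benchmarks rely mainly on clean monolingual audio and report a single word error rate (WER), which says little about real multilingual performance. The key to the solution is a new benchmark covering four language pairs (including Arabic-English and Persian-English), built with a two-stage selection pipeline that uses an ensemble of large language models (LLMs) for scoring and cuts scoring cost by about 91% while preserving selection quality. BERTScore is added as a complementary metric to offset WER's tendency to penalize semantically correct transcriptions that differ in transliteration. ElevenLabs Scribe v2 achieves the lowest WER and the highest BERTScore across all language pairs, and difficulty-stratified analysis reveals performance gaps that aggregate averages hide, while embedding projections confirm semantic agreement between reference and hypothesis.

Link: https://arxiv.org/abs/2605.19069
Authors: Sajjad Abdoli,Ghassan Al-Sumaidaee,Clayton W. Taylor,Ahmad (MAD) ElShiekh,Ahmed Rashad
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: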

Abstract:Code-switching – the natural alternation between two languages within a single utterance – represents one of the most challenging and under-studied conditions for automatic speech recognition (ASR). Existing commercial ASR benchmarks predominantly evaluate clean, monolingual audio and report a single Word Error Rate (WER) figure that tells practitioners little about real-world multilingual performance. We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English. Each dataset comprises 300 samples selected by a two-stage pipeline: a heuristic filter scoring transcripts on five structural code-switching signals, followed by a GPT-4o and Gemini 1.5 Pro ensemble scoring candidates across six linguistic dimensions. This pipeline reduces LLM scoring costs by approximately 91% relative to exhaustive scoring. We evaluate the systems on both WER and BERTScore, arguing that BERTScore is a more reliable metric for Arabic and Persian pairs where transliteration variance causes WER to penalise semantically correct transcriptions. ElevenLabs Scribe v2 achieves the lowest WER across all four language pairs (13.2% overall; 13.1% on Egyptian Arabic) and leads on BERTScore (0.936 overall). We further demonstrate that difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and that BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The benchmarking dataset is publicly available at this https URL.

[NLP-79] The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

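[Quick Read]: This paper addresses the widening mismatch between evaluation practice and technical progress in low-resource natural language processing (NLP): as generative AI systems grow more complex, the deep sociolinguistic expertise needed to evaluate them far exceeds what existing evaluation infrastructure can supply. The key to the contribution is the "Annotation Scarcity Paradox", the structural friction between the technical capacity to scale models and the sovereign human infrastructure required to evaluate them. The paper calls for a shift from a transactional, data-extraction model to a community-embedded evaluation paradigm grounded in epistemic governance, data sovereignty, and shared ownership, in order to make evaluation more equitable and more valid.

Link: https://arxiv.org/abs/2605.19066
Authors: Vukosi Marivate
Affiliation: DSFSI AfriDSAI, University of Pretoria; Lelapa AI
Subjects: Computation and Language (cs.CL)
Comments: Under Review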

Abstract:Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014–present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses – including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning – and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

[NLP-80] Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

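[Quick Read]: This paper addresses the instability, degraded runs, and wasted compute that arise in large language model (LLM) training under aggressive learning rates, larger scale, and runtime pressure. The core of the solution is Learn-by-Wire Guard (LBW-Guard), a bounded governance layer that acts as an autonomous control mechanism above the optimizer (AdamW): it monitors training telemetry to identify unstable regimes and applies bounded control to optimizer execution, improving robustness and efficiency without changing the fixed training objective. The key point is that LBW-Guard neither replaces the optimizer's update rule nor suppresses local gradients. Through bounded runtime intervention it keeps the model trainable under extreme learning rates, lowers final perplexity substantially, and shortens training time, supporting the case for a "governance plane" above the optimizer in stability-sensitive LLM training.

Link: https://arxiv.org/abs/2605.19008
Authors: Anis Radianis
Affiliation: Qluon Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: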

Abstract:Modern language-model training is increasingly exposed to instability, degraded runs, and wasted compute, especially under aggressive learning-rate, scale, and runtime-stress conditions. This paper introduces Learn-by-Wire Guard (LBW-Guard), a bounded autonomous training-control governance layer that operates above AdamW. Rather than replacing the optimizer update rule, LBW-Guard observes training telemetry, interprets instability-sensitive regimes, and applies bounded control to optimizer execution while preserving fixed training objectives. We evaluate LBW-Guard in a Qwen2.5-centered stress-and-robustness suite using WikiText-103, with Qwen2.5-7B as the empirical anchor, model-size comparisons against Qwen2.5-3B and Qwen2.5-14B, learning-rate stress tests, gradient-clipping baselines, and a no-LoRA TinyLlama-1B full-parameter sanity check. In the 7B reference setting, LBW-Guard reduces final perplexity from 13.21 to 10.74, an 18.7% improvement, while reducing end-to-end time from 392.54s to 357.02s, a 1.10x speedup. Under stronger learning-rate stress, AdamW degrades to 1885.24 final perplexity at LR=3e-3 and 659.76 at LR=1e-3, whereas LBW-Guard remains trainable at 11.57 and 10.33, respectively. Gradient-clipping baselines do not reproduce this effect. These results support a scoped systems conclusion that stability-sensitive LLM training can benefit from a governance plane above the optimizer. LBW-Guard provides evidence that bounded runtime control can preserve productive compute under stress while remaining distinct from optimizer replacement and local gradient suppression.
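The two headline figures for the 7B reference setting can be checked directly from the numbers in the abstract, assuming "improvement" means the relative reduction in final perplexity and "speedup" means baseline time divided by LBW-Guard time:

```latex
\[
\frac{13.21 - 10.74}{13.21} = \frac{2.47}{13.21} \approx 0.187
\quad (18.7\%\ \text{lower final perplexity}),
\qquad
\frac{392.54\ \text{s}}{357.02\ \text{s}} \approx 1.10
\quad (1.10\times\ \text{speedup}).
\]
```

Both values agree with the reported 18.7% and 1.10x under these definitions.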

[NLP-81] FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data ACL

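[Quick Read]: This paper addresses the privacy risk of sharing sensitive social media text when training machine learning models to identify high-risk mental health behaviors, a risk that also limits the growth of benchmark datasets. The key to the approach is evaluating two privacy-preserving techniques, federated learning (FL) and differentially private FL, on two typical mental health tasks: depression detection on X (formerly Twitter) and suicide crisis detection on Reddit. Treating each user as a client in a non-IID setting, the study systematically evaluates performance across client fractions, aggregation strategies, and privacy budgets. FL allows safer data collaboration with performance close to centralized training (F1 = 83.16 versus 85.63), whereas differentially private FL suffers a large performance loss (a drop of up to 27.01 F1 points) even at low noise levels (ε = 50), mainly because highly informative but sparse mental health linguistic markers (such as health topic words and emotion words) are heavily distorted by the added noise. The study gives empirical evidence of both the potential and the limits of current privacy-preserving techniques for mental health inference.

Link: https://arxiv.org/abs/2605.18936
Authors: Nuredin Ali Abdelkadir,Anjali Ratnam,Zeerak Talat,Stevie Chancellor
Affiliation: University of Minnesota; University of Edinburgh
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Association for Computational Linguistics (ACL) 2026 Main Conference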

Abstract:Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. However, sharing this sensitive data poses privacy risks and limits the growth of benchmark datasets. We comprehensively evaluate whether privacy-preserving ML techniques can enable safer data sharing while preserving performance. Specifically, we apply federated learning (FL) and Differentially Private FL for two widely-studied mental health prediction tasks: depression detection on X (Twitter) and suicide crisis detection on Reddit. We simulate realistic data-sharing scenarios by treating each user as a client in a non-IID setting, evaluating across different client fractions, aggregation strategies, and privacy budgets. While FL achieves comparable performance to centralized training (centralized F1 = 85.63; best FL model F1 = 83.16) on depression identification, we find that Differentially Private FL has a large performance-privacy trade-off (up to F1 = 27.01 drop) even with low levels of noise (epsilon = 50). This is due to the distortion of highly informative yet sparse mental health linguistic markers related to mental health, like health topics and emotion words. This research empirically demonstrates the potential and limitations of current privacy preservation techniques for mental health inference tasks.
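The mechanism behind the privacy-utility trade-off, clipping each client's update and adding noise to the average, can be sketched in a few lines. This is a generic illustration, not the paper's training code: the unweighted average, the function names, and the direct `noise_std` parameter are assumptions, and calibrating the noise to a specific epsilon is left out.

```python
import math
import random


def clip(update: list[float], max_norm: float) -> list[float]:
    """Scale an update down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_fedavg(client_updates: list[list[float]], max_norm: float,
              noise_std: float, rng: random.Random) -> list[float]:
    """Average clipped client updates and add Gaussian noise per coordinate.

    With noise_std = 0 this reduces to plain (clipped, unweighted) averaging.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    clipped = [clip(u, max_norm) for u in client_updates]
    n, dim = len(clipped), len(clipped[0])
    return [sum(u[d] for u in clipped) / n + rng.gauss(0.0, noise_std)
            for d in range(dim)]


# Hypothetical updates: one client carries a rare but informative signal
# in the second coordinate, the rest carry almost none.
updates = [[0.1, 0.0], [0.1, 0.0], [0.1, 0.9], [0.1, 0.0]]
rng = random.Random(0)
print(dp_fedavg(updates, max_norm=1.0, noise_std=0.0, rng=rng))
print(dp_fedavg(updates, max_norm=1.0, noise_std=0.5, rng=rng))
```

A feature that only a few clients contribute has a small average, so noise of comparable size can swamp it, which is consistent with the abstract's explanation about sparse linguistic markers.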

[NLP-82] Dynamic Model Merging Made Slim

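[Quick Read]: This paper targets the accuracy-efficiency trade-off caused by unbalanced parameter allocation in dynamic model merging: existing methods either keep a full shared model with tiny experts or give experts too much capacity. The proposed DiDi-Merging framework uses differentiable rank allocation, casting parameter budgeting as differentiable rank optimization in low-rank modules, and adds a data-free refinement step to recover task fidelity. It matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, which is substantially more compact than methods requiring 2x storage, and it applies to vision, language, and multimodal tasks.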

Link: https://arxiv.org/abs/2605.18904
Authors: Guodong Du, Wanyu Lin
Affiliation: The Hong Kong Polytechnic University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy–efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

[NLP-83] ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

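[Quick Read]: This paper addresses the privacy and safety risks that arise when large language models (LLMs) retain sensitive information (defined as inputs that may induce harmful generations) from training. Existing machine unlearning methods rely mainly on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall utility. The paper recasts unlearning as precise knowledge re-mapping via model editing and proposes ZeroUnlearn, which maps sensitive inputs to a neutral target state, removes their original representations, and enforces representational orthogonality through a multiplicative parameter update with a closed-form solution. A gradient-based variant extends it to multi-sample unlearning. Experiments show that it outperforms existing baselines while preserving general model utility.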

Link: https://arxiv.org/abs/2605.18879
Authors: Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: this https URL.

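[NLP-84] SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

[Quick Read]: Reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks but often yields smaller gains in pass@k, which raises the question of whether RLVR gives models new reasoning abilities or only sharpens reasoning modes already present in the base model. The paper identifies a central structural constraint in standard RLVR objectives: reverse-KL regularization stabilizes training but anchors the policy to the reference distribution and suppresses alternative reasoning modes. Neither removing the KL term nor replacing it with forward KL resolves this, because both upset the efficiency-coverage trade-off by inducing reward hacking or by placing probability mass on off-target regions. The proposed SAGE framework reshapes the reverse-KL anchor distribution through a guide function $q(x,y)$, enabling controllable expansion of empirical support and improving both pass@1 and pass@k on mathematical reasoning benchmarks.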

Link: https://arxiv.org/abs/2605.18864
Authors: Chanuk Lee, Minki Kang, Sung Ju Hwang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Preprint

Abstract:Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at this https URL.
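
The anchoring effect described above can be made concrete with the standard KL-regularized objective. This is a sketch in generic notation ($r$, $\beta$, and $\pi_{\mathrm{ref}}$ are not taken from the paper), and the last line is only one assumed way a guide function $q(x,y)$ could reshape the anchor; the paper's exact construction may differ.

```latex
% Standard reverse-KL regularized objective
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x,y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Its closed-form optimum
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big)

% Assumed illustration of a reshaped anchor
\tilde{\pi}_{\mathrm{ref}}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, q(x,y)
```

Because $\pi^{*}$ is a reweighting of $\pi_{\mathrm{ref}}$, responses to which the reference assigns negligible probability stay negligible, which is the anchoring the abstract refers to.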

[NLP-85] SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

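[Quick Read]: This paper addresses the memory growth and decoding bottleneck caused by the KV cache in long-context inference, in particular the High Bandwidth Memory (HBM) traffic on the critical path when compressed states must still be reconstructed into dense vectors during decoding. It treats KV allocation as a rate-distortion problem grounded in attention geometry and introduces two components. Angle-Domain Attention (ADA) stores keys in a spherical parameterization (a scalar radius plus compact angle codes) and computes attention logits directly from these codes in the decode hot loop, without reconstructing dense keys, which reduces HBM traffic. Rate-Distortion Retention (RDR) jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata. Together they form a deployment-oriented mechanism that reduces KV residency while preserving decode efficiency.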

Link: https://arxiv.org/abs/2605.18856
Authors: Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
Notes: ACM classes: I.2.6; I.2.7

Abstract:Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

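[NLP-86] Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

[Quick Read]: This paper addresses unreliable checkpoint selection for multimodal large language models (MLLMs) when performance differences are marginal and evaluation signals are noisy. Existing methods rely on static benchmarks or pointwise scoring, which often misalign with real-world usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. The paper frames checkpoint selection as a robust decision problem under evaluation uncertainty and proposes a multi-stage framework that combines curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols, refining progressively through pointwise filtering, listwise ranking, and pairwise comparison. To make the evaluation more reliable, it adds subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics and penalizes tail failures.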

Link: https://arxiv.org/abs/2605.18852
Authors: Qinwu Xu, Zhuoheng Li, Jessie Salas
Affiliation: Meta AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.
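
The two reliability ideas named in the abstract, percentile-based scoring and subsampling-based confidence, can be sketched as follows. This is a minimal sketch under assumptions: the blend of the median with a low percentile, the function names, and the parameters `low_q`, `tail_weight`, `n_rounds`, and `frac` are hypothetical and are not taken from the paper.

```python
import random


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted values."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def tail_aware_score(scores, low_q=10, tail_weight=0.5):
    """Hypothetical score: blend the median with a low percentile so that
    a checkpoint with bad tail failures scores lower than its median suggests."""
    return (1 - tail_weight) * percentile(scores, 50) + tail_weight * percentile(scores, low_q)


def win_rate(scores_a, scores_b, n_rounds=200, frac=0.5, seed=0):
    """Fraction of random subsamples on which checkpoint A outscores checkpoint B.

    scores_a[i] and scores_b[i] are judge scores for the same evaluation item.
    """
    rng = random.Random(seed)
    idx = list(range(len(scores_a)))
    k = max(1, int(len(idx) * frac))
    wins = 0
    for _ in range(n_rounds):
        sub = rng.sample(idx, k)
        a = tail_aware_score([scores_a[i] for i in sub])
        b = tail_aware_score([scores_b[i] for i in sub])
        wins += a > b
    return wins / n_rounds
```

A win rate near 0.5 suggests the two checkpoints cannot be separated on this data, while a value near 0 or 1 indicates a ranking that is stable under resampling.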

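[NLP-87] The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

[Quick Read]: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off against each other, and at the frontier this interaction is the more informative signal. The paper decomposes paired SWE-bench (coding) and GPQA Diamond (reasoning) scores into a population coupling trend and a per-release residual (the $h$-field), which diagnoses each model's capability emphasis and identifies the most informative next measurement or stress test. Across 34 models from 10 labs (2024–2026), capabilities cooperate overall ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek moved from reasoning-rich to coding-first, Google keeps a consistent reasoning emphasis, and Anthropic oscillates between coding excursions and recovery. Six open-weight architectures confirm a second capability transition at 30–72B parameters; SWE-bench is saturating while HLE and instruction-following still separate models, signaling the next axis rotation. The paper provides a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases.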

Link: https://arxiv.org/abs/2605.18840
Authors: Adil Amin
Affiliation: ZEHEN Labs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 13 pages, 5 figures, 4 tables. Companion paper: “Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling.” Code: this https URL. Dashboard: this https URL

Abstract:Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases – and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024–2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first ($h$: $+11.2 \to -4.7$, 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static – it cascades. Six open-weight architectures confirm a second capability transition at 30–72B, and SWE-bench is now saturating while HLE and instruction-following retain discriminatory spread – signaling the next axis rotation. We provide a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per-lab coupling slopes vary $5\times$ (Google $1.15$ vs. DeepSeek $0.23$), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample ($r$ rises from $+0.72$ to $+0.75$). An interactive dashboard provides phase classification with actionable recommendations, $h$-field diagnostics, per-lab coupling trajectories, ODE-based scaling predictions, benchmark rotation guidance, self-steering demo, and live tracking of all seven predictions: this https URL.
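
One simple way to read the trend-plus-residual decomposition is an ordinary least-squares line with per-model residuals. The sketch below assumes a linear trend with the coding score on the x-axis and the reasoning score on the y-axis, so that a positive residual means more reasoning than the trend predicts; the paper's actual fitting procedure may differ, and `fit_line` and `residuals` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Per-model residual: observed y minus the value the population trend predicts."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Under this reading, the slope plays the role of the coupling trend across the whole population, and each release's residual is a candidate for its emphasis diagnostic.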

[NLP-88] Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

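[Quick Read]: Scaling laws predict loss but do not explain how capabilities such as reasoning and truthfulness interact. The paper measures the coupling between reasoning and truthfulness across 63 base models from 16 families and finds a regime change that is invisible in loss curves: below a family-dependent critical scale $N_c$ the two capabilities anticorrelate, and above it they cooperate. Model size is not the only factor: architecture, data curation, and training recipe each shift $N_c$ independently, and width normalization eliminates the anticorrelation across all tested families. The diagnostic needs only public benchmark scores across a model family to determine the coupling phase and suggests concrete interventions (data curation, width, benchmark rotation). A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error, and the cooperative regime extends to frontier models ($r = +0.72$, 34 models, 10 labs).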

Link: https://arxiv.org/abs/2605.18838
Authors: Adil Amin
Affiliation: ZEHEN Labs
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 15 pages, 8 figures, 2 tables. Companion paper: “The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next.” Code: this https URL. Dashboard: this https URL

Abstract:Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale $N_c$, capabilities anticorrelate; above it, they cooperate. $N_c \approx 3.5$B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift $N_c$ independently: curated training eliminated the coupling dip between Qwen generations ($0.025 \to 0.830$ at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals – only public benchmark scores across a model family. The cooperative regime extends to the frontier ($r = +0.72$, 34 models, 10 labs). Code, data, and an open-source activation-steering tool for any open-weight model are released alongside an interactive dashboard that diagnoses any model’s coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: this https URL.

[NLP-89] Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

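[Quick Read]: Evaluation of foundation models often relies on benchmarks with incomplete coverage, missing metadata, and susceptibility to contamination, which prevents fine-grained evaluation. The paper proposes an automated benchmark generation framework that grounds problems in reference material such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. It uses a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground-truth solutions. For three generated benchmarks (Machine Learning, Corporate Finance, and Personal Finance), expert review finds a significantly lower ground-truth error rate than in earlier benchmarks such as MMLU and GSM8K, and an evaluation of 12 commercial and open-source models shows near-uniform competency coverage and performance differences that existing benchmarks fail to capture.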

Link: https://arxiv.org/abs/2605.18824
Authors: Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour
Affiliation: Vector Institute; York University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

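[NLP-90] Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self Style and Affect

[Quick Read]: This paper asks how to identify the literary primitives in the internal representations of instruction-tuned LLMs and how they combine to produce affect. Using sparse autoencoders on mid-depth residual streams, it finds four feature classes: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. A three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) indicates that the observed coverage is not consistent with chance. The two architectures differ in how they express affect: Llama 3.1 outputs name the target affect more directly, while Gemma 2 outputs evoke it through scene and imagery.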

Link: https://arxiv.org/abs/2605.18808
Authors: Joao Paulo Cavalcante Presa, Savio Salvarino Teles de Oliveira
Affiliation: Federal University of Goias (UFG)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 36 pages, 6 figures

Abstract:We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don’t-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.

[NLP-91] ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

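[Quick Read]: This paper addresses the risk that a large language model abandons an initially correct answer after user criticism (sycophancy), which is especially harmful in scientific reasoning. Focusing only on final-answer accuracy ignores how correctness changes across turns. The proposed ReCrit is a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants (Correction, Sycophancy, Robustness, and Boundary). It rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. Dynamic asynchronous rollout with tail-adaptive completion reduces rollout waiting. ReCrit improves Critic-stage accuracy on three scientific reasoning benchmarks (ChemBench, TRQA, EarthSE), and ablations show that final-answer rewards provide little interaction-level gain compared with transition-aware rewards.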

Link: https://arxiv.org/abs/2605.18799
Authors: Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai
Affiliation: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; National University of Singapore; Chinese University of Hong Kong; University of Oxford; Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at this https URL .
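
The four-quadrant decomposition maps directly onto a small lookup. The sketch below is illustrative only: the quadrant names come from the abstract, but the numeric weights in `DEFAULT_WEIGHTS` (including the sign of the weak Boundary signal) and the function names are assumptions, not the paper's values.

```python
def quadrant(initial_correct, critic_correct):
    """Classify an Initial-to-Critic transition into one of four quadrants."""
    if not initial_correct and critic_correct:
        return "Correction"    # wrong -> right after criticism
    if initial_correct and not critic_correct:
        return "Sycophancy"    # right -> wrong after criticism
    if initial_correct and critic_correct:
        return "Robustness"    # right -> right, the answer is defended
    return "Boundary"          # wrong -> wrong, a persistent error


# Hypothetical weights: reward correction and robustness, penalize
# sycophancy, and give persistent errors only a weak signal.
DEFAULT_WEIGHTS = {
    "Correction": 1.0,
    "Robustness": 1.0,
    "Sycophancy": -1.0,
    "Boundary": -0.1,
}


def transition_reward(initial_correct, critic_correct, weights=None):
    """Reward for one two-turn episode based on its correctness transition."""
    weights = weights or DEFAULT_WEIGHTS
    return weights[quadrant(initial_correct, critic_correct)]
```

A final-answer reward would treat Correction and Robustness identically and Sycophancy and Boundary identically; the transition view separates them, which is the distinction the abstract argues for.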

[NLP-92] UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

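[Quick Read]: This paper addresses the high inference cost of poorly routed LLM cascades: most deployed routers use uncalibrated confidence scores and need per-workload threshold tuning. The proposed UCCI (Uncertainty-Calibrated Cascade Inference) maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves $O(n^{-1/3})$ sample complexity for expected calibration error (ECE). On a named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 and reduces ECE from 0.12 to 0.03, beating entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold at the same operating point. All results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing or nominal API prices.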

Link: https://arxiv.org/abs/2605.18796
Authors: Varun Kotte
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 9 pages, 2 figures, 4 tables. Code: this https URL

Abstract:LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves $O(n^{-1/3})$ sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
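
The calibrate-then-threshold idea can be sketched with a small pool-adjacent-violators (PAV) implementation of isotonic regression. This is a minimal sketch, not the UCCI code: the function names, the block representation, and the simplified constraint in `choose_threshold` (a cap on the mean predicted error of queries kept on the small model) are assumptions standing in for the paper's constrained cost minimization.

```python
def isotonic_fit(scores, errors):
    """Fit a non-decreasing error rate as a function of uncertainty score (PAV).

    errors[i] is 1 if the small model was wrong on query i, else 0.
    Returns a list of (max_score_in_block, error_rate) pairs.
    """
    blocks = []  # each block: [sum_of_errors, count, max_score_in_block]
    for s, e in sorted(zip(scores, errors)):
        blocks.append([float(e), 1, s])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, cnt, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += cnt
            blocks[-1][2] = top
    return [(top, total / cnt) for total, cnt, top in blocks]


def predict(model, score):
    """Calibrated error probability for a raw uncertainty score."""
    for top, p in model:
        if score <= top:
            return p
    return model[-1][1]


def choose_threshold(probs, max_small_error):
    """Largest threshold t such that queries with p <= t (kept on the small
    model) have a mean predicted error of at most max_small_error.
    Returns None if no threshold satisfies the constraint."""
    best = None
    for t in sorted(set(probs)):
        kept = [p for p in probs if p <= t]
        if sum(kept) / len(kept) <= max_small_error:
            best = t
    return best
```

Queries whose calibrated error probability exceeds the chosen threshold would be escalated to the large model. Raising `max_small_error` keeps more traffic on the small model and lowers cost at the price of accuracy.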

Information Retrieval

[IR-0] SAGE: Scalable Automatic Gating Ensemble for Confident Negative Harvesting in Fraud Detection

Link: https://arxiv.org/abs/2605.20157
Authors: Sudheer Tubati, Amit Goyal
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Notes:

Abstract:Music streaming fraud, where bad actors artificially inflate stream counts to manipulate chart rankings and royalty payments, poses a significant threat to streaming services and legitimate content creators. Traditional fraud detection approaches struggle with a critical challenge: many legitimate edge cases, including super-fans and sleep-music sessions, exhibit activity patterns that closely mimic those of coordinated fraud. We present SAGE, a novel counterfactual-aware negative harvesting approach that combines SimHash-based stratified sampling with a modular gating ensemble for confident negative identification from unlabeled data. Our ensemble architecture employs pluggable statistical gates (currently instantiated with Mahalanobis distance and k-NN density) with configurable voting thresholds enabling adaptive precision-recall trade-offs. This addresses the representation bias problem in Positive-Unlabeled learning by ensuring comprehensive coverage of rare behavioral cohorts through floor-constrained sampling. Evaluation demonstrates strong precision and recall on held-out data. The approach generalizes across fraud detection domains, achieving strong performance on both customer-level and artist-level fraud without modification to the core methodology.

[IR-1] BiRD: A Bidirectional Ranking Defense Mechanism for Retrieval Augmented Generation

Link: https://arxiv.org/abs/2605.20123
Authors: Chengcai Gao, Zhihong Sun, Xiaochuan Shi, Qiufeng Wang, Chao Liang
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Notes: 17 pages, 10 figures and 8 tables

Abstract:The growing adoption of Retrieval-Augmented Generation (RAG) has led to a rise in adversarial attacks. Existing defenses, relying on semantic analysis or voting, face a trade-off between high computational cost and limited robustness under strong poisoning attacks. Their fundamental limitation is the exclusive focus on semantic content relevance, while neglecting the retrieval context that is critically defined by ranking structures. To this end, we investigate the bidirectional ranking behavior of poisoned and benign documents, and discover a key discriminative pattern: poisoned documents exhibit significantly stronger alignment between their backward rankings and the query’s forward ranking. Capitalizing on this, we propose BiRD, a bidirectional ranking defense mechanism built upon a dual-signal framework that leverages forward ranking to assess semantic content relevance and backward ranking to quantify ranking context consistency. This design directly addresses the fundamental limitation of prior approaches, enabling simultaneous efficiency and robustness. Extensive evaluation across 3 datasets with 3 retrievers and 3 LLMs under 2 attack scenarios validates BiRD’s effectiveness. Notably, BiRD reduces the attack success rate of PoisonedRAG by up to 54% while simultaneously improving task accuracy by up to 56%, with average additional latency under 1 second.

[IR-2] Auditing Privacy in Multi-Tenant RAG under Account Collusion

Link: https://arxiv.org/abs/2605.19847
Authors: Florian A. D. Burnat, Brittany I. Davidson
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Notes:

Abstract:Multi-tenant retrieval-augmented generation (RAG) services advertise per-account differential privacy as the operative leakage boundary: each account’s queries are guaranteed to satisfy $(\varepsilon_{\text{acc}}, \delta_{\text{acc}})$-DP with respect to the index. We identify same-index multi-account collusion as a privacy-boundary failure: for $k$ same-tenant accounts coordinating against the tenant’s index – the operative regime – known DP composition theory implies joint leakage degrades unconditionally at rate $\Theta(\sqrt{k} \cdot \varepsilon_{\text{acc}})$ for Gaussian-noised retrieval. Cross-tenant and external collusion match the rate only under explicit access-control failure (M4); without M4 these regimes have zero leakage by design and reduce to an architectural audit, not a DP audit. We exhibit an attack realizing the rate and derive a RAG-specific MIA prediction we test empirically. To make this per-account/joint gap auditable, we design the first audit protocol that operates against unmodified RAG deployments and issues a quantitative $(\textsf{PASS}, \varepsilon_{\text{audit}})$ verdict for the retrieval-score channel – the noise-then-select step the per-account DP guarantee actually covers – without index disclosure, pipeline redesign, or model-weight exposure. Generation-channel privacy (LLM output conditioned on selected documents) is a separate audit predicate that should compose with ours; we explicitly scope it out. The protocol composes generic cryptographic primitives (Merkle ledgers, ZK function-application proofs, Gaussian noise attestations) with six RAG-specific primitives (embedder commitment, index-content vector commitment, per-account query ledger, noise-then-select attestation, cross-tenant containment proof, coalition-size estimator) and supports both closed-form audit bounds and Rényi-DP moments-accountant tracking.
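
The $\sqrt{k}$ rate can be seen from a simple averaging argument. The derivation below is a sketch under assumptions not spelled out in the abstract: each of the $k$ colluding accounts issues the same query and receives the same true score $s$ plus independent Gaussian noise of standard deviation $\sigma$, and $\Delta$ denotes the score sensitivity.

```latex
\tilde{s}_i = s + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.}, \quad i = 1, \dots, k

\bar{s} = \frac{1}{k} \sum_{i=1}^{k} \tilde{s}_i \;\sim\; \mathcal{N}\!\left(s, \frac{\sigma^2}{k}\right)

\frac{\Delta}{\sigma / \sqrt{k}} = \sqrt{k} \cdot \frac{\Delta}{\sigma}
```

The coalition's effective noise scale shrinks by $\sqrt{k}$, so the sensitivity-to-noise ratio that governs a Gaussian mechanism's privacy loss grows by the same factor, in line with the $\Theta(\sqrt{k} \cdot \varepsilon_{\text{acc}})$ rate stated in the abstract.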

[IR-3] Divergence Meets Consensus: A Multi-Source Negative Sampling Framework for Sequential Recommendation

Link: https://arxiv.org/abs/2605.19651
Authors: Yuanzi Li, Lingjie Wang, Jingyu Zhao, Zihang Tian, Yuhan Wang, Lei Wang, Xu Chen
Categories: Information Retrieval (cs.IR)
Notes:

Abstract:Negative sampling is significant for training sequential recommendation models under implicit feedback. The predominant strategy, self-guided hard negative sampling, selects negatives based on the model’s current state but suffers from three limitations: (1) the coupling between sampling and model updates triggers a vicious cycle that drives the model into local optima; (2) relying on current model parameters narrows sampling to a small region of the item space, reducing diversity and harming generalization; (3) identifying a hard negative requires scoring the entire candidate pool, causing substantial computational overhead with minimal information gain. To address these challenges, we propose MDCNS (Multi-source Divergence-Consensus for Negative Sampling), a novel “Teacher-Peer-Self” framework inspired by Vygotsky’s Zone of Proximal Development (ZPD) theory. The proposed method comprises three components, including multi-source scoring, divergence re-ranking, and consensus distillation. Firstly, multi-source scoring incorporates peer and ensemble teacher models to inject external negative signals and break the self-reinforcement loop. Then, divergence re-ranking exploits prediction discrepancy between self and peer models to enhance sampling diversity. Finally, consensus distillation aligns the self model with the teacher via KL divergence, simultaneously improving computational cost utilization. Extensive experiments on six real-world datasets and five backbone models show that MDCNS consistently outperforms state-of-the-art negative sampling methods, demonstrating strong effectiveness and generalization.

[IR-4] Understanding Wacky Weights: A Dissection of SPLADE's Learned Term Importance SIGIR2026

Link: https://arxiv.org/abs/2605.19628
Authors: Gregory Polyakov, Harrisen Scells, Carsten Eickhoff
Categories: Information Retrieval (cs.IR)
Notes: 11 pages, 4 figures, accepted at SIGIR 2026

Abstract:Learned sparse retrieval models such as SPLADE combine the effectiveness of neural architectures with the efficiency of inverted indices. As these models assign weights to terms from a fixed vocabulary, interpretability is often touted as a major benefit of these models. However, the emergence of wacky weights, i.e., expansion terms that appear semantically unrelated to the input, limits interpretability. While prior research has anecdotally observed this phenomenon, there is a lack of systematic understanding regarding their origins, prevalence, and contribution to retrieval effectiveness. In this paper, we reproduce SPLADE-v2 to systematically investigate wacky weights across the SPLADE family of models. We present a comprehensive dissection of wacky weights, providing a formal definition of wackiness based on the lexical utility of expansion terms. Furthermore, we introduce a novel measure to compare the prevalence of these tokens across models with varying vocabularies and sparsity levels. Beyond reproducing the original SPLADE-v2, we train it with various loss functions, datasets, and backbone transformers to isolate the factors contributing to wackiness. Our results show that larger vocabularies are associated with a higher prevalence of wacky tokens, while stricter sparsity regularizers are associated with lower prevalence. Finally, we find that wacky weights are used primarily for in-domain effectiveness rather than out-of-domain generalization.

[IR-5] SynGR: Unleashing the Potential of Cross-Modal Synergy for Generative Recommendation ICML2026

Link: https://arxiv.org/abs/2605.18920
Authors: Wei Chen,Xingyu Guo,Shuang Li,Fuwei Zhang,Meng Yuan,Jing Fan,Zhao Zhang,Deqing Wang,Fuzhen Zhuang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2026, 15 pages

Abstract:Generative Recommendation (GR) has emerged as a promising paradigm by formulating item recommendation as a sequence-to-sequence generation task over item identifiers. Recent studies have incorporated multimodal signals to provide richer token-level evidence for generation. However, existing approaches largely rely on alignment-centric fusion and underexplore synergistic information across modalities. In practice, synergistic information plays a critical role in capturing emergent item properties that cannot be inferred from any single modality alone. Such properties encode intrinsic item semantics and guide user preferences, enabling models to move beyond surface-level feature matching. To address this limitation, we propose SynGR, a synergistic generative recommendation framework that explicitly encourages the exploitation of cross-modal dependencies during generation. By constraining overreliance on dominant modalities, SynGR enables the model to capture emergent item semantics beyond shared or modality-specific signals. Extensive experiments across three benchmark datasets demonstrate that SynGR achieves superior performance.

[IR-6] The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection ICLR2026

Link: https://arxiv.org/abs/2605.18857
Authors: Vyzantinos Repantis,Harshvardhan Singh,Tony Joseph,Cien Zhang,Akash Vishwakarma,Svetlana Karslioglu,Michael Wyatt Thot,Ameya Gawde
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 2 figures, 7 tables. Accepted at ICLR 2026 Blog Track, this https URL

Abstract:For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as BoR = \log_2\left(\frac{\mathrm{P}_{\mathrm{obs}}}{\mathrm{P}_{\mathrm{rand}}}\right), where \mathrm{P}_{\mathrm{rand}} is the hypergeometric baseline for the chosen success rule (here, coverage: \geq 1 relevant in top-K). On the 20 Newsgroups dataset, BM25 and SPLADE both report 99% success at K=100 (coverage), yet BoR \approx 0, indicating random-level selectivity at that depth. When the expected coverage ratio \left(\frac{K \cdot \bar{R}_q}{N}\right) exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at K=100, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

MSC classes: 68P20, 68T50, 94A17
ACM classes: H.3.3; I.2.7; I.2.11; I.2.6
Journal reference: ICLR Blog Track 2026, https://iclr.cc/virtual/2026/poster/10012083
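
The BoR definition in the abstract above can be illustrated with a short calculation. This is a minimal sketch, assuming the coverage rule (at least one relevant document in the top K) and a uniform random draw without replacement as the hypergeometric baseline; the function names and the toy corpus sizes are hypothetical and are not taken from the paper.

```python
import math


def p_rand_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Probability that k documents drawn uniformly without replacement
    from n_docs contain at least one of the n_relevant relevant ones."""
    if n_relevant <= 0 or k <= 0:
        return 0.0
    # math.comb returns 0 when k exceeds the number of non-relevant documents,
    # which correctly yields a baseline of 1.0.
    miss = math.comb(n_docs - n_relevant, k) / math.comb(n_docs, k)
    return 1.0 - miss


def bits_over_random(p_obs: float, n_docs: int, n_relevant: int, k: int) -> float:
    """BoR = log2(P_obs / P_rand) under the coverage success rule."""
    p_rand = p_rand_coverage(n_docs, n_relevant, k)
    if p_rand == 0.0:
        raise ValueError("random baseline is zero; BoR is undefined")
    if p_obs <= 0.0:
        return -math.inf
    return math.log2(p_obs / p_rand)


# Toy numbers (assumed): many relevant documents and a deep cutoff, so the
# expected coverage ratio K * R / N = 100 * 50 / 1000 = 5.
dense = bits_over_random(p_obs=0.99, n_docs=1000, n_relevant=50, k=100)
# Toy numbers (assumed): one relevant document in a large corpus.
sparse = bits_over_random(p_obs=0.5, n_docs=1_000_000, n_relevant=1, k=10)
print(f"dense setting BoR:  {dense:.3f} bits")
print(f"sparse setting BoR: {sparse:.3f} bits")
```

In the dense toy setting the random baseline is already close to 1, so a 99% observed success rate carries almost no bits over chance, which mirrors the collapse the abstract describes once the expected coverage ratio exceeds roughly 3-5.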

[IR-7] KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

Link: https://arxiv.org/abs/2605.18850
Authors: Adrian Cierpka,Mohammad Shafiqul Islam,Johannes Steinhülb,Eric Dietriche Sesso Domtchoueng,Michael Selzer,Arnd Koeppe
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:We introduce KadiAssistant, a privacy-by-design AI assistant integrated into the Kadi research data ecosystem, enabling researchers to efficiently access, aggregate, and synthesize information from heterogeneous, privacy-sensitive research data. Interdisciplinary fields such as materials science bring together disciplines with their own terminology and standards. While this convergence fuels innovation, it also makes it increasingly difficult to connect and access knowledge, as data are distributed across disciplines, organizations, and individuals. For example, battery research combines electrochemical measurements, materials characterization data, physics-based simulations, and manufacturing parameters, each using different formats, vocabularies, and standards. Efficiently storing and sharing such heterogeneous data via research data platforms, such as Kadi4Mat, demands domain knowledge, technical expertise, and familiarity with metadata schemas and interfaces. Research data also vary in sensitivity: newly generated ‘warm’ data are often private, whereas published ‘cold’ data are usually openly accessible. The Kadi ecosystem offers fine-grained access control needed for sensitive data. A solution for efficient information retrieval in Kadi must therefore respect the fine-grained access permissions. To address these intertwined challenges of information retrieval, strong data privacy, and complex access control, KadiAssistant combines a self-hosted large language model (LLM) with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer. KadiAssistant therefore bridges terminology and standards, lowers access barriers for researchers, and strengthens the Findable pillar of FAIR data principles.

[IR-8] Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Link: https://arxiv.org/abs/2605.18827
Authors: Prateek Biswas,Dhaval Patel,Vedant Khandelwal,Shuxin Lin,Amit Sheth
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: 28 Pages, 18 Figures

Abstract:Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32, 36.43]. Under a stricter Ab 30% direct-signal gate, the macro difference is +14.11 points. These estimates are descriptive. Assisted inference uses a larger solver-call budget, answer extraction is brittle, Time-MQA contains the observed regressions, and some generated programs violate the no-hard-coding instruction. CGR provides the trace package needed to interpret these results, including direct, assisted, and generator-side answers, partition definitions, generated programs, response metadata, and audits.

[IR-9] PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Link: https://arxiv.org/abs/2605.18812
Authors: Varun Kotte
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Abstract:Modern NLP and LLM systems are pipelines: named entity recognition (NER) → entity disambiguation (NED) → entity typing, retrieval-augmented generation (retriever → reader), and agentic chains of planner → tool → critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - \alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER → NED → entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
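
The reduction described in the abstract above (one conformal quantile on the maximum nonconformity score across stages) can be sketched as follows. This is a minimal illustration that assumes the standard split-conformal quantile at rank ceil((n+1)(1-alpha)); the per-stage scores, the function names, and the calibration data are hypothetical, and the paper's actual score definitions are not reproduced here.

```python
import math
from typing import List, Sequence


def joint_max_threshold(cal_scores: Sequence[Sequence[float]], alpha: float) -> float:
    """Split-conformal threshold on the per-example maximum stage score.

    cal_scores[i] holds the K stage nonconformity scores of calibration
    example i. Returns +inf when the calibration set is too small for
    the requested level.
    """
    joint: List[float] = sorted(max(stage_scores) for stage_scores in cal_scores)
    n = len(joint)
    rank = math.ceil((n + 1) * (1 - alpha))  # standard finite-sample correction
    if rank > n:
        return math.inf
    return joint[rank - 1]


def all_stages_covered(stage_scores: Sequence[float], threshold: float) -> bool:
    """Every stage is covered exactly when the largest stage score is within the threshold."""
    return max(stage_scores) <= threshold


# Hypothetical calibration scores for a two-stage pipeline (7 examples).
calibration = [
    [0.1, 0.05], [0.2, 0.1], [0.15, 0.3], [0.4, 0.2],
    [0.1, 0.5], [0.6, 0.3], [0.2, 0.7],
]
tau = joint_max_threshold(calibration, alpha=0.25)
print("threshold:", tau)
print("covered:", all_stages_covered([0.5, 0.6], tau))
```

Because a single threshold is applied to every stage, only one quantile has to be computed, which is the property the abstract contrasts with per-stage calibration and the Bonferroni bound.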

[IR-10] Towards FairRAG: Preventing Representational Harm in Retrieval-Augmented Generation by Enforcing Fair Exposure at Retrieval Time

Link: https://arxiv.org/abs/2605.18806
Authors: Riddhi Tikoo
Subjects: Information Retrieval (cs.IR)

Abstract:As Large Language Model (LLM) integration has accelerated in high-stakes domains, model hallucination is a critical issue. Retrieval-augmented generation (RAG) is a technique for addressing hallucination; however, RAG’s multi-component pipeline introduces vulnerabilities where biases can be introduced. This study considers two previously developed utility-focused ranking strategies (Standard and Stochastic) alongside two proposed exposure-aware approaches (Forced-Exposure and Representative Stochastic). Using the TREC 2022 Fair Ranking Dataset, which contains Wikipedia articles annotated as protected or non-protected, the LLM was asked to identify relevant articles with citations for four scenario-based QA prompts. The retrieval rankings and the generated outputs were evaluated for exposure bias and utility across all ranking methods. Overall, the Representative Stochastic ranker resulted in a statistically significant near-parity average exposure, acknowledging that relevance scores initially produced during retrieval are already shaped by representational bias, whereas the other rankers assume those scores are unbiased. Across all the methods of document ranking, generation demographic parity closely mirrored the exposure parity, reinforcing that representational bias in RAG systems is driven by retrieval and propagates to generation. These findings highlight that retrieval ranking is a critical point for mitigating downstream bias and propose a Representative Stochastic ranker that reintroduces fairness in RAG systems.

[IR-11] RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

Link: https://arxiv.org/abs/2605.18805
Authors: Imad Aouali,Flavian Vasile,Otmane Sakhi,Alexandre Gilotte,Benjamin Heymann
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Benchmark on LLM Recommendation Agents

Abstract:LLM recommendation agents increasingly produce structured recommendation reports: sets of items accompanied by natural-language justifications. Yet existing evaluations often reduce this setting to reranking small shortlisted candidate sets or judge reports mainly by semantic plausibility. We introduce Recommendation Atlas (Agentic Tool-Level Assessment for Shopping), or RecoAtlas, a benchmark and toolkit for evaluating shopping agents with behavior-grounded metrics. RecoAtlas complements held-out interaction metrics with learned utility proxies for relevance, complementarity, and diversity derived from interaction data, while separately measuring semantic coherence and explanation quality. Its controlled tool environment exposes agents to either semantic, behavior-aligned, or faulty tools, enabling diagnosis of whether performance gains arise from stronger reasoning, better signals, or more effective tool-use policies. Across controlled experiments, we show that RecoAtlas exhibits key properties of a meaningful benchmark for agentic systems: performance scales with model capacity and test-time compute, improves with stronger and better-aligned tools, degrades under noisy or misaligned signals, and reveals that semantic plausibility does not necessarily capture behavior-grounded utility. RecoAtlas provides a foundation for developing and evaluating shopping assistants that optimize not only for plausible recommendations, but also for coherent, behaviorally grounded recommendation sets.

[IR-12] Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance ICML2026

Link: https://arxiv.org/abs/2605.18801
Authors: Shiqiang Wang,Herbert Woisetschläger,Hans Arno Jacobsen,Mingyue Ji
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted to ICML 2026 Position Paper Track

Abstract:Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

[IR-13] Trust or Abstain? A Self-Aware RAG Approach

Link: https://arxiv.org/abs/2605.18792
Authors: Xi Zhu,Ziqi Wang,Kai Mei,Wujiang Xu,Minghao Guo,Bangji Yang,Jiajun Fan,Dimitris N. Metaxas
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Abstract:Retrieval-augmented generation (RAG) improves large language models (LLMs) by incorporating external evidence, but it also introduces knowledge conflicts when retrieved contextual knowledge (CK) and parametric knowledge (PK) disagree or are both unreliable. Existing approaches mainly coordinate which source to use, without explicitly asking whether each answer path is correct. We argue that faithful RAG requires LLM self-awareness, namely the ability to recognize the limits of its own knowledge and reasoning. To ground this problem, we construct a model-specific, ground-truth-aligned knowledge-conflict benchmark by evaluating LLM backbones on PK-only and CK-conditioned answer paths over approximately 69K query-context instances per backbone, drawn from five conflict-QA datasets. We then introduce SABER, a Self-Aware Belief Estimator for RAG that requires no LLM fine-tuning. SABER combines a self-prior with PK-side and CK-side conditional reasoning representations from multi-trace inference, then estimates reliability beliefs with two lightweight predictors to drive a 4-cell decision over trust PK, trust CK, trust either, or abstain. Across four LLM backbones, SABER improves end-to-end accuracy and conflict-specific faithfulness over ten inference-time and fine-tuning baselines, with the largest gains on conflict-heavy datasets. Under abstention, SABER’s risk-coverage curve Pareto-dominates every prompt-based abstainer, providing a tunable balance between coverage and answer risk. Our code is available at this https URL.

[IR-14] A Reproducibility Analysis of PO4ISR: Diagnosing and Mitigating Semantic Drift in LLM-Based Session Recommendation

Link: https://arxiv.org/abs/2605.18780
Authors: Aditya Tiwari,Konduri Naga Lakshmi Rekha,Rajesh Kumar Mundotiya
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Reasoning-based Large Language Models (LLMs) like PO4ISR have set new benchmarks in session-based recommendation. However, the reproducibility of their reasoning capabilities across diverse semantic domains remains unexplored. In this work, we conduct a rigorous reproducibility study of PO4ISR to assess its generalization limits. Our analysis reveals a critical failure mode: standard reasoning prompts suffer from severe contextual drift in long sessions, leading to performance degradation on semantically complex datasets like Games and Bundle. To quantify and resolve this stability gap, we introduce PO4ISR++, a robustness-enhanced implementation that integrates reflexive prompting and consistent rank detection. Unlike the original static prompting strategy, our approach dynamically adapts to cross-domain cues. We benchmark both the original implementation and our robust variant on ML-1M, Games, and Bundle. Our results confirm that while the original model struggles in new domains, our reproducible extension restores performance, yielding a stabilized gain of up to 54% on Games and 96% on Bundle. We release open-source artifacts, including the reproduced baseline and our enhanced framework, to facilitate reliable future research in LLM-based recommendation.

[IR-15] Mask-to-Correct: Leveraging Retriever Diversity for Masking-guided Faithful Fact Correction

Link: https://arxiv.org/abs/2605.18776
Authors: Payel Santra,Lavisha Sharma,Madhusudan Ghosh,Partha Basuchowdhuri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:The rapid spread of misinformation on social media highlights the need for robust, automated fact correction frameworks. However, existing works rely on supervised learning from manually annotated claim-evidence pairs, which are scarce and prone to biases, limiting their generalization across domains. Moreover, these methods overlook semantic faithfulness in their correction process. To address these challenges, we propose Mask-to-Correct (M_2C), a training-free, inference-only Retrieval Augmented Generation (RAG) based framework that leverages diversity-aware masking to identify erroneous spans of claims and evaluate the faithfulness of corrections using retrieved evidence. However, the effectiveness of RAG heavily depends on the choice of retriever, which may vary across queries. To mitigate this, we further introduce M_2C^+, an ensemble-based framework that combines corrections across multiple rankers to reduce retrieval bias and improve robustness. Extensive experiments on the benchmark datasets demonstrate that our proposed frameworks consistently outperform all baselines, achieving up to 14% improvement in SARI scores, without using gold evidence.

[IR-16] Query-Aware Flow Diffusion for Graph-Based RAG with Retrieval Guarantees ICLR

Link: https://arxiv.org/abs/2605.18775
Authors: Zhuoping Zhou,Davoud Ataee Tarzanagh,Sima Didari,Wenjun Hu,Baruch Gutow,Oxana Verkholyak,Masoud Faraki,Heng Hao,Hankyu Moon,Seungjai Min
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Published at the International Conference on Learning Representations (ICLR) 2026. 38 pages, 5 figures, 10 tables

Abstract:Graph-based Retrieval-Augmented Generation (RAG) systems leverage interconnected knowledge structures to capture complex relationships that flat retrieval struggles with, enabling multi-hop reasoning. Yet most existing graph-based methods suffer from (i) heuristic designs lacking theoretical guarantees for subgraph quality or relevance and/or (ii) the use of static exploration strategies that ignore the query’s holistic meaning, retrieving neighborhoods or communities regardless of intent. We propose Query-Aware Flow Diffusion RAG (QAFD-RAG), a training-free framework that dynamically adapts graph traversal to each query’s holistic semantics. The central innovation is query-aware traversal: during graph exploration, edges are dynamically weighted by how well their endpoints align with the query’s embedding, guiding flow along semantically relevant paths while avoiding structurally connected but irrelevant regions. These query-specific reasoning subgraphs enable the first statistical guarantees for query-aware graph retrieval, showing that QAFD-RAG recovers relevant subgraphs with high probability under mild signal-to-noise conditions. The algorithm converges exponentially fast, with complexity scaling with the retrieved subgraph size rather than the full graph. Experiments on question answering and text-to-SQL tasks demonstrate consistent improvements over state-of-the-art graph-based RAG methods.
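
The query-aware traversal idea in the abstract above (edges weighted by how well their endpoints align with the query embedding) can be illustrated with a toy weighting function. This is a rough sketch under stated assumptions: the weight is taken to be the product of the two endpoints' clipped cosine similarities to the query, which is a choice made here for illustration and not the paper's formula. The flow diffusion step itself is not shown, and all names and embeddings are hypothetical.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_aware_edge_weights(
    edges: Iterable[Tuple[str, str]],
    node_emb: Dict[str, Sequence[float]],
    query_emb: Sequence[float],
) -> Dict[Tuple[str, str], float]:
    """Assumed weighting: product of both endpoints' non-negative query alignment."""
    weights = {}
    for u, v in edges:
        align_u = max(cosine(node_emb[u], query_emb), 0.0)
        align_v = max(cosine(node_emb[v], query_emb), 0.0)
        weights[(u, v)] = align_u * align_v
    return weights


# Hypothetical 2-d embeddings: "a" and "b" align with the query, "c" does not.
embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
weights = query_aware_edge_weights([("a", "b"), ("b", "c")], embeddings, [1.0, 0.0])
print(weights)
```

Under this assumed weighting, an edge that is structurally present but leads to a node unrelated to the query receives a weight near zero, so any traversal that follows edge weights is steered away from it.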

[IR-17] M3DocDep: Multi-modal Multi-page Multi-document Dependency Chunking with Large Vision-Language Models CVPR2026

Link: https://arxiv.org/abs/2605.18774
Authors: Joongmin Shin,Jeongbae Park,Jaehyung Seo,Heuiseok Lim
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted to CVPR2026 Main

Abstract:In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document’s true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.

[IR-18] Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

Link: https://arxiv.org/abs/2605.18772
Authors: Gongbo Zhang,Yifan Peng,Chunhua Weng
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Abstract:Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.

[IR-19] LWGR: Lagrangian-Constrained Personalized World Knowledge for Generative Recommendation

Link: https://arxiv.org/abs/2605.18771
Authors: Lingyu Mu,Hao Deng,Haibo Xing,Kaican Lin,Zhitong Zhu,Yu Zhang,Xiaoyi Zeng,Zhengxiao Liu,Zheng Lin,Jinxin Hu
Subjects: Information Retrieval (cs.IR)

Abstract:Recent progress in large language model (LLM) based generative recommendation (GR) shows that leveraging LLM world knowledge can substantially improve performance. However, existing methods rely on fixed, manually designed instructions to generate semantic knowledge and directly incorporate it into GR, which has two limitations. First, fixed instructions cannot capture the multidimensional heterogeneity of user interests. Second, uncontrollable knowledge fusion may conflict with behavioral signals and harm recommendations. To address these limitations, we propose LWGR, a framework that leverages Lagrangian constraints to transfer users’ personalized world knowledge from LLMs into generative recommendation. LWGR enhances GR along two axes: knowledge extraction and fusion. It builds personalized soft instructions to extract behavior-relevant LLM world knowledge, and formulates knowledge fusion as an optimization problem with explicitly bounded performance degradation, which is solved by a Lagrangian primal-dual method to selectively incorporate beneficial knowledge. We further design two training strategies for different LLM scales and a deployment scheme that combines nearline precomputation with lightweight online serving. Experiments on multiple public datasets and one industrial dataset show that LWGR outperforms eight state-of-the-art baselines by up to 11.23% and brings a 1.35% revenue lift on a large-scale advertising platform, demonstrating its effectiveness and practicality.

[IR-20] Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI

Link: https://arxiv.org/abs/2605.18770
Authors: Arthur Capozzi,Dirk Helbing
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:We present a collaborative agentic GraphRAG framework for expert analysis of commercial registry data. Public registries are often formally accessible, yet difficult to use in practice because they combine structured records with large volumes of unstructured legal text. This limits conventional keyword and vector-only retrieval, especially for multi-hop, temporal, and entity-centric investigations. Our approach builds a Neo4j knowledge graph through a three-phase pipeline: (i) deterministic ingestion of strong nodes from verified structured fields, (ii) LLM-based extraction of weak nodes from unstructured notices, and (iii) deterministic identity resolution and deduplication. On top of this graph, we introduce an analytical modular agent that integrates zero-shot intent routing, a bounded reflection loop, secure tool-mediated graph access, and state-aware response synthesis. A human-in-the-loop dashboard exposes evidence and execution traces to support transparency and auditability. We evaluate the framework on the Swiss Official Gazette of Commerce, a multilingual corpus of more than seven million publications over seven years. We further contribute a multi-tier evaluation protocol covering entity-resolution precision, tool-routing behavior, answer quality, and multi-turn conversational performance. Across automated, human-curated, and conversational benchmarks, the proposed agentic GraphRAG system consistently outperforms a standard agentic vector-RAG baseline, with strong gains in correctness, answer relevance, information recall, turn success rate, and context carryover accuracy. The architecture is modular, reproducible, and transferable to other commercial gazettes and public-sector registry systems.

[IR-21] ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation ACL2026

Link: https://arxiv.org/abs/2605.18769
Authors: Gibson Nkhata,Uttamasha Anjally Oyshi,Quan Mai,Susan Gauch
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 2 figures, to be published in the proceedings of ACL 2026

Abstract:Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user’s profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.

[IR-22] DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking

Link: https://arxiv.org/abs/2605.18767
Authors: Litong Zhang,Jiaxin Li,Kuo Zhao
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Multi-hop question answering requires aggregating information from multiple documents, a critical capability for knowledge-intensive applications. A fundamental challenge lies in efficiently identifying the minimal relevant document set from retrieved candidates while maintaining high recall. We present an efficient dual-view cascaded reranking framework for multi-hop document reranking. Operating as a lightweight post-retrieval stage over E5-base-v2 candidates, our architecture comprises: (1) a Local Scorer employing stacked cross-attention for fine-grained query-document relevance; and (2) a Global Scorer modeling inter-document dependencies via Transformer-based context aggregation. These views are dynamically fused through an Adaptive Gate conditioned on query semantics. Under the fixed candidate set reranking setting with offline cached embeddings, our model achieves competitive results, particularly outstanding on MuSiQue with 99.4% Top-4 Recall and 97.8% Full Hit accuracy at 4.0 ms latency (249 QPS). It substantially outperforms 600M-parameter cross-encoders (BGE-Large: 92.0% Recall, Jina-v3: 90.1% Recall) while maintaining 5 to 6 times lower latency. Ablation studies validate that both Local and Global views contribute substantially to multi-hop performance.
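
The gated fusion idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the gate's formula, so everything here is an assumption: a single scalar gate computed as a sigmoid over a linear function of a query vector, and the hypothetical names `fuse_scores`, `gate_weights`, and `top_k`. It is not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(local_scores, global_scores, query_vec, gate_weights, gate_bias=0.0):
    """Blend per-document local and global scores with one query-conditioned gate.

    Assumption: the gate is a scalar g = sigmoid(w . q + b), and the fused
    score is g * local + (1 - g) * global. The paper may use a different form.
    """
    if len(local_scores) != len(global_scores):
        raise ValueError("local and global score lists must have equal length")
    g = sigmoid(sum(w * q for w, q in zip(gate_weights, query_vec)) + gate_bias)
    return [g * l + (1.0 - g) * s for l, s in zip(local_scores, global_scores)]


def top_k(scores, k):
    """Return indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


if __name__ == "__main__":
    fused = fuse_scores([0.9, 0.2, 0.4], [0.1, 0.8, 0.5], [1.0, 0.0], [0.5, 0.5])
    print(top_k(fused, 2))
```

With a gate near 1 the ranking follows the local (query-document) view; with a gate near 0 it follows the global (inter-document) view.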

[IR-23] Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method ACL2026

Link: https://arxiv.org/abs/2605.18766
Authors: Taehee Kim,Seungbin Yang,Jihwan Kim,Jaegul Choo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ACL 2026 Findings

Click to view abstract

Abstract:Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at this https URL.
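
To make the contrast with fixed top-k retrieval concrete, here is a small sketch of one possible adaptive cutoff. The abstract does not specify the thresholding rule, so the relative-to-best rule, the `ratio` and `min_score` parameters, and the function name `retrieve_adaptive` are all assumptions for illustration. Similarity scores are assumed to lie in [0, 1].

```python
def retrieve_adaptive(scores, ratio=0.8, min_score=0.0):
    """Return table names whose similarity clears a query-dependent cutoff.

    Assumption: the cutoff is `ratio` times the best score for this query
    (never below `min_score`), so the number of returned tables varies
    per query instead of being fixed at k.
    """
    if not scores:
        return []
    best = max(scores.values())
    cutoff = max(min_score, ratio * best)
    kept = [name for name, score in scores.items() if score >= cutoff]
    # Highest-scoring tables first.
    return sorted(kept, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    # A query needing two tables and a query needing one.
    print(retrieve_adaptive({"orders": 0.9, "customers": 0.8, "logs": 0.3}))
    print(retrieve_adaptive({"orders": 0.9, "logs": 0.2}))
```

The first call keeps two tables and the second keeps one, which is the behavior a fixed k cannot provide for both queries at once.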

[IR-24] STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation

Link: https://arxiv.org/abs/2605.18765
Authors: Shuai Li,Chen Huang,Duanyu Feng,Wenqiang Lei,See-Kiong Ng
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:To augment Large Language Models (LLMs) for multi-hop question answering, a mainstream solution within Graph Retrieval Augmented Generation (GraphRAG) leverages lightweight retrievers to efficiently extract information from a given Knowledge Graph (KG). However, existing methods often overlook the inherent challenge of sparse semantic information in graphs. Specifically, our experiments reveal that these methods produce biased retrieval, namely Semantic Shortcut Bias and Long-Tail Path Bias, leading to inadequate semantic modeling and limited GraphRAG effectiveness. To address these issues, we propose STAR, a semantic-tuned and tail-adaptive retriever for GraphRAG. STAR integrates two key learning paradigms: token-level interaction learning and path-weighted contrastive learning. The former employs a cross-attention architecture and a hard path mining mechanism to jointly model the query and path, thereby mitigating the Semantic Shortcut Bias. The latter introduces a tailored contrastive learning objective that utilizes tail-adaptive path weighting, designed to optimize the training process and ease the Long-Tail Path Bias. Extensive experiments demonstrate that STAR consistently outperforms baselines, achieving average retrieval performance gains of 1.8% and LLM QA performance improvements of 2.2% across all benchmark datasets. Our code is available at this https URL.

[IR-25] From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists

Link: https://arxiv.org/abs/2605.18764
Authors: Hyacinth Ali,Jessie Galasso-Carbonnel,Houari Sahraoui
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Artificial Intelligence (AI) pipelines have become integral to modern research, supporting fields such as Medical Sciences, Agriculture, and Social Sciences, and enabling large-scale data analysis, predictive modeling, and the automation of complex tasks. However, designing and implementing AI solutions remains challenging for many researchers due to the expertise required in the design and development of end-to-end AI systems. To address this gap, we present Domain-Driven Adaptable AI Pipelines (DDAP), a controlled, human-in-the-loop, agentic framework that leverages large language models to guide users in a systematic construction of AI pipelines and their corresponding implementation code. DDAP structures the development process into four stages: problem definition, compute environment specification, pipeline generation, and code generation. Through this staged interaction, the framework adapts to domain context, user expertise, and resource constraints, while maintaining user control over key decisions. We evaluate DDAP across multiple datasets spanning business, biology, and health science domains by comparing its AI models against expert-developed models. The experimental results show that DDAP achieves competitive results in several tasks compared to expert baselines, although performance varies across problem types, particularly for text-based clustering tasks. By combining guided interaction, adaptability, and reproducibility, DDAP demonstrates that a controlled agentic framework can generate competitive AI pipelines for non-expert users.

[IR-26] Query-Conditioned Graph Retrieval for Contextualized LLM Reasoning in Personalized Wearable Data

Link: https://arxiv.org/abs/2605.18763
Authors: Zhenyu Lu,Mahyar Abbasian,Amir M. Rahmani
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are increasingly applied to analyzing wearable sensing data, which are long-term, multimodal, and highly personalized. A key challenge is context selection: providing insufficient context limits reasoning, while including all available data leads to inefficiency and degraded generation quality. We propose Wearable As Graph (WAG), a graph-based context retrieval framework that enables query-adaptive reasoning over wearable data with LLMs. WAG organizes wearable metrics and user-specific signals into a personalized knowledge graph, and retrieves a query-conditioned subgraph to support downstream generation. The retrieval process integrates global relationships, capturing prior knowledge and population- and individual-level patterns via hierarchical Bayesian modeling, with local relationships that reflect short-term signal deviations. A query openness signal further controls retrieval breadth. We evaluate WAG on over 10,000 data-grounded queries from real-world wearable datasets. Across LLM-based and human evaluations, WAG achieves an approximately 70% win rate over baseline and standard RAG methods, demonstrating the effectiveness of structured, query-adaptive context retrieval for LLM-driven analysis of wearable data.

[IR-27] ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation

Link: https://arxiv.org/abs/2605.18762
Authors: Xingyu Lyu,Jianfeng He,Ning Wang,Yidan Hu,Tao Li,Danjue Chen,Shixiong Li,Yimin Chen
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) is widely used to augment large language models with external knowledge retrieval to improve reliability and generalization. However, recent studies have shown that RAG systems remain vulnerable to data extraction attacks, where adversaries can extract private data by embedding malicious commands into user queries. Despite their feasibility, existing attacks typically suffer from low data extraction rates and limited practical effectiveness. Here, we propose ALDEN, a novel attack that effectively and efficiently extracts private data from RAGs. First, we employ active learning to diversify malicious queries and improve data extraction rates. Second, we observe that the data distribution of the underlying knowledge base provides valuable guidance for query generation and introduce a decay-based dynamic algorithm to estimate the corresponding topic distribution. By combining them together, we demonstrate that ALDEN substantially outperforms state-of-the-art methods through comprehensive evaluations.

[IR-28] DOTRAG: Retrieval-Time Reasoning Along Paths

Link: https://arxiv.org/abs/2605.18760
Authors: Larnell Moore,Naihao Deng,Rada Mihalcea,Farnaz Jahanbakhsh
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Graph Retrieval-Augmented Generation (GraphRAG) is dominated by a retrieve-then-reason paradigm, where context is retrieved using heuristics and then reasoned over. Such methods struggle to adapt to the query-specific logic required for complex multi-hop tasks, often accumulating irrelevant context or missing correct relational paths. We propose DotRAG, a training-free GraphRAG framework that reformulates retrieval as a reasoning process over paths. Our approach generates query-conditioned constraints that guide graph exploration, prune irrelevant regions, and iteratively discover relational paths without relying on explicit step-by-step reasoning chains. We introduce Division of Thought (DOT), an abstraction that decomposes retrieval into localized search spaces and adapts the search strategy to each query. DotRAG achieves SOTA performance on MetaQA and UltraDomain, with consistent gains on multi-hop tasks, demonstrating the effectiveness of reasoning-guided retrieval.

[IR-29] Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

Link: https://arxiv.org/abs/2605.17809
Authors: Chun-Hsiung Tseng,Hao-Chiang Koong Lin,Andrew Chih-Wei Huang,Yung-Hui Chen,Jia-Rou Lin
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.

Human-Computer Interaction

[HC-0] Less Back-and-Forth: A Comparative Study of Structured Prompting

Link: https://arxiv.org/abs/2605.20149
Authors: Saurav Ghosh,Gabriella Polach,Abdou Sow
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 7 pages, 2 figures, 6 tables

Click to view abstract

Abstract:Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types–summarization, planning, explanation, and coding–using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

[HC-1] Journeys of Parents with LGBTQ Children: How Trauma and Healing Reshape Identity and (Mis)Informating Practices

Link: https://arxiv.org/abs/2605.20024
Authors: Soonho Kwon,Dong Whi Yoo,Koustuv Saha,Shaowen Bardzell,Younah Kang
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
Comments:

Click to view abstract

Abstract:This study examines how parents of LGBTQ+ individuals in South Korea navigate the emotional rupture fueled by fear, isolation, and disorientation after learning their children’s queer identity, encounter queer-related (mis)information as a way of coping with this emotional toll, and come to listen to queer realities relationally. Through this process, we highlight how parents reconstruct their identities as supportive parents, which reshapes their informating practices, making them more critical in assessing queer-related (mis)information, developing strategies to protect themselves from harmful narratives, and actively challenging misinformation to support others navigating similar experiences. This work contributes to CSCW by (1) foregrounding parents of LGBTQ+ individuals, an underrepresented yet critical stakeholder group in Queer HCI; (2) demonstrating how identity reconfiguration following a trauma-healing process could transform information practices; and (3) arguing that addressing misinformation requires attention beyond individual fact-based discerning to account for its relational, cultural, and emotional dimensions. Further, we invite CSCW scholars to reconsider the balance between abstracting and humanizing information, explore future design possibilities for parents of LGBTQ+ children, and reflect on the role of researchers as participants in collective research communities fueled by care. Related DOI: https://doi.org/10.1145/3816958

[HC-2] From Role to Person: Trust Calibration Challenges in Twin Agents

Link: https://arxiv.org/abs/2605.19838
Authors: Hugo Andersson,Niklas Elmqvist
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to AutomationXP26 Workshop at CHI 2026, Barcelona, Spain. Non-archival

Click to view abstract

Abstract:Agentic AI has taken on the role of assistant, collaborator, and decision-support tool. We argue the next role on that list is more personal: you. These are digital twins of each individual – twin agents – representing their knowledge, perspective, and communicative style to colleagues when they are unavailable. Drawing on early design work in an ongoing project in which agents represent knowledge workers in a professional setting, we identify a trust calibration problem specific to this approach. When a human colleague doubts a twin agent’s output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them. Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker. However, twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle. We introduce the concept, distinguish it from digital twins, and outline the research questions this new class of agent demands.

[HC-3] Material for Thought: Generative AI as an Active Creative Medium

Link: https://arxiv.org/abs/2605.19832
Authors: Hugo Andersson,Niklas Elmqvist
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted to the CHI 2026 Tools for Thought Workshop, Barcelona, Spain. Non-archival

Click to view abstract

Abstract:Human-AI collaboration research has largely positioned the human as a judge of AI output, centering effort on evaluating whether recommendations are reliable enough to accept. This decision-support framing leaves little room for the human as creator. We argue that for creative work, this framing misdirects human effort toward evaluating correctness rather than exploring and shaping the creative space. Drawing on Schön’s theory of reflective practice, we propose an alternative: treating generative AI as an active creative medium. As a potter works with clay, humans Shape, Observe, Stir, and Select (SOSS) their medium through ongoing conversation. Where generative AI actively tends toward convergence and resolution, the human role of disruption and curation becomes essential for sustaining creative quality. We present a creative writing probe, Loom, in which users orchestrate simulated narrative agents. We also introduce the SOSS framework for this mode of engagement, and discuss design implications.

[HC-4] AffectAI-Capture: A Reproducible Multimodal Protocol for Small-Group Meeting Research

Link: https://arxiv.org/abs/2605.19794
Authors: Meisam Jamshidi Seikavandi,Alice Modica,Anna Obara,Fabricio Batista Narcizo,Tanya Ignatenko,Ted Vucurevich,Jesper Bünsow Boldt,Paolo Burelli,Andrew Burke Dittberner
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:We present AffectAI-Capture, a protocol for collecting synchronized multimodal data in four-person meeting-like interactions, combining eye tracking, wearable physiology, close-talk and room audio, multi-view video, event logging, and structured self-report. Sessions use fixed task blocks grounded in established group-interaction paradigms, while acquisition and post-processing are organized around a single authoritative event timeline and standardized outputs. We describe the experimental rationale, synchronization philosophy, data organization, and practical trade-offs. Pilot-level validation of audio quality and video synchronization has been conducted using controlled bench tests; full protocol sessions with participants remain ongoing work. The contribution is a reproducible protocol architecture linking task design, instrumentation, timing provenance, and data packaging for affective, behavioral, and meeting-analytics research.

[HC-5] TombWriter: Scaffolding Story Archeology through Beat-Level Interaction in Human-AI Co-Writing

Link: https://arxiv.org/abs/2605.19681
Authors: Hugo Andersson,Niklas Elmqvist
Categories: Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:The dominant paradigm for LLM interaction in AI co-writing uses disposable prompts that vanish after use. This may lead to imprecise results, cumbersome workflows, and diminished author agency and ownership. We propose LLM-based story archeology, where prompts serve as a hierarchical story instrument refined over time to extract the writer’s intended story. Drawing on the fossil theory of storytelling, where stories exist as latent structures that writers excavate through their craft, this approach supports agency and ownership through high involvement and control. Writers work at the level of story beats rather than prose. They generate character actions in scenes to discover emergent possibilities, simulated by the LLM or directly nudged, then edit resulting beats to refine scenes iteratively. Prose is generated from beats based on style and genre, separating structure from style. We developed TombWriter, a web-based tool that visualizes stories as navigable cards – characters, scenes, and beats – through a five-stage narrative pipeline. We conducted a qualitative study with five experienced writers who used the system over three days. Through semi-structured interviews, we found that writers framed AI as a generation engine rather than collaborator, claimed ownership while reporting voice loss, and valued the system for structural discovery rather than prose production. We contribute the story archeology approach, the TombWriter system, and qualitative findings on beat-level human-AI co-writing.

[HC-6] The Accessibility Capability Boundary: Operational Limits and Expansion Potential of AI-Generated Browser-Native Accessibility Systems

Link: https://arxiv.org/abs/2605.19638
Authors: Rizwan Jahangir,Daisuke Ishii
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments: 21 pages, 4 figures

Click to view abstract

Abstract:As large language models (LLMs) demonstrate increasing competence in synthesizing functional user interfaces, a fundamental question emerges in accessibility computing: how far can AI-driven accessibility systems go? This paper introduces the Accessibility Capability Boundary (ACB), a formal framework for reasoning about the operational limits and expansion potential of autonomous accessibility systems, and grounds this theory in a real-world systems artifact. We model accessibility not as a binary compliance property but as a dynamic, multidimensional capability space constrained by measurable variables including deployment latency, cognitive load, infrastructure dependency, offline persistence, interaction complexity, and adaptability. We argue that AI-generated, browser-native systems constructed as single-file HTML artifacts leveraging standard browser APIs may dramatically shift the ACB outward by reducing deployment friction to near-zero and enabling rapid, context-specific interface adaptation. We ground our theoretical framework in the analysis of two real-world exploratory prototypes. The first is an AI-generated browser-native accessibility interface deployed for a blind user in Nepal. The second is a fully functional, open-source webcam alignment assistant for visually impaired users, serving as a concrete systems artifact. Through formal definitions, propositions, and a comparative evaluation matrix, we characterize the regions of the accessibility capability space that such systems can and cannot reach. We further identify remaining computational, infrastructural, and verification constraints that constitute the hard boundaries of this paradigm. This work contributes a theoretical foundation for understanding the scalable limits of autonomous accessibility computing and proposes a research agenda for future work in accessibility-aware AI systems.

[HC-7] CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

Link: https://arxiv.org/abs/2605.19484
Authors: Haobo Hu,Xiangwu Guo,Zhiheng Chen,Difei Gao,Haotian Liu,Libiao Jin,Qi Mao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our this http URL current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

[HC-8] Once Again with Style: Understanding and Supporting Partial Reuse in Dashboard Authoring

Link: https://arxiv.org/abs/2605.19400
Authors: Nicole Sultanum,Gustavo Moreira,Arjun Srinivasan
Categories: Human-Computer Interaction (cs.HC)
Comments: 5 pages, 2 figures

Click to view abstract

Abstract:Presentation-oriented tasks including formatting and layout design are critical but often neglected aspects of dashboard authoring given their labor intensive nature. In this work, we follow a user-centered design approach to explore ways that partial reuse of pre-existing dashboards may support the dashboard design process. Based on collective feedback from 10 professional dashboard creators, we contribute: (a) findings from a formative study characterizing dashboard reuse needs and challenges; and (b) reflections and opportunities from a concept validation study with ReDash, a design probe for partial reuse of dashboard presentation features (style and layout) from multiple sources.

[HC-9] Toward User Comprehension Supports for LLM Agent Skill Specifications

Link: https://arxiv.org/abs/2605.19362
Authors: Zikai Alex Wen
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To appear at ACM CAIS Workshop Agent Skill 2026

Click to view abstract

Abstract:Users often interpret and select agent skills through their this http URL specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n = 6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.
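
Rule-based coding of the kind described here can be sketched as keyword matching over specification text. The abstract names the four anchors but not the rules, so the regular expressions below, along with the names `ANCHOR_PATTERNS`, `code_spec`, and `coverage`, are purely illustrative assumptions and not the author's codebook.

```python
import re

# Hypothetical cue patterns, one per comprehension anchor named in the abstract.
ANCHOR_PATTERNS = {
    "operational_basis": re.compile(r"\b(requires?|uses?|depends on|input)\b", re.I),
    "output_contract": re.compile(r"\b(outputs?|returns?|produces?)\b", re.I),
    "boundary_disclosure": re.compile(r"\b(does not|cannot|limitations?|out of scope)\b", re.I),
    "example_demonstration": re.compile(r"\b(examples?|sample|expected (output|outcome))\b", re.I),
}


def code_spec(text):
    """Mark which anchors have at least one textual cue in a specification."""
    return {anchor: bool(p.search(text)) for anchor, p in ANCHOR_PATTERNS.items()}


def coverage(specs):
    """Fraction of specifications showing each anchor, plus all four at once."""
    coded = [code_spec(s) for s in specs]
    total = len(coded)
    if total == 0:
        return {}
    result = {a: sum(c[a] for c in coded) / total for a in ANCHOR_PATTERNS}
    result["all_four"] = sum(all(c.values()) for c in coded) / total
    return result


if __name__ == "__main__":
    specs = [
        "Requires a pcap file. Returns a JSON report. Example: run on demo.pcap. "
        "Does not handle encrypted DNS.",
        "Parses DNS logs.",
    ]
    print(coverage(specs))
```

Keyword rules like these detect the presence of a cue, not its quality, which matches the abstract's framing of measuring "textual cues" rather than comprehension itself.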

[HC-10] When Web Apps Heal Themselves: A MAPE-K Based Approach to Fault Tolerance and Adaptive Recovery

Link: https://arxiv.org/abs/2605.19261
Authors: Sales Aribe Jr,Rov Japheth Oracion
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
Comments: 12 pages, 3 figures, 2 tables

Click to view abstract

Abstract:Ensuring the reliability and resilience of modern web applications remains a critical challenge due to increasing system complexity and dynamic runtime environments. This study proposes a modular self-healing framework based on the monitor-analyze-plan-execute over a shared knowledge base (MAPE-K) model, integrated with an AutoFix-inspired mechanism for adaptive fault recovery. Using a design and development research (DDR) approach, the system was implemented and evaluated through controlled fault injection experiments across twenty runtime failure scenarios, including service crashes, memory leaks, and database disconnections. Experimental results demonstrate that the proposed framework achieved a mean fault detection F1-score of 90.7% and a recovery success rate of 93.2%. The AutoFix module reduced the average time-to-recovery (TTR) by 56.2%, achieving an average recovery time of 3.92 seconds. System throughput was maintained between 88% and 95% during fault conditions, with only a 3.1% increase in response time. Additionally, iterative feedback mechanisms improved recovery efficiency by 18.6% over multiple cycles. These findings indicate that the proposed framework provides a practical and extensible approach to enhancing fault tolerance in web applications through feedback-driven adaptation. While the current implementation relies on predefined recovery strategies, the integration of learning-oriented feedback establishes a foundation for future development of more autonomous self-healing systems.
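
The monitor-analyze-plan-execute loop over a shared knowledge base can be sketched in a few lines. The fault names, metric keys, thresholds, and recovery actions below are hypothetical placeholders chosen to mirror the failure scenarios listed in the abstract; this is a generic MAPE-K skeleton, not the paper's framework or its AutoFix module.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Knowledge:
    """Shared knowledge base: predefined strategies plus a recovery log."""
    strategies: dict = field(default_factory=lambda: {
        "service_crash": "restart_service",   # hypothetical action names
        "memory_leak": "recycle_worker",
        "db_disconnect": "reconnect_db",
    })
    history: list = field(default_factory=list)


def monitor(metrics: dict) -> dict:
    """Monitor: take a snapshot of the raw runtime metrics."""
    return dict(metrics)


def analyze(observed: dict) -> Optional[str]:
    """Analyze: map symptoms to a fault type (thresholds are assumptions)."""
    if not observed.get("alive", True):
        return "service_crash"
    if observed.get("memory_mb", 0) > 1024:
        return "memory_leak"
    if not observed.get("db_connected", True):
        return "db_disconnect"
    return None


def plan(fault: str, knowledge: Knowledge) -> Optional[str]:
    """Plan: look up a predefined recovery strategy for the fault."""
    return knowledge.strategies.get(fault)


def execute(fault: str, action: str, knowledge: Knowledge) -> str:
    """Execute: here we only record the action; a real system would run it."""
    knowledge.history.append((fault, action))
    return action


def mape_k_step(metrics: dict, knowledge: Knowledge) -> Optional[str]:
    """Run one full MAPE-K cycle and return the action taken, if any."""
    fault = analyze(monitor(metrics))
    if fault is None:
        return None
    action = plan(fault, knowledge)
    return execute(fault, action, knowledge) if action else None
```

The `history` list is where a feedback mechanism could later rank strategies by past success, which is the kind of iterative improvement the abstract reports.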

[HC-11] Platform architecture determines whether recommendation algorithms can shape information quality on social media

Link: https://arxiv.org/abs/2605.19204
Authors: Mohammad Hammas Saeed,David A. Broniatowski,Joseph Simons,Erica Gralla,Manan Suri,Giovanni Luca Ciampaglia
Categories: Social and Information Networks (cs.SI); Human-Computer Interaction (cs.HC)
Comments: 36 pages, 6 figures, 21 tables

Click to view abstract

Abstract:Social media platforms shape public discourse through two fundamental design choices that naturally co-occur in any field investigation: platform architecture, which defines what types of actors exist and how they interact, and recommendation algorithm, which determines what content is surfaced to users. Using agent-based simulation, we orthogonally manipulate both factors, exploring four prototypical architectures – tree (e.g., Reddit), layered hierarchy (e.g., Facebook), network (e.g., Twitter), and complete graph (e.g., TikTok) – and two algorithms: chronological (LIFO) and popularity-based (Hot). Drawing on prior theory that identifies and ranks canonical system architectures in terms of their flexibility, we hypothesize that algorithmic effects on information spread and quality should be largest on the most flexible platforms and smallest on the most constrained ones. We find strong confirmation of this prediction. On tree-like platforms like Reddit, the algorithm has no detectable effect on information spread and quality. On layered hierarchies and networks like Facebook and Twitter, respectively, the Hot algorithm has modest positive effects on both the spread of information and its quality. On complete structures like TikTok, the Hot algorithm leads to winner-take-all dynamics that have strong negative effects on both information spread and quality, making the relation between content quality and popularity unpredictable. These findings imply that architectural considerations are more powerful levers than algorithmic interventions for the design of healthy online spaces and public discourse. Platform reform efforts focused exclusively on algorithm choice may be insufficient on architecturally unconstrained platforms and unnecessary on architecturally constrained ones.

[HC-12] Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South

Link: https://arxiv.org/abs/2605.19190
Authors: Charvi Rastogi,Mukul Bhutani,Minsuk Kahng,Shamsuddeen Hassan Muhammad,Evgeniia Razumovskaia,Priyanka Suresh,Ibrahim Said Ahmad,Charu Kalia,Yaaseen Mahomed,Madhurima Maji,Minjae Lee,Alicia Parrish,Jessica Quaye,Vijay Janapa Reddi,Aishwarya Verma,Lora Aroyo
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Published at ACM Conference on FAccT 2026

Abstract:Despite the global deployment of text-to-image (T2I) models, their safety frameworks are largely calibrated to a Western-centric default, creating significant vulnerabilities for the rest of the world. To embrace cultural pluralism and bring historically under-represented perspectives in T2I safety, we conduct localised community-centered red teaming studies in the Global South. Our two-fold approach prioritizes localization and participation, by focusing on secondary urban centers in these regions, and conducting community engagement and training workshops to contextualize local norms. As a result, we present PLACES, a dataset comprising over 26,000 examples of T2I model failures collected in partnership with universities in Ghana, Nigeria, and two regions of India (Karnataka and Punjab). Analysis of prompts collected reveals a wide-ranging diversity in socio-cultural and linguistic attributes, when compared to existing geography-agnostic crowdsourced red-teaming data. We observe unique adversarial patterns enabled by local cultural and linguistic nuances, and distinct clusters within region around specific themes, such as religion in India. Moreover, we uncover structural contextual gaps in existing safety frameworks by identifying novel harms showing normative dissonance (e.g., violating religious norms, ignoring local customs, and ominous symbolism). This work argues that expanding T2I safety requires moving beyond mere scale to incorporate deeply localised, participatory methodologies for data collection and contextualization. Content warning: This paper includes examples containing potentially harmful or offensive content.

[HC-13] Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

Link: https://arxiv.org/abs/2605.19151
Authors: Changkun Ou
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

Abstract:We formalize trust calibration for agentic tool use (deciding when an automated agent’s proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates to the human exactly where the approval outcome is most uncertain. We show this is structurally an instance of Preferential Bayesian Optimization, inheriting its inference machinery (approximate Gaussian-process classification) and its sample-efficiency argument (uncertainty-targeted querying), while differing in objective: classifying an action space into allow/block/ask regions rather than optimizing a design.
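
The escalation rule described above can be sketched with the standard closed form for a probit likelihood under a Gaussian latent: if the latent risk-tolerance value at an action has posterior mean μ and variance σ², the predictive approval probability is Φ(μ / √(1 + σ²)). The sketch below assumes the posterior mean and variance are already available from some Gaussian-process inference step; the `margin` threshold and the function names are hypothetical, and the paper's actual allow/block/ask criterion may differ.

```python
import math


def probit_approval_probability(mean, variance):
    """Predictive P(approve) = Phi(mean / sqrt(1 + variance)) for a Gaussian latent."""
    z = mean / math.sqrt(1.0 + variance)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gateway_decision(mean, variance, margin=0.2):
    """Classify an action as allow / block / ask.

    `margin` is a hypothetical half-width around 0.5: inside it the approval
    outcome is treated as too uncertain and the human is asked.
    """
    p = probit_approval_probability(mean, variance)
    if abs(p - 0.5) < margin:
        return "ask"
    return "allow" if p > 0.5 else "block"


print(gateway_decision(3.0, 0.1))   # confident approval region
print(gateway_decision(-3.0, 0.1))  # confident denial region
print(gateway_decision(0.1, 4.0))   # high posterior variance near the boundary
```

A large posterior variance pulls the predictive probability toward 0.5, so poorly explored regions of the action space are routed to the human even when the posterior mean is mildly positive.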

[HC-14] GRASP: Deterministic argument ranking in interaction graphs

Link: https://arxiv.org/abs/2605.19141
Authors: Diganta Misra,Antonio Orvieto,Rediet Abebe,Volkan Cevher
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: Preprint

Abstract:Large language models are increasingly deployed as automated judges to evaluate the strength of arguments. As this role expands, their legitimacy depends on consistency, transparency, and the ability to separate argumentative structure from rhetorical appeal. However, we show that holistic judging - a common LLM-as-a-Judge practice where a model provides a global verdict on a debate - suffers from substantial inter-model disagreement. We argue that this instability arises from collapsing a debate’s complex interaction structure into a single opaque score. To address this, we propose GRASP (Gradual Ranking with Attacks and Support Propagation), a deterministic framework that aggregates stable local interaction judgments into a global ranking via a convergent attack–defense propagation operator. We show that local interaction judgments are more reproducible than holistic rankings in LLM-as-a-Judge evaluations, allowing GRASP to produce more consistent global rankings. We further show that GRASP scores do not correlate with human “convincingness” labels, highlighting a vital sociotechnical distinction: GRASP does not measure persuasion, factuality, or rhetorical appeal, but structural sufficiency - a defense-aware notion of argument robustness over the explicit interaction graph. Overall, GRASP offers a transparent and auditable alternative to holistic LLM judging.

[HC-15] Toward an AI-Powered Computational Testbed for Workforce Policy

Link: https://arxiv.org/abs/2605.19064
Authors: Sumer S. Vaid,Ashley V. Whillans
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Workforce transformations are difficult to forecast and costly to mismanage. In particular, the integration of artificial intelligence into knowledge work currently affects a substantial share of the global workforce, yet this transition proceeds without tools to forecast how individual employees will respond psychologically and behaviorally. We combine recent advances in LLM-powered generative agents with foundational management science and organizational behavior research to propose dynamic employee agents. Among consenting populations, these agents can be seeded with HR records, validated psychometric measures, and digital activity data to simulate employees’ cognitive, emotional, and behavioral trajectories across successive workdays during planned organizational changes. In this article, we detail the computational architecture required to construct this simulation platform and define the privacy, accuracy, and representativeness safeguards necessary for responsible deployment. We argue that establishing this prospective forecasting infrastructure is a critical technical requirement for managing the current global workforce realignment around AI.

[HC-16] Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs

Link: https://arxiv.org/abs/2605.19043
Authors: Jacob Levine,Miguel Aenlle,Craig Zilles,Matthew West,Mariana Silva
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: To be published in the International Conference on AI in Education (AIED), 2026

Abstract:Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, evaluating on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors – 87% in the best model – attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.

[HC-17] Guardrail Selection in Line Charts to Contextualize Persuasive Visualizations

Link: https://arxiv.org/abs/2605.19017
Authors: Khandaker Abrar Nadib,Marina Kogan,Alexander Lex,Maxim Lisnic
Categories: Human-Computer Interaction (cs.HC)

Abstract:Charts used for persuasion can easily veer into being outright misleading when, for instance, cherry-picked data is paired with a deceptive caption, as is commonly encountered on social media. The rise of interactive time-series data explorers for hotly debated topics makes such framing easy to produce and spread. Post-hoc interventions like fact-checking often arrive too late and suffer from persistence of belief. Prior work suggests that guardrails, in the form of contextual comparison lines embedded directly into charts, can reduce these effects. We propose and evaluate a practical set of guardrail sampling strategies for implementing such contextual lines in real systems. In a preregistered mixed-design study with two real-world scenarios (COVID-19 and Stocks), participants viewed persuasive charts with different sets of guardrails and reported trust, estimated rank in the dataset, expressed their perceived completeness of context, as well as subjective preference for different tasks. Across scenarios, guardrails improved trust, accuracy of performance judgments, and perceived completeness of context compared to the control. Taken together, the study offers practical guardrail sampling methods, evidence of their contextual benefits, and insights into participants’ preferences.

[HC-18] Balancing Teacher and Student Agency: Co-Orchestration Tool Design Supporting Real-Time Dynamic Pairing

Link: https://arxiv.org/abs/2605.18761
Authors: Kexin Bella Yang(1),Menghan Liu(2),Liyi Xu(1),Nikol Rummel(3),Vincent Aleven(1) ((1) Human-Computer Interaction Institute, Carnegie Mellon University, USA, (2) University of Washington, USA, (3) Institute of Educational Research, Ruhr-Universität Bochum, Germany)
Categories: Human-Computer Interaction (cs.HC)
Comments: Accepted at CSCW 2026; to appear in PACM HCI. Kexin Bella Yang and Menghan Liu contributed equally to this work

Abstract:In human-AI interaction, respecting user agency is essential for fostering trust and sustaining effective use of technology. In educational settings, dynamically integrating individual and collaborative learning offers pedagogical value by supporting personalized, self-paced learning experiences. Prior research has demonstrated the feasibility of this approach through intelligent tutoring systems and human-AI co-orchestration tools. However, how to balance teacher and student control in this process remains largely unexplored. This work explores the design space of how control can be distributed between teachers and students across the orchestration process, using participatory speed dating and a mixed-method analysis. We focus on three stages of the pairing process: before, during, and after, taking context in designing classroom orchestration tools that support teachers in dynamically coordinating student transitions between individual practice and collaborative problem-solving. It contributes empirical insights to the fields of educational technology and HCI by framing these findings within a theoretical design space, emphasizing the balance of multi-stakeholder agency and control. We propose design recommendations for achieving hybrid-control in analytic-based orchestration tools in pairing contexts. We recommend ensuring structured teacher guidance in the beginning, while progressively increasing student autonomy over time as activities unfold.

[HC-19] Interoceptive Divergence in Aesthetic Evaluation and Implications for Human-AI Alignment

Link: https://arxiv.org/abs/2605.18759
Authors: Yoshia Abe,Tatsuya Daikoku,Yasuo Kuniyoshi
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures. Supplementary material is included as a separate PDF in the source files

Abstract:Artificial intelligence (AI), exemplified by large language models (LLMs), is rapidly approaching and in some cases surpassing human performance across a wide range of cognitive tasks. However, human nature is not limited to intelligence alone; it also encompasses sensibility, including the capacity to perceive and experience beauty in visual scenes. This raises a fundamental question: how humans and AI systems converge or diverge in such aesthetic experiences. Aesthetic evaluation depends not only on objective properties of images but also on internal processes within the observer. As part of ongoing efforts in AI alignment, building upon prior human studies that have examined the relationship between beauty ratings, bodily sensations, and emotions, we adopt a comparable set of questionnaire items and present them to LLMs, enabling a direct comparison between human and AI responses. Our comparative analyses revealed that, while humans and AI exhibited broadly similar patterns in the correlations between beauty ratings and emotions, as well as in the image features they prioritized, notable divergences emerged in both the distribution of emotional responses and the relationship between beauty ratings and bodily sensations. These findings suggest that state-of-the-art LLMs, trained on large-scale textual data, can approximate average human tendencies in aesthetic evaluation to a certain extent. However, they also indicate limitations, particularly in relation to interoceptive aspects, which may reflect insufficient representation in training data or unintended consequences of alignment processes. These findings highlight key challenges for AI alignment and suggest important directions for developing AI systems with human-like aesthetic processing.

[HC-20] OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Link: https://arxiv.org/abs/2605.18758
Authors: Felix Henry,Xiaochen Lin,Jiangyou Zhu,Yangfan,Bingqian Zhang,Min Chen,Shiyu Huang
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: this https URL.

Computer Vision

[CV-0] PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars

Link: https://arxiv.org/abs/2605.20185
Authors: Julian Kaltheuner,Jan Spindler,Sina Kitz,Patrick Stotko,Reinhard Klein
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar’s representation space with the template’s deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.

[CV-1] MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Link: https://arxiv.org/abs/2605.20183
Authors: Yujie Wei,Yujin Han,Zhekai Chen,Yongming Li,Kaixun Jiang,Zhihang Liu,Quanhao Li,Zhiwu Qing,Xiang Wang,Zhen Xing,Ruihang Chu,Lingyi Hong,Yefei He,Junjie Zhou,Junqiu Yu,Yang Shi,Difan Zou,Kai Zhu,Shiwei Zhang,Yingya Zhang,Yu Liu,Xihui Liu,Hongming Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
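
For readers unfamiliar with the alignment figure quoted above, Spearman rank correlation is the Pearson correlation computed on ranks. The sketch below is a generic, dependency-free illustration of that statistic; the score lists are made-up toy values, not data from the benchmark, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: automatic scores vs. human scores for four hypothetical models.
automatic_scores = [1, 2, 3, 4]
human_scores = [1, 2, 4, 3]
rho = spearman(automatic_scores, human_scores)
print(round(rho, 3))
```

Because only the ordering matters, the statistic is insensitive to the scale of either scoring scheme, which is why it is a common choice for comparing automatic metrics with human judgments.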

[CV-2] Multi-axis Analysis of Image Manipulation Localization

Link: https://arxiv.org/abs/2605.20174
Authors: Keanu Nichols,Divya Appapogu,Giscard Biamby,Dina Bashkirova,Anna Rohrbach,Bryan A. Plummer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 5 figures, 5 tables

Abstract:Advanced image editing software enables easy creation of highly convincing image manipulations, which has been made even more accessible in recent years due to advances in generative AI. Manipulated images, while often harmless, could spread misinformation, create false narratives, and influence people’s opinions on important issues. Despite this growing threat, there is limited research on detecting advanced manipulations across different visual domains. Thus, we introduce Analysis Under Domain-shifts, qualIty, Type, and Size (AUDITS), a comprehensive benchmark designed for studying axes of analysis in image manipulation detection. AUDITS comprises over 530K images from two distinct sources (user and news photos). We curate our dataset to support analysis across multiple axes using recent diffusion-based inpaintings, spanning a diverse range of manipulation types and sizes. We conduct experiments under different types of domain shift to evaluate robustness of existing image manipulation detection methods. Our goal is to drive further research in this area by offering new insights that would help develop more reliable and generalizable image manipulation detection methods.

[CV-3] CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Link: https://arxiv.org/abs/2605.20165
Authors: Hsiang-Wei Huang,Junbin Lu,Kuang-Ming Chen,Jianxu Shangguan,Cheng-Yen Yang,Jenq-Neng Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and model available at this https URL

Abstract:Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at this https URL

[CV-4] Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

Link: https://arxiv.org/abs/2605.20159
Authors: Antonio Peña Corredor,Julien Lesseur,Romain Nunez,Paul Rivalland(SES),Thomas Philippe
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Non-destructive testing of aerospace SiC/SiC composites via X-ray computed tomography (XCT) relies on expert visual assessment, with current workflows offering limited traceability for accept/reject decisions. Deep convolutional networks can automate defect detection, yet their black-box nature conflicts with the transparency that industrial inspection practice demands. To close this gap, we introduce p-ResNet-50, a convolutional framework extended with a prototype layer that couples high detection accuracy with case-based explanations. Six learned prototypes are explicitly aligned with expert-defined semantic categories (healthy matrix, matrix–air interfaces, pores, line-like defects, and mixed morphologies) so that every classification is traceable to a physically meaningful reference. Two novel regularisation terms, anchor-based and medoid-based, tether prototypes to expert-selected patches and prevent prototype collapse, addressing a known limitation of prototype networks. Latent-space analysis via UMAP delineates semantically coherent sub-domains and maps zones of uncertainty where misclassifications concentrate, giving inspectors an explicit picture of where the model is, and is not, reliable. The framework is validated on an XCT patch dataset of approximately 12,000 patches extracted from four defect-rich SiC/SiC laboratory specimens. Taking a black-box ResNet-50 as a baseline (ROC-AUC = 0.991), the prototype extension achieves comparable performance (accuracy 0.957 vs. 0.959; ROC-AUC 0.994 vs. 0.993) while trading a slight reduction in sensitivity for higher precision and specificity. Each decision is backed by representative evidence patches, and the model explicitly flags its uncertainty regions. Beyond defect mapping, the framework establishes a reusable methodology for embedding domain-expert knowledge into prototype networks, applicable to other XCT inspection scenarios requiring traceable, auditable decisions.

[CV-5] TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization ICML2026

Link: https://arxiv.org/abs/2605.20150
Authors: Chonghao Zhong,Linfeng Shi,Hua Chen,Tiecheng Sun,Hao Zhao,Binhang Yuan,Chaojian Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
Comments: Accepted to ICML 2026 as Spotlight. Website: this https URL

Abstract:Training 3D Gaussian Splatting (3DGS) at billion-primitive scale is fundamentally memory-bound: each Gaussian primitive carries a large attribute vector, and the aggregate parameter table quickly exceeds GPU capacity, limiting prior systems to tens of millions of Gaussians on commodity single-GPU hardware. We observe that 3DGS training is inherently sparse and trajectory-conditioned: each iteration activates only the Gaussians visible from the current camera batch, so GPU memory can serve as a working-set cache rather than a persistent parameter store. Building on this insight, we introduce TideGS, an out-of-core training framework that manages parameters across an SSD-CPU-GPU hierarchy via three synergistic techniques: block-virtualized geometry for SSD-aligned spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations. Experiments show that TideGS enables training with over one billion Gaussians on a single 24 GB GPU while achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes, scaling beyond prior out-of-core baselines (e.g., approximately 100M Gaussians) and standard in-memory training (e.g., approximately 11M Gaussians).
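
The idea of transferring only working-set deltas between iterations can be illustrated with plain set arithmetic. This is a toy sketch, not the TideGS implementation: the block IDs, the evict-everything-not-visible policy, and the function names are hypothetical, and the real system's SSD-CPU-GPU tiers and asynchronous pipeline are not modeled.

```python
def working_set_delta(previous_blocks, current_blocks):
    """Return (blocks to load, blocks to evict) when the visible set changes."""
    to_load = current_blocks - previous_blocks
    to_evict = previous_blocks - current_blocks
    return to_load, to_evict


def count_transfers(visible_per_iteration):
    """Compare delta streaming against reloading the full visible set each step."""
    resident = set()      # blocks currently held in the (simulated) GPU cache
    delta_transfers = 0   # blocks moved when only deltas are streamed
    full_transfers = 0    # blocks moved if every visible block were re-sent
    for visible in visible_per_iteration:
        to_load, to_evict = working_set_delta(resident, visible)
        delta_transfers += len(to_load)
        full_transfers += len(visible)
        resident = (resident - to_evict) | to_load
    return delta_transfers, full_transfers


# Hypothetical camera trajectory: consecutive views share most of their blocks.
trajectory = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
delta_count, full_count = count_transfers(trajectory)
print(delta_count, full_count)
```

The saving grows with the overlap between consecutive camera batches, which is the trajectory-conditioned sparsity the abstract relies on.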

[CV-6] PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset

Link: https://arxiv.org/abs/2605.20147
Authors: Haojun Chen,Haoyang He,Chengming Xu,Qingdong He,Junwei Zhu,Yabiao Wang,Zhucun Xue,Xianfang Zeng,Zhennan Chen,Xiaobin Hu,Hao Zhao,Yong Liu,Jiangning Zhang,Dacheng Tao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page is available at this https URL

Abstract:Text-to-Image (T2I) models have recently seen notable progress around 1K and 2K resolution. With the extreme desire for better visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation poses great challenges due to the scarcity and complexity of high-resolution content. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline, which contains 95K images across diverse scenarios (each image has a minimum pixel-count of 100M) and seven-dimensional annotations. Based on our large-scale image-text dataset, we take a pioneering step to extend various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images encompassing visual quality and semantic alignment. Extensive experimental results on our benchmark and the constructive exploration of training strategies collaboratively provide valuable insights for future breakthroughs.

[CV-7] SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction

Link: https://arxiv.org/abs/2605.20110
Authors: Zhixiong Zhang,Yizhuo Li,Shuangrui Ding,Yuhang Zang,Shengyuan Ding,Long Xing,Yibin Wang,Qiaosheng Zhang,Jiaqi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Referring segmentation grounds natural-language queries to pixel-level masks, but extending it to complex scenarios with multiple instances, cross-category groups, or open-ended target sets remains challenging. Previous Large Vision Language Model (LVLM)-based methods represent referred targets with one or more special tokens sequentially, treating multiple targets as separate outputs rather than a coherent set and offering little incentive to capture set-level properties such as completeness and mutual exclusivity. We reformulate open-ended referring segmentation as explicit set-level concept prediction and propose Set-Concept Segmentation (SetCon), which uses LVLM-generated natural-language concepts, instead of segmentation-specific tokens, as semantic conditions for joint mask-set decoding. A hierarchical semantic decomposition first predicts a shared set-level concept defining the target scope and then refines it into fine-grained concept groups aligned with target subsets. To support this, a two-stage annotation pipeline augments existing reasoning segmentation datasets with hierarchical semantic supervision (236k samples, 784k concept phrases). SetCon achieves state-of-the-art results on image benchmarks (+3.3 gIoU on gRefCOCO, +12.1 gIoU on MUSE), with margins that grow as the number of referred targets increases. The concept interface also transfers to video under a detect-and-track setting, yielding new state-of-the-art results on seven referring video benchmarks, including +10.9 JF on MeViS and +12.4 JF on Ref-SeCVOS.

[CV-8] MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Link: https://arxiv.org/abs/2605.20090
Authors: Zhiping Yu,Chenyang Liu,Jinqi Cao,Qinzhe Yang,Siwei Yu,Zhengxia Zou,Zhenwei Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at this https URL.

[CV-9] Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Link: https://arxiv.org/abs/2605.20085
Authors: Yifan Li,Xinyu Zhou,Yunhao Ge,Yu Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting uses initial spatial prompts (such as bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish the new spatial prompting problem, SP-VTP, as a simple and scalable task condition for egocentric manipulation.

[CV-10] VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving ICRA

Link: https://arxiv.org/abs/2605.20082
Authors: Zhefan Xu,Ghassen Jerfel,Marina Haliem,Qi Zhao,Jeonhyung Kang,Khaled S. Refaat
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published in International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 6 figures, 4 tables

Abstract:The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model’s rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM’s trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
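
Reader's note: the abstract does not spell out the finetuning loss. For orientation, the commonly used textbook form of the DPO objective is sketched below. This is not necessarily the exact loss used in VL-DPO. All symbols are assumptions introduced here for illustration: $x$ is the scene context, $y_w$ and $y_l$ are the preferred and rejected trajectories in a VLM-generated pair, $\pi_{\mathrm{ref}}$ is the frozen pretrained model, $\pi_\theta$ is the model being finetuned, $\sigma$ is the logistic function, and $\beta$ is a temperature hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred trajectory relative to the rejected one, measured against the reference model, so the finetuned model is discouraged from drifting far from its pretrained behavior.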

[CV-11] Probability-Conserving Flow Guidance

Link: https://arxiv.org/abs/2605.20079
Authors: Parsa Esmati,Junha Hyung,Amirhossein Dadashzadeh,Jaegul Choo,Majid Mirmehdi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.
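
Reader's note: the "heuristic linear combination" that the abstract criticizes can be illustrated with the standard Classifier-Free Guidance rule. The sketch below shows only that baseline extrapolation, not AdaMaG, whose schedule and attenuation are not specified in the abstract. The function name `cfg_velocity` and the plain-list representation of velocities are assumptions made for illustration.

```python
def cfg_velocity(v_uncond, v_cond, scale):
    """Standard CFG: extrapolate from the unconditional prediction
    toward the conditional one by a guidance scale.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates beyond it.
    """
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


# Toy two-dimensional velocities (hypothetical values).
guided = cfg_velocity([0.0, 1.0], [1.0, 1.0], 3.0)
```

With a large scale, the guided velocity moves well outside the segment between the two predictions, which is the regime where the abstract argues that samples are driven off the learned manifold.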

[CV-12] X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

Link: https://arxiv.org/abs/2605.20073
Authors: E O Rodrigues,L O Rodrigues,J J Lima,D Casanova,F Favarim,E R Dosciatti,V Pegorini,L S N Oliveira,F F C Morais
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work proposes a pixel-classification approach for vessel segmentation in X-ray angiograms. The proposal uses textural features such as anisotropic diffusion, features based on the Hessian matrix, mathematical morphology and statistics. These features are extracted from the neighborhood of each pixel. The approach also uses the ELEMENT methodology, which consists of creating a pixel classification controlled by region growing, where the result of the classification affects further classifications of pixels. The Random Forests classifier is used to predict whether the pixel belongs to the vessel structure. The approach achieved the best accuracy in the literature (95.48%), outperforming unsupervised state-of-the-art approaches.

[CV-13] Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Link: https://arxiv.org/abs/2605.20064
Authors: Guilherme Santos da Silva,Dalcimar Casanova,Jefferson Tales Oliva,Erick Oliveira Rodrigues
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a conditional generative adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, even though the network was not originally tailored for this purpose. The two types of fat deposits of interest in this study are referred to as epicardial and mediastinal fats, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and an F1-score of 98.73 for the segmentation of the epicardial fat, and an accuracy of 97.90% and an F1-score of 98.40 for the mediastinal fat. These findings reflect the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of F1-score and run time, enabling the images to be segmented in real time.

[CV-14] OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

Link: https://arxiv.org/abs/2605.20044
Authors: Guiyu Liu,Niklas Vaara,Janne Mustaniemi,Juho Kannala,Janne Heikkilä
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $\sigma^*$ for object-mask rendering. The original opacity $\sigma$ remains responsible for visual reconstruction, while $\sigma^*$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.

[CV-15] Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Link: https://arxiv.org/abs/2605.20035
Authors: Zijie Xin,Jie Yang,Ruixiang Zhao,Tianyi Wang,Fengyun Rao,Jing Lyu,Xirong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code Link: this https URL

Abstract:Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

[CV-16] A Nash Equilibrium Framework For Training-Free Multimodal Step Verification ICLR2026

Link: https://arxiv.org/abs/2605.20033
Authors: Rohit Sinha,Kunal Tilaganji,Tanuja Ganu,Nagarajan Natarajan,Amit Sharma,Vineeth N. Balasubramanian
Categories: Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
Comments: ICLR 2026 Workshop VerifAI-2

Abstract:Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges’ interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

[CV-17] CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Link: https://arxiv.org/abs/2605.19995
Authors: Hongji Yang,Songlian Li,Yucheng Zhou,Xiaotong Zhao,Alan Zhao,Chengzhong Xu,Jianbing Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user’s creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM’s robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop “harness-like” architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carry genuine creative intent rather than simulated ones. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: this https URL

[CV-18] Minimalist Visual Inertial Odometry

Link: https://arxiv.org/abs/2605.19990
Authors: Francesco Pasti,Jeremy Klotz,Nicola Bellotto,Shree K. Nayar
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Visual-Inertial Odometry (VIO), which is critical to mobile robot navigation, uses cameras with a large number of pixels. Capturing and processing camera images requires significant resources. This work presents a minimalist approach to planar odometry, demonstrating that just four visual measurements and an IMU can provide robust motion estimation for differential-drive robots. Our key insight is that four downward-facing photodiodes that sense the world through optical Gabor masks produce signals that encode speed. Based on this, we jointly optimize the mask parameters alongside a Temporal Convolutional Network (TCN) using a physically-grounded simulator. The resulting model decodes speed from just the four measurements produced by the photodiodes. Pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory. We validate our approach with a prototype sensor mounted on a differential drive robot. Across diverse indoor and outdoor terrains, our system closely tracks the reference ground truth without any real-world fine-tuning. Our work shows that minimalist sensing enables efficient and accurate planar odometry.
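
Reader's note: the step "pairing these estimates with the angular speed from an IMU yields a continuous planar trajectory" is ordinary dead reckoning for a planar robot. The sketch below is a generic illustration under stated assumptions, not the authors' implementation: `integrate_planar`, its arguments, the fixed time step `dt`, and the midpoint-heading integration rule are all choices made here.

```python
import math


def integrate_planar(speeds, yaw_rates, dt, x=0.0, y=0.0, theta=0.0):
    """Integrate forward speed (m/s) and yaw rate (rad/s) into a planar path.

    Returns a list of (x, y, theta) poses, starting with the initial pose.
    """
    path = [(x, y, theta)]
    for v, w in zip(speeds, yaw_rates):
        # Use the heading at the middle of the step to reduce drift on curves.
        theta_mid = theta + 0.5 * w * dt
        x += v * dt * math.cos(theta_mid)
        y += v * dt * math.sin(theta_mid)
        theta += w * dt
        path.append((x, y, theta))
    return path


# Hypothetical inputs: 1 m/s for one second with no rotation.
straight = integrate_planar([1.0] * 10, [0.0] * 10, dt=0.1)
```

In this picture the decoded photodiode speed would play the role of `speeds` and the IMU gyroscope the role of `yaw_rates`. Errors in either input accumulate over time, as in any dead-reckoning scheme.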

[CV-19] Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

Link: https://arxiv.org/abs/2605.19986
Authors: He-Yang Xu,Pengyuan Zhang,Zongyuan Ge,Xiaoshuai Hao,Serge Belongie,Xin Geng,Yuxin Peng,Xiu-Shen Wei
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder’s ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: this https URL.

[CV-20] InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement IJCAI2026

Link: https://arxiv.org/abs/2605.19982
Authors: Ziqi Wang,Xu Zhang,Laibin Chang,Shi Chen,Jiaqi Ma,Huan Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2026. Code: this https URL

Abstract:Low-Light Image Enhancement (LLIE) has long been a challenging problem in low-level vision, as insufficient illumination often leads to low contrast, detail loss, and noise. Recent studies show that deep learning-based Retinex theory can effectively decouple illumination and reflectance. However, existing methods frequently suffer from over-enhancement or color distortion, and often assume uniform noise or ideal lighting. To address these limitations, we propose InterLight, a novel framework that systematically excavates and operationalizes intrinsic illumination priors for LLIE. Our core insight is that robust enhancement requires not just estimating illumination, but constructing an illumination-aware pipeline. We first inject sensor-level illumination-response priors via physics-guided augmentation, then represent the degradation through adaptive prompts conditioned on the scene’s latent illumination state. This explicit representation directly guides a luminance-gated intrinsic memory mechanism to selectively compensate for information loss, prioritizing reconstruction in dark regions while preserving fidelity in bright ones. Finally, the entire process is regularized by a self-supervised consistency objective that distills illumination-invariant features. By deeply exploiting intrinsic illumination priors, our method achieves clearer textures and more visually coherent enhancement results. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our approach. Code is available at: this https URL.

[CV-21] RECIPE: Procedural Planning via Grounding in Instructional Video

Link: https://arxiv.org/abs/2605.19976
Authors: Luigi Seminara,Antonino Furnari,Lorenzo Torresani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual planning asks a model to generate the remaining steps of a procedure in natural language given a partial video context and a goal. Progress on this task is bottlenecked by annotation: clean labeled datasets are small, domain-narrow, and encode a single execution trajectory per example, even though many valid orderings exist. Large-scale instructional video corpora offer orders of magnitude more procedural content, but supervised fine-tuning on pseudo-labels from their noisy ASR narrations propagates segmentation and alignment errors and stays single-trajectory. We identify a key asymmetry: extracting clean step labels from noisy video is hard, but verifying whether a generated step sequence is temporally grounded in ASR transcripts is cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry in RECIPE, which uses grounding quality as a reward for GRPO, turning the noisy corpus into a verifier rather than a label source. The framework applies uniformly to two planner input configurations (Socratic, with a textual history extracted by a frozen VLM, and Video, consuming video tokens directly) and to annotated and weakly supervised regimes. We evaluate on 7 procedural benchmarks using a reference-based LLM-as-judge protocol scoring plans across 6 procedural criteria. RECIPE-RL improves over the base checkpoint at all scales (0.5B, 3B, 7B) and every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. It outperforms supervised fine-tuning on both annotated and pseudo-labeled plans (the latter degrades the base) and remains robust without human annotations. Used as the proposal stage of a prior propose-assess-search planner, it improves over the strongest zero-shot baseline at every horizon on Visual Planning for Assistance, and on COIN it preserves the generation diversity that SFT collapses.

[CV-22] SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion ICML2026

Link: https://arxiv.org/abs/2605.19974
Authors: Antoine Schnepf,Karim Kassab,Flavian Vasile,Andrew Comport
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026. Project page available at this https URL

Abstract:The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over long-range spatial extents and (ii) cover the complete omnidirectional field of view ($360^\circ$ horizontally and $180^\circ$ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D outdoor environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.

[CV-23] World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Link: https://arxiv.org/abs/2605.19957
Authors: Zuyao Lin,Jianhui Zhang,Peidong Jia,Xiaoguang Zhao,Shanghang Zhang,Xingyu Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to a degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

[CV-24] Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models ICML2026

Link: https://arxiv.org/abs/2605.19956
Authors: Jia-Wei Hai,Yijun Wang,Xiu-Shen Wei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2026. Project page: this https URL. Code: this https URL

Abstract:Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have shown that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions that survive adversarial attacks. We then use these regions to guide the spatially varying augmentation intensities and the multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at this https URL.

[CV-25] AffectVerse: Emotional World Models for Multimodal Affective Computing

Link: https://arxiv.org/abs/2605.19950
Authors: Bo Zhao,Fanghua Ye,Yixin Ji,Sicheng Zhao,Xiaojiang Peng,Zitong YU
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: (1) Cross-Modal Temporal Imagination, which predicts future video/audio representations from past tokens with multi-step rollout; (2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation, which compresses imagined tokens into modality-aware belief tokens; and (3) Belief Injection, which inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

[CV-26] Feed-Forward Gaussian Splatting from Sparse Aerial Views

Link: https://arxiv.org/abs/2605.19949
Authors: Dongli Wu,Zhuoxiao Li,Tongyan Hua,Yinrui Ren,Xiaobao Wei,Rongjun Qin,Wufan Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

[CV-27] StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

Link: https://arxiv.org/abs/2605.19931
Authors: Reza M. Asiyabi,Juan Alberto Molina-Valero,The SEOSAW Partnership,Steven Hancock,Casey M. Ryan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

Abstract:Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model’s own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
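
The AIPW pseudo-outcome named in the abstract has a standard textbook form, m(x) + R / pi(x) * (y - m(x)). The NumPy sketch below illustrates only that formula. The function and argument names are hypothetical and not taken from the paper, and the stop-gradients the authors describe are omitted because NumPy has no automatic differentiation.

```python
import numpy as np

def aipw_pseudo_outcome(y, observed, propensity, imputed, eps=1e-6):
    """Textbook AIPW pseudo-outcome: m(x) + R / pi(x) * (y - m(x)).

    All names are illustrative assumptions, not the paper's API.
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=float)      # R: 1 if a label exists, else 0
    imputed = np.asarray(imputed, dtype=float)        # m(x): imputation baseline
    # Clip pi(x) away from zero so the inverse weight stays bounded.
    propensity = np.clip(np.asarray(propensity, dtype=float), eps, 1.0)
    # Missing labels may be NaN; force their residual to zero.
    residual = np.where(observed > 0, y - imputed, 0.0)
    return imputed + observed / propensity * residual

pseudo = aipw_pseudo_outcome(
    y=[10.0, np.nan], observed=[1, 0], propensity=[0.5, 0.2], imputed=[8.0, 6.0]
)
```

For an observed sample the residual is up-weighted by the inverse propensity; for an unobserved one the pseudo-outcome falls back to the imputed value.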

[CV-28] Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Link: https://arxiv.org/abs/2605.19929
Authors: Yi Zhong,Haotong Qin,Xindong Zhang,Lei Zhang,Guolei Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs’ accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at this https URL

[CV-29] GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

Link: https://arxiv.org/abs/2605.19890
Authors: Shyma Alhuwaider,Yasmeen Alsaedy,Merey Ramazanova,Silvio Giancola,Bernard Ghanem
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.
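
The abstract describes memory policies that combine class-balanced allocation with feature-space diversity but does not spell out the algorithm. The sketch below is one plausible reading, not the authors' GOTTA implementation: an equal per-class quota followed by greedy farthest-point selection within each class. The function name, the seeding rule, and the use of Euclidean distance are all assumptions.

```python
import numpy as np

def select_diverse_buffer(features, labels, capacity):
    """Return up to `capacity` sample indices (hypothetical policy).

    Each class gets an equal quota; inside a class, samples are picked
    greedily to maximize distance to those already picked.
    Assumes feature vectors within a class are distinct.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    quota = max(1, capacity // len(classes))
    chosen = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        class_feats = features[idx]
        # Seed with the sample closest to the class mean.
        center = class_feats.mean(axis=0)
        picked = [idx[np.argmin(np.linalg.norm(class_feats - center, axis=1))]]
        while len(picked) < min(quota, len(idx)):
            # Distance from each class sample to its nearest picked sample.
            dists = np.linalg.norm(
                class_feats[:, None, :] - features[picked][None, :, :], axis=2
            ).min(axis=1)
            picked.append(idx[np.argmax(dists)])
        chosen.extend(int(i) for i in picked)
    return chosen[:capacity]

feats = [[0, 0], [0.1, 0], [5, 0], [0, 1], [0, 1.1], [0, 9]]
labs = [0, 0, 0, 1, 1, 1]
buffer_idx = select_diverse_buffer(feats, labs, capacity=4)
```

In this toy stream the near-duplicate samples 0 and 1 (and 3 and 4) do not both enter the buffer, which is the redundancy a diversity-aware policy is meant to avoid.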

[CV-30] GLUT: 3D Gaussian Lookup Table for Continuous Color Transformation

Link: https://arxiv.org/abs/2605.19889
Authors: Danna Xue,David Serrano-Lozano,Shaolin Su,Javier Vazquez-Corral
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:3D Lookup Tables (3D LUTs) are widely used for color mapping, but their grid-based representation requires discretizing the RGB space, leading to a capacity-memory trade-off that becomes prohibitive when storing large numbers of LUTs. Recent approaches adopt implicit neural representations to improve scalability, yet their black-box nature limits interpretability and hinders intuitive, localized editing. In this paper, we propose Gaussian LUT (GLUT), a continuous and explicit color representation that models color transformations using a set of learnable 3D Gaussian primitives. By avoiding fixed-resolution grids, GLUT achieves flexible representational capacity while maintaining a compact memory footprint. Its explicit, spatially localized formulation further enables both accurate modeling and interpretability. Building on this representation, we introduce a compact conditional generator (CGLUT) that predicts GLUT parameters for multiple LUT instances, encoding diverse color styles in a single framework to enable smooth and controllable LUT style blending. Moreover, GLUT supports efficient, user-friendly editing by allowing localized adjustments to specific color regions without global retraining. Experimental results demonstrate that our approach outperforms prior neural LUT representations in both accuracy and efficiency, while offering improved interpretability and interactive control.

[CV-31] Structural Energy Guidance for View-Consistent Text-to-3D Generation

Link: https://arxiv.org/abs/2605.19876
Authors: Qing Zhang,Jinguang Tong,Jing Zhang,Jie Hong,Xuesong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2508.16917

Abstract:Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

[CV-32] Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Link: https://arxiv.org/abs/2605.19869
Authors: Ananth Sriram,Neel Mokaria,Rajveer Singh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 figures. First place, this http URL Spatial Intelligence Hackathon, University of Maryland, February 2026. Code available at this https URL

Abstract:Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

[CV-33] WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

Link: https://arxiv.org/abs/2605.19868
Authors: Muhammad Ashad Kabir,Rabin Dulal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Abstract:Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing progression. While deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and inadequately model heterogeneous tissue composition due to high intra-class variability and limited annotated data. Multi-class wound tissue segmentation, therefore, remains a challenging and clinically relevant problem. We propose WoundFormer, a transformer-based framework that enhances hierarchical spatial feature fusion for multi-class wound tissue segmentation. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and a second benchmark (DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9%, outperforming strong CNN- and transformer-based baselines by up to 4.3 Dice points on the WoundTissueSeg benchmark, with consistent improvements across minority tissue classes. These results indicate that explicit modeling of hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.
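
For readers unfamiliar with the metric, the Dice score for a class is 2|P ∩ T| / (|P| + |T|), where P and T are the predicted and ground-truth pixel sets. The sketch below computes it per class on label maps. How the paper aggregates per-class values into its "overall Dice" is not stated in the abstract, so the plain mean used here is an assumption, as are the function and variable names.

```python
import numpy as np

def dice_per_class(pred, target, num_classes):
    """Per-class Dice on integer label maps; NaN where a class is absent from both."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    scores = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()
        if denom == 0:
            scores.append(np.nan)  # class appears in neither map
        else:
            scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return np.array(scores)

pred = [[0, 1], [1, 2]]
target = [[0, 1], [2, 2]]
scores = dice_per_class(pred, target, num_classes=3)
mean_dice = np.nanmean(scores)  # assumed aggregation: mean over present classes
```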

[CV-34] Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

Link: https://arxiv.org/abs/2605.19866
Authors: Peter El Hachem,Ahmed Nassar,A. Said Gurbuz,Christoph Auer,Peter W. J. Staar
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 7 figures. Main text: 9 pages (4 figures); Appendix: 9 pages (3 figures)

Abstract:Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser’s native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder’s generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from 0.37 to 0.92; on the Chinese subset of OmniDocBench, table TEDS rises from 0.01 to 0.36; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost 15% wall-clock latency and a median of 74 prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.

[CV-35] Landscape-Awareness for Geometric View Diffusion Model CVPR2026

Link: https://arxiv.org/abs/2605.19865
Authors: Yan-Ting Chen,Hao-Wei Chen,Tsu-Ching Hsiao,Chun-Yi Lee
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2026

Abstract:Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with MSE loss. However, existing methods often suffer from nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample-efficiency.

[CV-36] Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models

Link: https://arxiv.org/abs/2605.19859
Authors: Hengfei Wang,Anshul Gupta,Pierre Vuillecard,Jean-Marc Odobez
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Vision-language models (VLMs) have rapidly evolved into general-purpose multimodal reasoners with strong zero-shot generalization. In this context, VLMs could greatly benefit the analysis of human gaze and attention, a central task in human behavior understanding that requires reasoning about the physical scene as well as the activity, interactions, and social context. However, the extent to which VLMs can reliably understand human gaze and related attentional behaviors remains largely unexplored. In this work, we present EyeVLM, a systematic evaluation framework for gaze understanding in VLMs across two complementary dimensions: tasks and models. To assess gaze understanding capabilities, we focus on two core tasks. The first, gaze following, i.e., predicting the 2D location where a person is looking, has a geometric and visual processing focus, requiring a precise understanding of the human face, attention direction, 3D scene structure, and spatial grounding of attended targets. The second, social gaze prediction, requires social and relational reasoning over multi-person interactions (e.g., mutual gaze and shared attention), and may benefit more from the LLM semantic reasoning capabilities within VLMs. Regarding models, EyeVLM evaluates these tasks in two ways: a zero-shot setting with a diverse set of state-of-the-art open- and closed-source VLMs, exploring different prompting strategies; and a fine-tuning approach based on task-specific QA pairs, studying the impact of model scale and data scale. As benchmarks, we rely on existing gaze understanding datasets and perform a systematic comparison with state-of-the-art purely visual models. Overall, our results show that current VLMs lack precise gaze understanding capabilities. While standard training helps reduce the gap with visual models, significant improvements are still needed.

[CV-37] A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

Link: https://arxiv.org/abs/2605.19855
Authors: Giacomo Astolfi,Matteo Bianchi,Riccardo Campi,Antonio De Santis,Marco Brambilla
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: G. Astolfi, M. Bianchi, and R. Campi contributed equally

Abstract:Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at this https URL.

[CV-38] When Preference Labels Fall Short: Aligning Diffusion Models from Real Data ICML2026

Link: https://arxiv.org/abs/2605.19839
Authors: Weiyan Chen,Weijian Deng,Yao Xiao,Weijie Tu,ZiYi Dong,Ibrahim Radwan,Liang Lin,Pengxu Wei
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2026 Camera Ready; Project Page: this https URL

Abstract:Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at this https URL.

[CV-39] LaCoVL-FER: Landmark-Guided Contrastive Learning Network with Vision-Language Enhancement for Facial Expression Recognition

Link: https://arxiv.org/abs/2605.19821
Authors: Jiaxin Wang,Muwei Jian,Hui Yu,Junyu Dong,Yifan Xia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial Expression Recognition (FER) in the wild is still challenging due to uncontrolled variations in pose, occlusion, and illumination. Most existing attention-based methods primarily rely on visual appearance cues, suffering from attention redundancy and instability, which limits their performance in complex scenarios. To address these issues, we propose a novel landmark-guided contrastive learning network with vision-language enhancement for FER (LaCoVL-FER), which integrates geometric priors from facial landmarks and semantic priors from a vision-language model. Specifically, a Landmark-Guided Adaptive Encoder (LGAE) is designed to introduce geometric priors through a Bi-branch Gated Cross Attention (BGCA) mechanism, which achieves adaptive fusion of landmark-based geometric and visual appearance features to produce expression-relevant features, thereby focusing on key facial regions and suppressing noise interference. In parallel, a Vision-Language Enhancement Strategy (VLES) is presented to leverage the expression-relevant features to refine the generalizable visual features extracted by the frozen pretrained CLIP image encoder, yielding expression-specific visual representations. Based on these representations, an Expression-Conditioned Prompting (ECP) mechanism is utilized to further adapt the textual features of fixed class-level prompts from the frozen pretrained CLIP text encoder, generating more instance-aware textual representations. These visual-textual representations are aligned as semantic priors to enhance the robustness and generalization of FER. Quantitative and qualitative experiments demonstrate that our LaCoVL-FER outperforms state-of-the-art methods on three representative real-world FER datasets, including RAF-DB, FERPlus, and AffectNet. The code is available at this https URL.

[CV-40] Stitched Value Model for Diffusion Alignment

Link: https://arxiv.org/abs/2605.19804
Authors: Hyojun Go,Hyungjin Chung,Prune Truong,Goutam Bhat,Li Mi,Zhaochong An,Zixiang Zhao,Dominik Narnhofer,Serge Belongie,Federico Tombari,Konrad Schindler
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.

[CV-41] Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

Link: https://arxiv.org/abs/2605.19799
Authors: Tonghao Zhuang(1),Shanglong Hu(1),Yongsheng Luo(1),Zhiqi Zhang(1),Yu Li(1) ((1) Zhuhai College of Science and Technology, Zhuhai, China)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

Abstract:We present a semi-supervised framework for joint segmentation and classification of fetal cardiac ultrasound images. Built upon the EchoCare multi-task backbone, our method integrates SAM-Med2D for boundary refinement and leverages DINOv3 to enhance pseudo-label quality. We introduce view-specific hard masking along with a two-stage optimization strategy: an EMA phase to consolidate segmentation capabilities, followed by a Classification Fine-Tuning phase that freezes segmentation parameters and resets the classification head to recover classification performance without compromising segmentation gains. Evaluated on the FETUS 2026 leaderboard, our method achieves a Dice Similarity Coefficient at 79.99%, Normalized Surface Distance at 61.62%, and F1-score at 41.20%, validating the effectiveness of our approach for prenatal congenital heart disease screening. Source code is publicly available at: this https URL.

[CV-42] Depth2Pose: A Pose-Based Benchmark for Monocular Depth Estimation without Ground-Truth Depth

Link: https://arxiv.org/abs/2605.19797
Authors: Viktor Kocur,Sithu Aung,Gabrielle Flood,Yaqing Ding,Lukas Bujnak,Torsten Sattler,Zuzana Kukelova
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Monocular depth estimation has improved significantly in recent years, driven by increasingly powerful models and large-scale training data. Predicted depth is increasingly used as an input signal for downstream tasks such as Structure-from-Motion (SfM), visual localization, and SLAM. However, monocular depth estimators (MDEs) are still primarily evaluated in terms of depth accuracy. Standard metrics aggregate errors globally and may not reflect the usefulness of depth for downstream geometric tasks. We therefore propose Depth2Pose, a framework for evaluating MDEs in the context of downstream tasks. By combining depth predictions with feature correspondences in depth-aware geometric solvers, we use relative camera pose estimation accuracy as a task-driven proxy for depth quality. Traditional benchmarks require dense ground truth in the form of per-pixel depth, which is expensive to obtain. In contrast, our formulation requires only camera poses, which can be estimated efficiently, e.g., using Structure-from-Motion pipelines. As a result, our framework can be applied to scenes where ground-truth depth is difficult to obtain, for example due to large scene scale or heavy occlusions (e.g., vegetated environments). Leveraging this, we introduce the D2P dataset, which contains challenging scenes outside the distribution of commonly used training data. We show that methods performing well under standard depth error metrics on existing benchmarks also perform well under our pose-based metric when evaluated on the same datasets, but do not necessarily generalize to our more challenging dataset. Finally, we provide a simple and extensible evaluation framework. The dataset and code are available at this http URL.
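
The abstract uses relative camera pose accuracy as a proxy for depth quality but does not define the error measure. The sketch below shows the angular rotation and translation-direction errors commonly used for relative pose evaluation. Whether Depth2Pose uses exactly these is an assumption, and the function names are illustrative.

```python
import numpy as np

def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_gt^T @ R_est."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_angle_error_deg(t_est, t_gt):
    """Angle (degrees) between translation directions; scale is ignored."""
    cos = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy check: estimate is off by a 10 degree rotation about the z axis.
theta = np.radians(10.0)
R_gt = np.eye(3)
R_est = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_angle_error_deg(np.array([1.0, 0.0, 0.0]),
                                        np.array([0.0, 2.0, 0.0]))
```

Both measures need only ground-truth poses, which matches the abstract's point that per-pixel depth is not required.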

[CV-43] Mechanisms of Object Localization in Vision-Language Models CVPR2026

Link: https://arxiv.org/abs/2605.19792
Authors: Timothy Schaumlöffel,Martina G. Vilas,Gemma Roig
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2026

Abstract:Visually-grounded language models (VLMs) are highly effective in linking visual and textual information, yet they often struggle with basic classification and localization tasks. While classification mechanisms have been studied more extensively, the processes that support object localization remain poorly understood. In this work, we investigate two representative families, LLaVA-1.5 and InternVL-3.5, using a suite of mechanistic interpretability tools, including token ablations, attention knockout, and causal mediation analysis. We find that localization is driven by a containerization mechanism in which object-aligned tokens define the spatial extent of the object, while the semantic arrangement of tokens within those boundaries is largely irrelevant to the predicted box. Only a very small set of attention heads mediates the causal effect for both classification and localization, concentrating in early-mid layers for LLaVA and mid-late layers for InternVL. The two tasks share some early processing but ultimately depend on largely distinct specialized heads. Overall, we provide the first layer- and head-level account of localization in VLMs, revealing narrow computational pathways that can guide future model design and grounding objectives.

[CV-44] Fast 4D Mesh Generation by Spatio-Temporal Attention Chains FAST4

Link: https://arxiv.org/abs/2605.19786
Authors: Dvir Samuel,Yuval Atzmon,Gal Chechik,Yoni Kasten
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL

Abstract:4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call Spatio-Temporal Attention Chain which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details and thereby improving dynamic mesh geometry and temporal consistency. Compared to state-of-the-art, our method generates a 4D mesh in 9 seconds, achieving a 13× speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16× longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.

[CV-45] Preferences Order Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

链接: https://arxiv.org/abs/2605.19776
作者: Yuanpei Zhao,Jie Lin,Chao Zhang,Yilin Wang,Mao Li,Chenhui Li,Jie Hou,Tangjie Lv
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 7 pages

点击查看摘要

Abstract:Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
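The preference-to-score idea in this abstract can be illustrated with a plain Elo update. The following is a minimal sketch, not the PSDistill procedure itself: the function names, the K-factor, the number of passes, and the 1 to 10 output scale are all assumptions made for illustration.

```python
def elo_scores(items, comparisons, k=16.0, base=1000.0, epochs=20):
    """Turn pairwise wins into Elo ratings.

    comparisons is a list of (winner, loser) pairs; k, base and epochs
    are illustrative defaults, not values taken from the paper.
    """
    ratings = {item: base for item in items}
    for _ in range(epochs):
        for winner, loser in comparisons:
            # Win probability of the winner under the Elo logistic model.
            expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
            delta = k * (1.0 - expected)
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


def rescale(ratings, low=1.0, high=10.0):
    """Linearly map ratings onto an absolute score range (assumed 1 to 10)."""
    lo, hi = min(ratings.values()), max(ratings.values())
    span = (hi - lo) or 1.0
    return {name: low + (high - low) * (value - lo) / span
            for name, value in ratings.items()}


if __name__ == "__main__":
    judged = [("a", "b"), ("b", "c"), ("a", "c")]
    print(rescale(elo_scores(["a", "b", "c"], judged)))
```

The pairwise judgments fix the ordering, while the rescaling step stands in for the anchoring role that the abstract attributes to pointwise ratings.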

[CV-46] Beyond Imitation: Learning Safe End-to-End Autonomous Driving from Hard Negatives

链接: https://arxiv.org/abs/2605.19771
作者: Junli Wang,Zhihua Hua,Xueyi Liu,Zebin Xing,Haochen Tian,Kun Ma,Hangjun Ye,Guang Chen,Long Chen,Qichao Zhang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing imitation learning methods for end-to-end autonomous driving predominantly learn from successful demonstrations by minimizing geometric deviations from expert trajectories. This paradigm implicitly assumes that spatial proximity implies behavioral safety, leading to a critical objective mismatch: trajectories with nearly identical imitation losses may exhibit drastically different safety outcomes, where one remains recoverable while the other results in collision. To address this limitation, we propose BeyondDrive, a failure-aware imitation learning framework that jointly learns from successful and failed driving behaviors. First, we introduce a flow matching-based negative trajectory generator that synthesizes safety-critical yet expert-proximate trajectories, enabling explicit modeling of safety asymmetry. Second, we develop a diversity-aware sampling strategy that mitigates mode collapse and improves coverage of diverse failure modes during negative trajectory generation. Third, we propose a Repulsive Distance Loss that simultaneously attracts predictions toward expert demonstrations while repelling them from hard negative trajectories, thereby establishing discriminative safety boundaries in trajectory space. Applied to the uni-modal baseline Latent TransFuser, BeyondDrive achieves 89.7 PDMS on the NAVSIMv1 closed-loop benchmark, outperforming prior state-of-the-art methods. Moreover, BeyondDrive generalizes effectively across different autonomous driving architectures, including multi-modal planners, and further demonstrates strong zero-shot transferability on the HUGSIM benchmark.

[CV-47] CPC-VAR: Continual Personalized and Compositional Generation in Visual Autoregressive Models

链接: https://arxiv.org/abs/2605.19750
作者: Junhao Li,Xinhao Zhong,Yi Sun,Yuxia Qiao,Bin Chen,Shu-Tao Xia,Yaowei Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.

[CV-48] Real-World On-Vehicle Evaluation of Embedding-Based Anomaly Detection CVPR2026

链接: https://arxiv.org/abs/2605.19744
作者: Albert Schotschneider,Daniel Bogdoll,Svetlana Pavlitska,Ahmed Abouelazm,Johann Marius Zoellner
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop AUTOPILOT-NA

点击查看摘要

Abstract:Detecting anomalies in traffic scenes is crucial for ensuring safety in autonomous driving, yet collecting representative anomalous data remains challenging. Existing anomaly detection methods are highly specialized and rely on normality as defined by the abstract semantic Cityscapes classes, making it difficult to adapt to diverse real-world scenarios. We propose an adaptable real-time anomaly detection method that leverages foundation models in the form of pretrained vision transformer embeddings to detect deviations via nearest-neighbor similarity in the latent semantic feature space. Based on patch-wise processing, the algorithm produces dense anomaly masks, allowing for the localization of detected anomalies. The method robustly models normality through a single reference image. This formulation avoids explicit supervision and dataset-specific training, making it suitable for real-world deployment. We evaluate the method on standard benchmarks and on an automated vehicle in real-world scenarios. Despite its simplicity, the method achieves good performance on the Road Anomaly benchmark and demonstrates consistent qualitative behavior in practice, successfully highlighting semantically unusual objects in diverse scenes. These results suggest that simple, reference-based methods can provide useful anomaly signals under realistic operating conditions.
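The nearest-neighbor scoring described above can be sketched in a few lines. This is an illustrative reading only: it assumes patch embeddings have already been extracted by some pretrained vision transformer, and the function names and the 0.5 threshold are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anomaly_map(query_patches, reference_patches):
    """Score each query patch by its distance to the closest reference patch.

    query_patches is a 2D grid (list of rows) of embedding vectors;
    reference_patches is a flat list of embeddings from the single
    reference image that defines normality.
    """
    return [[1.0 - max(cosine(q, r) for r in reference_patches) for q in row]
            for row in query_patches]


def anomaly_mask(scores, threshold=0.5):
    """Binarize the dense score map (the threshold is an assumed value)."""
    return [[s > threshold for s in row] for row in scores]
```

Because every patch is scored independently, the output has the same grid layout as the input, which is what makes localization of the anomaly possible.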

[CV-49] FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

链接: https://arxiv.org/abs/2605.19739
作者: Yi Sun,Zhiqi Zhang,Xinhao Zhong,Yimin Zhou,Shuoyang Sun,Bin Chen,Shu-Tao Xia,Ke Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in flow matching models have significantly improved text-to-image generation quality, but also introduce growing safety risks due to the generation of harmful or undesirable content. Existing concept erasure methods are either inference-time interventions with limited effectiveness or rely on supervised fine-tuning (SFT), which requires precisely aligned data and struggles with scalability and multi-concept settings. In this paper, we propose FlowErase-RL, the first GRPO-based framework for concept erasure in flow matching models. We reformulate concept erasure as a reward optimization problem and introduce a dynamic dual-path reward mechanism that jointly optimizes (i) a Concept Erasure (CE) reward to suppress target concepts and (ii) a Non-target Space (NS) reward to preserve generative fidelity. The two reward paths are adaptively balanced during training via a performance-driven switching strategy, enabling stable optimization without explicit supervision. Extensive experiments on nudity, object, and artistic style erasure demonstrate that our method achieves state-of-the-art erasure performance while maintaining strong image quality and semantic alignment. Moreover, it exhibits robust resistance to adversarial attacks and scales effectively to multi-concept scenarios. Our results establish a new paradigm for safe and controllable generation in flow matching models.
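For readers unfamiliar with GRPO, its core step is normalizing rewards within a group of samples drawn for the same prompt. The sketch below shows only that step. The fixed blend of the two rewards is a placeholder assumption: the abstract says the paths are balanced by a performance-driven switching strategy, whose details are not given here.

```python
from statistics import mean, pstdev


def combined_reward(ce_reward, ns_reward, ce_weight=0.5):
    """Blend the erasure and fidelity rewards.

    Hypothetical fixed weighting, used only so the example runs; it is
    not the adaptive switching strategy described in the abstract.
    """
    return ce_weight * ce_reward + (1.0 - ce_weight) * ns_reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score of each reward in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four images sampled for one prompt, each with (CE, NS) rewards.
    group = [(0.9, 0.4), (0.2, 0.8), (0.7, 0.7), (0.1, 0.3)]
    rewards = [combined_reward(ce, ns) for ce, ns in group]
    print(group_relative_advantages(rewards))
```

Samples that beat their group average receive positive advantages, so no separate value network or aligned supervision is required.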

[CV-50] Decentralized Direct Volume Rendering: A Browser-Native GPU Architecture for MRI Digital Twins in Resource-Constrained Settings

链接: https://arxiv.org/abs/2605.19737
作者: Oserebameh Augustine Beckley
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures. Live interactive browser demo available at: this https URL . Source code repository: this https URL

点击查看摘要

Abstract:Digital Twin (DT) technology holds immense potential for surgical planning and personalized medicine. However, generating interactive, patient-specific anatomical twins currently relies on computationally heavy Server-Side Rendering (SSR) or expensive local workstations, creating significant barriers to deployment, especially in resource-constrained settings (RCS). This paper presents a decentralized, client-side WebGPU architecture that democratizes access to high-fidelity anatomical Digital Twins. By bypassing standard server-side rendering pipelines, the framework executes deterministic single-pass raymarching and morphological gradient calculations directly on low-cost integrated edge GPUs. Eliminating the network latency inherent to cloud-rendered solutions, the system achieves a Time to First Pixel (TTFP) of under 920.0 ms and maintains stable interactivity at = 82.0 FPS. Continuous Interaction Fidelity is maintained via uniform buffers, enabling zero-latency manipulation of tissue parameters for dynamic clinical decision-making. By proving that complex 3D medical simulations of patient-specific MRI scans can be executed natively in the browser without deep learning or external computational dependencies, this architecture provides a scalable, affordable foundation for the widespread clinical adoption of healthcare Digital Twins.

[CV-51] GeoMamba: A Geometry-driven MambaVision Framework and Dataset for Fine-grained Optical-SAR Object Retrieval

链接: https://arxiv.org/abs/2605.19734
作者: Tiantong Fang,Xiuwei Wang,Jing Xiao,Wujie Zhou,Liang Liao,Mi Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-source remote sensing enables complementary observation of ground objects, while cross-modal fine-grained object retrieval remains challenging, especially under unaligned optical and SAR conditions. Unlike conventional retrieval settings that rely on paired or spatially aligned samples, practical optical-SAR retrieval is affected by substantial modality discrepancy, speckle noise, and structural inconsistency, which limit robust cross-modal representation learning. To address this problem, we propose GeoMamba, a geometry-driven framework tailored for optical-SAR fine-grained retrieval. Specifically, GeoMamba introduces a Geometric Feature Injection (GFI) module that enhances cross-modal feature interaction and incorporates structural priors, thereby improving the robustness of SAR representations and promoting geometry-consistent feature learning. In addition, a Geometric Consistency Constraint (GCC) module, together with a Deep Supervision (DS) strategy, imposes hierarchical geometric constraints using classical operators, which helps preserve informative object structures during representation learning. We further construct a new dataset, FGOS-as, containing 11 aerospace and maritime categories for evaluating unaligned cross-modal fine-grained object retrieval in realistic remote sensing scenarios. Extensive experiments on FGOS-as demonstrate that GeoMamba outperforms existing methods, achieving 63.3% mAP and 77.0% Rank-1 accuracy in all-to-all retrieval setting.

[CV-52] LIFT and PLACE: A Simple Stable and Effective Knowledge Distillation Framework for Lightweight Diffusion Models CVPR2026

链接: https://arxiv.org/abs/2605.19729
作者: Hyunsoo Han,Sangyeop Yeo,Jaejun Yoo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 11 figures, 9 tables, to appear in CVPR 2026

点击查看摘要

Abstract:We demonstrate that in knowledge distillation for diffusion models, the teacher network’s highly complex denoising process, stemming from its substantially larger capacity, poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTting-based distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a “coarse” alignment and a “fine” refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE are effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extend to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

[CV-53] Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

链接: https://arxiv.org/abs/2605.19728
作者: Abdul Mohaimen Al Radi,Kunyang Li,Yuzhang Shang,Mubarak Shah,Yu Tian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video–IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

[CV-54] Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

链接: https://arxiv.org/abs/2605.19727
作者: Zebin He,Mingxin Yang,Shuhui Yang,Hanxiao Sun,Xintong Han,Chunchao Guo,Wenhan Luo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing 3D foundation models typically align point clouds to frozen vision-language spaces like CLIP, which achieve strong cross-modal retrieval by compressing 3D shape into a global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To solve this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone and a pretrained 3D VAE to encode images into 2D patches and point clouds into 3D tokens. These are mapped into a single shared space to achieve both local pixel-to-point alignment and global semantic alignment. To stabilize the joint learning of dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show our model successfully achieves object-level pixel-to-point alignment while maintaining competitive global retrieval, a joint capability not offered by existing 3D foundation models. By establishing a fine-grained alignment feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

[CV-55] Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention CVPR2026

链接: https://arxiv.org/abs/2605.19726
作者: Wenhu Zhang,Yiming Wu,Huanyu Wang,Yaoyang Liu,Huanzhang Dou,Senqiao Yang,Sitong Wu,Hanbin Zhao,Jiaya Jia
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Findings paper

点击查看摘要

Abstract:Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.

[CV-56] Physics-in-the-Loop: A Hybrid Agentic Architecture for Validated CAD Engineering Design ECAI2026 IJCAI

链接: https://arxiv.org/abs/2605.19717
作者: Elias Berger,Muhammad Usama,Jan Mehlstäubl,Bernhard Saske,Kristin Paetzold-Byhain
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IJCAI-ECAI 2026 (Special Track on AI4Tech)

点击查看摘要

Abstract:Large Language Models (LLMs) can generate Computer-Aided Design (CAD), yet lack physical comprehension required for reliable engineering design. Instead of attempting to implicitly learn physical laws from data, we propose a Hybrid Agentic-Physical Architecture that embeds validated knowledge-based engineering tools directly into the decision making loop of autonomous AI agents. In this framework, engineering design is formulated as a closed-loop, sequential decision making process guided by explicit physical verification. Based on a load case, dedicated agents iteratively plan, generate, evaluate, and revise engineering designs using knowledge-based tools as a feedback signal. We introduce a benchmark dataset and metrics for assessing functional validity in generative CAD. Our system generates more complex and physically verified designs, with a 4.2 increase in structural complexity and improving compile rate by 3.5% compared to similar agentic methods. The codebase, prompts and dataset will be made publicly available to support reproducibility and future research.

[CV-57] Physics-informed simulation framework for realistic sonar image generation and statistical validation

链接: https://arxiv.org/abs/2605.19712
作者: Kamal Basha S,Athira Nambiar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic sonar datasets offer a scalable alternative to costly real-world acquisition, yet their utility remains limited by the absence of rigorous quantitative validation. We present ACOUSIM (ACOustic SIMulation and Validation Platform), a physics-informed framework that evaluates the statistical alignment between synthetic and real sonar imagery without relying on generative models. A Gazebo-based environment generates sonar-like images by explicitly controlling seabed texture, illumination-driven shadowing, platform altitude, and noise. Realism is quantified against two public sonar datasets, SeabedObjects-KLSG-II and Sonar Common Target Detection (SCTD), using global intensity and local texture (LBP) distributions assessed via Kullback-Leibler divergence, Jensen-Shannon divergence, and Earth Mover’s Distance. Results show strong texture alignment (KL 0.07) across all classes, with plane-class intensity alignment outperforming ship-class due to shadow geometry complexity. ACOUSIM establishes a reproducible, distribution-level baseline for sim-to-real sonar evaluation and directly supports reliable dataset validation for underwater image analysis.
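The three distribution-level measures named in this abstract are straightforward to compute on histograms. The sketch below is an assumption-laden illustration: it takes two histograms over the same bins (for example intensity or LBP codes), smooths them with a small epsilon chosen here for convenience, and uses the closed form for 1D Earth Mover's Distance with unit bin spacing. It is not the ACOUSIM pipeline.

```python
import math


def normalize(hist, eps=1e-12):
    """Convert raw bin counts to a probability vector with light smoothing."""
    total = sum(hist)
    return [(h + eps) / (total + eps * len(hist)) for h in hist]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def emd_1d(p, q):
    """Earth Mover's Distance for 1D histograms on a shared unit-spaced grid.

    In one dimension this equals the L1 distance between the two CDFs.
    """
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total
```

KL is asymmetric and sensitive to empty bins, JS is symmetric and bounded, and EMD accounts for how far mass must move between bins, which is why reporting all three gives a fuller picture than any one alone.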

[CV-58] WBCAtt+: Fine-Grained Pixel-Level Morphological Annotations for White Blood Cell Images

链接: https://arxiv.org/abs/2605.19692
作者: Satoshi Tsutsui,Winnie Pang,Shuting He,Bihan Wen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Medical Image Analysis. arXiv admin note: substantial text overlap with arXiv:2306.13531

点击查看摘要

Abstract:The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories, and lack detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel-level cell components. With 113k image-level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at this https URL.

[CV-59] DocQT: Improving Document Forgery Localization Robustness via Diverse JPEG Quantization Tables

链接: https://arxiv.org/abs/2605.19688
作者: Kylian Ronfleux-Corail(L3I),Guillaume Bernard(L3I),Mickaël Coustaty(L3I),Nicolas Sidère(L3I)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document manipulation localization models achieve strong performance on public benchmarks yet fail to generalize to operational document workflows. We identify a critical and overlooked source of this gap: the mismatch between the narrow distribution of JPEG quantization tables used during training (restricted to standard libjpeg quality factors) and the heterogeneous compression profiles encountered in real-world insurance document pipelines. To isolate this factor, we conduct a controlled factorial study comparing two architectures with contrasting levels of quantization table awareness (FFDN [2] and Mesorch [20]), each trained under either standard quality factor augmentation (Standard-QT) or operationally calibrated quantization tables sampled from DocQT, a quantization-table bank derived from a MAIF operational image corpus (Real-QT), and evaluated under three recompression conditions. Training under Real-QT yields substantial localization gains on DocTamper [15] and significantly reduces the pixel-level false positive rate on authentic operational documents, but only for architectures that explicitly ingest the quantization table as input. The released DocQT quantization-table dataset and compression-reproduction material are directly available at this https URL. These results demonstrate that standard quality factor augmentation does not adequately proxy operational compression diversity, and that architectural choices explicitly conditioning on the quantization table provide a meaningful robustness advantage for real-world deployment.
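To see why quality-factor augmentation covers only a narrow set of tables, it helps to recall how libjpeg derives a quantization table from a quality factor: every table is a scaled copy of one base table. The sketch below reproduces that scaling for the standard luminance table from Annex K of the JPEG specification. The function name is an assumption for illustration and nothing here comes from the DocQT release.

```python
# Standard JPEG luminance quantization table (ITU-T T.81, Annex K), row-major.
BASE_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def libjpeg_table(quality, base=BASE_LUMA):
    """Scale a base table the way libjpeg does for a given quality factor."""
    quality = max(1, min(100, quality))
    # Quality below 50 inflates the table, above 50 shrinks it.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Integer rounding, then clamp to the baseline-JPEG range 1..255.
    return [max(1, min(255, (value * scale + 50) // 100)) for value in base]


if __name__ == "__main__":
    print(libjpeg_table(75)[:8])
```

Since the quality factor is the only free parameter, the 100 tables produced this way all lie on a single one-dimensional family, whereas tables written by cameras, scanners, or other encoders need not follow it.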

[CV-60] Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images CVPR2026

链接: https://arxiv.org/abs/2605.19656
作者: Matias Turkulainen,Akshay Krishnan,Filippo Aleotti,Mohamed Sayed,Guillermo Garcia-Hernando,Juho Kannala,Arno Solin,Gabriel Brostow,Daniyar Turmukhambetov
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to CVPR 2026. 8 figures, 3 tables. Project page: this https URL

点击查看摘要

Abstract:We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground level AND by satellite. Faithful reconstructions require good camera coverage, but ground imagery is time-consuming and hard to capture at scale for large outdoor scenes. Fortunately, satellite imagery can provide a global geometric prior that is easy to access via public APIs. Cross-View Splatter fuses orthorectified satellite views with GPS-tagged ground photos to predict Gaussian splats in a unified 3D coordinate frame. By aligning ground and bird’s-eye feature representations, our model improves scene coverage and novel-view synthesis, compared to ground imagery alone. We train on curated georeferenced datasets and paired satellite-terrain data, mined from open mapping services. We evaluate our method on a new benchmark for novel-view synthesis with georeferenced imagery allowing comparison to prior state-of-the-art methods. Our code and data preparation will be available at this https URL.

[CV-61] CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

链接: https://arxiv.org/abs/2605.19649
作者: Antoine Legrand,Renaud Detry,Christophe De Vleeschouwer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: (under review)

点击查看摘要

Abstract:Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically-consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied on large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

[CV-62] Benchmarking and Evolving Reason-Reflect-Rectify for Reflective Visual Generation

链接: https://arxiv.org/abs/2605.19639
作者: Junjie Wang,Xinghua Lou,Jason Li,Ye Tian,Keyu Chen,Yulin Li,Bin Kang,Jacky Mai,Yanwei Li,Zhuotao Tian,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-Image (T2I) models and Unified Multimodal Models (UMMs) have achieved remarkable progress in visual generation. However, their reliance on a single-pass generation paradigm limits their ability to handle complex prompts requiring iterative refinement. To enable multi-round Reflective Visual Generation (RVG), we formalize the Reason-Reflect-Rectify (R^3) loop as a core framework and introduce R^3-Bench, a benchmark of over 600 expert-annotated instances that quantifies iterative reasoning and rectification capabilities. Evaluation on R^3-Bench reveals a critical gap: while state-of-the-art models can identify generation errors, they fail to generate actionable rectification instructions. To bridge this gap, we propose R^3-Refiner, a dual-stage framework leveraging Group Relative Policy Optimization (GRPO) and a Hierarchical Reward Mechanism (HRM) to better align rectification with reflective reasoning. Experiments show that R^3-Refiner achieves significant improvements on R^3-Bench (+12.0% in Reflective Verdict Score, +9.0% in Rectification Score), and can be seamlessly integrated with various MLLMs to enhance the generation quality of different T2I models on GenEval++ and T2I-CompBench. Code is available at this https URL.

[CV-63] P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

链接: https://arxiv.org/abs/2605.19634
作者: Kai Sheng,Liuyi Wang,Haojie Dai,Jinlong Li,Yongrui Qin,Zongtao He,Chengju Liu,Qijun Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

[CV-64] HEAT: Heterogeneous End-to-End Autonomous Driving via Trajectory-Guided World Models

Link: https://arxiv.org/abs/2605.19631
Authors: Hoonhee Cho, Giwon Lee, Jae-Young Kang, Hyemin Yang, Heejun Park, Kuk-Jin Yoon
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:End-to-end autonomous driving has emerged as a compelling alternative to traditional modular pipelines by directly mapping raw sensor data to driving actions. While recent approaches achieve strong performance on single-domain datasets, their performance degrades significantly when trained jointly across multiple heterogeneous domains. In practice, however, autonomous systems must operate across diverse environments with heterogeneous distributions, including different cities, sensor configurations, and traffic patterns, without domain-specific retraining. This gap highlights a key challenge in multi-domain learning: domain-specific variations across heterogeneous domains introduce conflicting learning signals, driving models toward compromised solutions that are suboptimal across domains. To address this, we propose a trajectory-driven learning paradigm that organizes training around planning trajectories, enabling the model to capture domain-invariant representations of driving intent. Furthermore, we incorporate a world model that predicts future latent features conditioned on ego actions, improving feature consistency and mitigating domain-induced biases. We evaluate our approach on three benchmarks, nuScenes, NAVSIM, and the Waymo end-to-end dataset, and show substantial improvements over existing methods across all domains. Our results demonstrate that a single unified model can be trained on heterogeneous datasets while maintaining strong performance within each domain, highlighting a step toward scalable real-world deployment. We will make our code publicly available.

[CV-65] Component-Aware Structure-Preserving Style Transfer for Satellite Sim2Real 6D Pose Estimation

Link: https://arxiv.org/abs/2605.19624
Authors: Yonglong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Monocular 6D pose estimation for non-cooperative satellites depends heavily on annotated training data, yet real satellite images with reliable pose labels and component-level masks are difficult to acquire at scale. Synthetic rendering can provide exact geometric annotations, but the appearance gap between rendered and real observations limits direct transfer to the real domain. This paper presents a component-aware structure-preserving style transfer framework for satellite synthetic-to-real data construction. The method builds weakly paired real–synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve satellite Sim2Real pose estimation in the considered calibrated setup while retaining simulation-derived geometric annotations.

[CV-66] PrAda: Few-Shot Visual Adaptation for Text-Prompted Segmentation CVPR2026

Link: https://arxiv.org/abs/2605.19623
Authors: Gabriele Rosi, Fabio Cermelli, Carlo Masone, Barbara Caputo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Findings. Code: this https URL

Abstract:Segmenting images is critical for visual understanding but demands extensive pixel-level annotations. Foundational models have enabled new paradigms for predicting new classes guided by textual prompts, without annotations from the target domain. Yet, on specialized target domains, far from the original pre-training, their performance degrades. We study the errors of existing methods under such domain-shift, finding that misclassification rather than mask generation is the main culprit. To address this, we introduce the novel problem of Few-Shot Visual Adaptation for text-prompted Segmentation. This kind of adaptation has been largely studied for image classification, but it remains unexplored for segmentation. We tackle this task with Prototype Adaptation (PrAda), a novel, parameter-efficient method that adapts a frozen text-prompted segmentation model. Our approach learns class-specific prototypes by combining fine-grained pixel features and high-level transformer representations, which are then fused with the original text-based predictions through a learned importance factor. This preserves the model’s zero-shot potential while enabling strong adaptation to new domains. Experiments across semantic, instance, and panoptic segmentation on five benchmarks demonstrate that PrAda yields significant improvements over state-of-the-art and proposed baselines.

[CV-67] UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register CVPR2026

Link: https://arxiv.org/abs/2605.19622
Authors: Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026

Abstract:Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.

[CV-68] Bézier Degradation Modeling for LiDAR-based Human Motion Capture CVPR2026

Link: https://arxiv.org/abs/2605.19620
Authors: Xiaoqi An, Lin Zhao, Jun Li, Chen Gong, Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:LiDAR-based 3D human motion capture has broad applications in fields such as autonomous driving and robotics, where accurate motion reconstruction is crucial. However, existing methods often struggle with unstable inputs and severe occlusions, leading to jittery or even failed pose predictions. To address these challenges, we propose BMLiCap, a coarse-to-fine framework that models motion using temporally compressible Bézier curves. By reducing control points through a trajectory-preserving strategy, we obtain a coherent and learning-friendly motion representation. To reconstruct human actions from LiDAR point-cloud cues, we design a progressive motion-reconstruction module. Specifically, a Time-scale Motion Transformer (TMT) is introduced to predict motion curves at multiple temporal scales, and a Multi-level Motion Aggregator (MMA) is utilized to adaptively fuse the multi-scale curves to recover detailed, temporally coherent poses, effectively bridging observation gaps caused by occlusions and noise. Across four mainstream benchmarks LiDARHuman26M, FreeMotion, NoiseMotion, and SLOPER4D, BMLiCap achieves state-of-the-art accuracy and temporal continuity in complex scenes, demonstrating its ability to compensate for severe occlusions and reduce prediction jitter.
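
For readers unfamiliar with the representation mentioned in the abstract, a Bézier curve turns a handful of control points into a smooth trajectory. The sketch below evaluates one with de Casteljau's algorithm. It is a generic illustration and not BMLiCap's code: the function name `bezier_point` and the sample control points are assumptions made for this example.

```python
from typing import List, Sequence


def bezier_point(control_points: Sequence[Sequence[float]], t: float) -> List[float]:
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    points = [list(p) for p in control_points]
    # Repeatedly interpolate between neighbouring points until one remains.
    while len(points) > 1:
        points = [
            [(1.0 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(points, points[1:])
        ]
    return points[0]


# Hypothetical 3D joint trajectory described by four control points.
controls = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [3.0, 2.0, 1.0], [4.0, 0.0, 1.0]]

# Eleven evenly spaced samples along the curve (t = 0.0, 0.1, ..., 1.0).
trajectory = [bezier_point(controls, i / 10) for i in range(11)]
```

Four control points stand in for eleven sampled poses here, which is the sense in which such a curve is a compressed motion representation.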

[CV-69] White-Balance First, Adjust Later: Cross-Camera Color Constancy via Vision-Language Evaluation CVPR2026

Link: https://arxiv.org/abs/2605.19613
Authors: Shuwei Li, Lei Tan, Robby T. Tan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: In CVPR 2026

Abstract:Color constancy aims to keep object colors consistent under varying illumination. Cross-camera generalization in color constancy remains challenging because learning-based models often overfit to the color response characteristics of the training camera, resulting in degraded performance on images captured by other cameras. We propose VLM-CC, a feedback-guided framework that formulates color constancy as an iterative refinement process. Instead of directly estimating the illuminant from raw input, VLM-CC performs iterative correction driven by vision-language model (VLM)-based evaluation. At each iteration, the image is white-balanced using the current estimate and converted to pseudo-sRGB. A lightweight LoRA-tuned VLM then assesses the corrected image, identifying the dominant residual color cast and providing qualitative feedback. This feedback is mapped to a residual illumination direction (red, green, or blue) and used to update the illuminant estimate until convergence. Our key idea is to reframe color constancy as an iterative perceptual feedback problem, leveraging VLM evaluation instead of direct RGB regression. By replacing direct RGB estimation with VLM-guided perceptual feedback, VLM-CC achieves state-of-the-art robustness in cross-camera color constancy across multiple datasets. Code will be available at this https URL.
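
The feedback loop described in the abstract (white-balance with the current estimate, ask an evaluator for the dominant residual cast, update, repeat) can be sketched roughly as follows. This is a toy under stated assumptions, not VLM-CC itself: `toy_evaluator` is a hand-written stand-in for the LoRA-tuned VLM, the fixed additive `step` is an invented update rule, and the pseudo-sRGB conversion is omitted.

```python
from typing import Callable, List, Optional, Sequence

CHANNELS = ("red", "green", "blue")


def white_balance(rgb: Sequence[float], illuminant: Sequence[float]) -> List[float]:
    """Divide each channel by the current illuminant estimate."""
    return [value / gain for value, gain in zip(rgb, illuminant)]


def refine_illuminant(
    rgb: Sequence[float],
    evaluate: Callable[[Sequence[float]], Optional[str]],
    step: float = 0.05,      # assumed fixed step size, not from the paper
    max_iters: int = 100,
) -> List[float]:
    """Nudge the illuminant estimate along the channel the evaluator flags."""
    illuminant = [1.0, 1.0, 1.0]
    for _ in range(max_iters):
        cast = evaluate(white_balance(rgb, illuminant))
        if cast is None:  # evaluator reports no residual cast, so stop
            break
        illuminant[CHANNELS.index(cast)] += step
    return illuminant


def toy_evaluator(rgb: Sequence[float], tolerance: float = 0.05) -> Optional[str]:
    """Stand-in for the VLM: names the strongest channel of a grey patch."""
    if max(rgb) - min(rgb) <= tolerance:
        return None
    return CHANNELS[list(rgb).index(max(rgb))]


grey_patch = [1.3, 1.0, 0.8]  # hypothetical neutral surface under a warm light
estimate = refine_illuminant(grey_patch, toy_evaluator)
corrected = white_balance(grey_patch, estimate)
```

The point of the sketch is that the evaluator only ever returns a qualitative direction, never an RGB value, yet the loop still settles on an estimate whose channel ratios approximate those of the true illuminant.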

[CV-70] Inverse Design of Metasurface based Absorbers using Physics Guided Conditional Diffusion Models

Link: https://arxiv.org/abs/2605.19611
Authors: Vineetha Joy, Jamshed Palai, Satwik Sahoo, Anshuman Kumar, Amit Sethi, Hema Singh
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:

Abstract:Inverse design of metasurfaces for specific electromagnetic responses requires generating geometries that satisfy stringent spectral constraints while maintaining manufacturability. Conventional design methodologies rely on iterative optimization routines using full wave simulations, which become extremely time consuming and computationally intensive for large design spaces. In addition, commonly employed generative approaches often exhibit limited conditional fidelity and the generated designs often contain fine or irregular features that are impractical to fabricate. In this regard, we propose a physics guided condition quality enhanced diffusion framework for the inverse design of metasurface based absorbers. Here, the conditioning information consisting of target reflection characteristics is integrated into the model using feature wise linear modulation (FiLM). Furthermore, to enforce adherence to target spectra, a pre trained surrogate EM simulator is embedded into the framework introducing physics aware regularization through spectrum level loss functions. The efficiency of the proposed model is demonstrated by generating practically realizable metasurfaces for different types of reflection characteristics in the frequency range of 2 to 18 GHz. The proposed framework achieves an average spectral mean squared error of 0.0006 and band alignment accuracy of 0.958 between the target spectra and the spectra produced by the generated designs, demonstrating high conditional accuracy. In addition, the model generates multiple geometries for the same condition, thereby providing diverse design alternatives to the engineer. The proposed model produces the suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.
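
Feature-wise linear modulation (FiLM), which the abstract names as the conditioning mechanism, scales and shifts each feature channel by values predicted from the condition. A minimal sketch, assuming plain linear layers for the scale and shift predictors; the layer weights, the three-point "spectrum", and all names below are made up for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Linear = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def apply_linear(layer: Linear, x: Sequence[float]) -> List[float]:
    """Compute W @ x + b for a small dense layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def film(
    features: Sequence[Sequence[float]],
    condition: Sequence[float],
    gamma_layer: Linear,
    beta_layer: Linear,
) -> List[List[float]]:
    """Per-channel affine modulation: gamma(condition) * feature + beta(condition)."""
    gamma = apply_linear(gamma_layer, condition)
    beta = apply_linear(beta_layer, condition)
    return [[g * v + b for v in channel] for channel, g, b in zip(features, gamma, beta)]


# Two feature channels modulated by a hypothetical 3-point target spectrum.
features = [[1.0, 2.0], [3.0, 4.0]]
target_spectrum = [0.5, 0.25, 0.25]
gamma_layer = ([[2.0, 0.0, 0.0], [0.0, 8.0, 0.0]], [0.0, 0.0])
beta_layer = ([[0.0, 0.0, 4.0], [0.0, 0.0, 0.0]], [0.0, -1.0])
modulated = film(features, target_spectrum, gamma_layer, beta_layer)
```

Every value in a channel shares one scale and one shift, so the condition steers the features without changing their spatial layout.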

[CV-71] Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution KDD2026

Link: https://arxiv.org/abs/2605.19607
Authors: Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

Abstract:Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at this https URL.
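
One way to picture the coarse-to-fine path described above is to decompose the baseline-to-input difference with an SVD and add its rank-1 components back one at a time, largest singular value first. The sketch below does this for a single-channel image and accumulates gradients along the resulting piecewise-linear path. It is an illustration under assumptions, not the authors' implementation: the discrete one-component-per-segment schedule, the midpoint rule, the toy model `f(x) = sum(x**2)`, and the function names are all choices made here.

```python
import numpy as np


def spectral_path(baseline: np.ndarray, image: np.ndarray) -> list:
    """Waypoints that add singular components of (image - baseline), largest first."""
    u, s, vh = np.linalg.svd(image - baseline, full_matrices=False)
    current = baseline.astype(float).copy()
    waypoints = [current.copy()]
    for i in range(len(s)):  # numpy returns singular values in descending order
        current = current + s[i] * np.outer(u[:, i], vh[i])
        waypoints.append(current.copy())
    return waypoints


def path_attribution(waypoints: list, grad_fn, steps_per_segment: int = 8) -> np.ndarray:
    """Midpoint-rule estimate of the path integral of the gradient."""
    attribution = np.zeros_like(waypoints[0])
    for start, end in zip(waypoints, waypoints[1:]):
        delta = (end - start) / steps_per_segment
        for k in range(steps_per_segment):
            midpoint = start + (k + 0.5) * delta
            attribution += grad_fn(midpoint) * delta
    return attribution


rng = np.random.default_rng(0)
image = rng.random((6, 5))          # hypothetical single-channel input
baseline = np.zeros_like(image)
path = spectral_path(baseline, image)

# Toy model f(x) = sum(x**2), whose gradient is 2x.
attr = path_attribution(path, grad_fn=lambda x: 2.0 * x)
```

Because the waypoints end exactly at the input, the attributions still sum to the change in model output between baseline and input, as they do for straight-line IG; only the order in which structure is introduced differs.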

[CV-72] deadtrees.earth-aerial: A Multi-Resolution Aerial Image Dataset for Tree Cover and Mortality Detection

Link: https://arxiv.org/abs/2605.19605
Authors: Ayushi Sharma, Clemens Mosig, Lukas Drees, Salim Soltani, Janusch Vajna-Jehle, Aaron Sheppard, Belqis Ahmadi, Jonathan Schmid, Paul Neumeier, Nathan Jacobs, Jan Dirk Wegner, Teja Kattenborn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Under review. All rights reserved

Abstract:Forests worldwide are increasingly threatened by climate change and disturbances such as fire, pests, and pathogens, creating an urgent need for scalable monitoring of tree cover and tree mortality. Aerial imagery from drones and aircraft is a key data source for detailed and large-scale mapping of tree crowns and mortality. However, related progress is limited by the lack of globally representative, harmonized datasets for joint segmentation of tree cover and mortality. We introduce two novel, open, machine-learning-ready datasets to enable joint segmentation of tree cover and tree mortality from centimeter-scale aerial imagery for the first time at global scales. With DTE-aerial-train, we provide a training dataset comprising 385K image patches of size 1024x1024 pixels, with resolutions ranging from 2.5 to 20 cm. It includes multi-class expert-annotated and -audited pseudo-labels for tree cover and mortality. With DTE-aerial-bench, we provide a geographically balanced benchmark test set of 25 globally distributed orthoimages totaling 525 patches with high-quality expert annotations for both tree cover and mortality. Both the training and benchmark datasets span tropical, temperate, boreal, and dryland biomes and cover a wide range of forest structures and mortality patterns. Using the benchmark test set for evaluation, we establish strong reference baselines that improve mortality segmentation across all biomes and scales with significant gains in challenging regions, such as boreal forests, where the F1 score increases from 0.40 to 0.58 with around 45% relative improvement. All data, models, and code will be publicly released under permissive open-source licenses. An interactive visualization of the benchmark dataset is available at this http URL.

[CV-73] A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

Link: https://arxiv.org/abs/2605.19595
Authors: João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

[CV-74] Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Link: https://arxiv.org/abs/2605.19578
Authors: Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 9 figures

Abstract:RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P³AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P³AR-NTU, 114K videos) and real-world collected (P³AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at this https URL.

[CV-75] EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Link: https://arxiv.org/abs/2605.19559
Authors: Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: this https URL.

[CV-76] EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

Link: https://arxiv.org/abs/2605.19556
Authors: Prateeth Rao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

Abstract:Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.

[CV-77] Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

Link: https://arxiv.org/abs/2605.19554
Authors: Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generations featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains that the new images align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.

[CV-78] AnchorFlow: Editable SVG Reconstruction via Sparse Anchor Point Fields

Link: https://arxiv.org/abs/2605.19551
Authors: Mengnan Jiang, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image-to-SVG reconstruction aims to produce vector graphics that are faithful to raster inputs and easy to edit. Existing methods face a structural trade-off in how vector structure is parameterized, including how many paths represent an image and how many anchor points define each path. High-fidelity methods often rely on many paths or densely parameterized curves, whereas overly compact SVG generation may deviate from the input geometry. This issue becomes more pronounced when local raster evidence is imperfect, where boundary-following reconstruction can introduce redundant anchors and fragmented structures. We argue that this trade-off should be addressed at the level of anchor placement, since anchors on Bezier curves define local path structure and strongly affect both accuracy and editability. We propose AnchorFlow, an editable SVG reconstruction framework that models path-level anchor placement with sparse anchor point fields. Given path-like foreground components extracted from a raster image, AnchorFlow predicts an image-conditioned sparse anchor field for each component and resolves it into an ordered Bezier path. Rendering-guided feedback then corrects local structural errors before re-resolution. The recovered paths are then assembled and optimized into the final SVG. Experiments on isolated paths and full images show that AnchorFlow achieves a favorable fidelity-editability trade-off, substantially reducing editable complexity while preserving competitive raster fidelity.

[CV-79] Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R ICML2026

Link: https://arxiv.org/abs/2605.19539
Authors: Zihao Zhu, Wenyuan Zhao, Nuo Chen, Chao Tian, Zhiwen Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICML 2026. 10 pages main paper, with appendix

Abstract:Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R’s built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at this https URL.

[CV-80] CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

Link: https://arxiv.org/abs/2605.19538
Authors: Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 12 figures

Abstract:CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

[CV-81] Replacement Learning: Training Neural Networks with Fewer Parameters

Link: https://arxiv.org/abs/2605.19533
Authors: Yuming Zhang, Peizhe Wang, Tianyang Han, Hengyu Shi, Junhao Su, Dongzhi Guan, Jiabin Liu, Jiaji Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages

Abstract:End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

[CV-82] Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

Link: https://arxiv.org/abs/2605.19532
Authors: Yunzhe Zhang,Hongfu Liu,Pengyu Hong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint

Abstract:Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this “seed effect” and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
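The rank-and-keep step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate seed already has a scalar score derived from early-step cross-attention to the core tokens; the function name `select_top_k_seeds` and the score values are hypothetical, and the scoring itself (the substance of ABSS) is not reproduced here.

```python
from typing import Dict, List


def select_top_k_seeds(scores: Dict[int, float], k: int) -> List[int]:
    """Return the k seeds with the highest score, best first.

    No fixed accept/reject threshold is used: seeds are only compared
    against each other, which matches the ranking idea in the abstract.
    """
    if k <= 0:
        return []
    ranked = sorted(scores, key=lambda seed: scores[seed], reverse=True)
    return ranked[:k]


# Hypothetical early-step attention scores for five candidate seeds.
example_scores = {11: 0.42, 23: 0.77, 35: 0.15, 47: 0.63, 59: 0.58}
selected = select_top_k_seeds(example_scores, k=2)
print(selected)  # [23, 47]
```

Only the selected seeds would then be run through the full denoising schedule, which is where the compute saving would come from.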

[CV-83] Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

Link: https://arxiv.org/abs/2605.19528
Authors: Xueying Jiang,Wenhao Li,Quanhao Qian,Deli Zhao,Shijian Lu,Gongjie Zhang,Ran Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
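The back-projection formula quoted in the abstract can be worked through numerically. The sketch below is only an illustration of that one equation, not of the paper's framework: the pixel coordinate, principal point, depth, and focal lengths are made-up values, and for simplicity only the focal length is rescaled.

```python
def back_project_x(u_c: float, c_x: float, f_x: float, z_bar: float) -> float:
    """Pinhole back-projection of a pixel column to a metric X coordinate.

    Implements X = (u_c - c_x) * Z / f_x, where u_c is the pixel column,
    c_x the principal point, f_x the focal length in pixels, and z_bar
    the metric depth.
    """
    if f_x == 0:
        raise ValueError("focal length f_x must be non-zero")
    return (u_c - c_x) * z_bar / f_x


# Hypothetical values: the same pixel and depth under three focal lengths.
u_c, c_x, z_bar = 800.0, 640.0, 10.0
x_base = back_project_x(u_c, c_x, f_x=1000.0, z_bar=z_bar)  # reference camera
x_half = back_project_x(u_c, c_x, f_x=500.0, z_bar=z_bar)   # 0.5x focal length
x_wide = back_project_x(u_c, c_x, f_x=1500.0, z_bar=z_bar)  # 1.5x focal length
print(x_base, x_half, x_wide)
```

The same pixel maps to a different lateral position for each focal length, which is the camera intrinsic ambiguity the abstract refers to: a model that ignores $f_x$ cannot tell these three cases apart.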

[CV-84] Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

Link: https://arxiv.org/abs/2605.19527
Authors: Zhangjian Ji,Shaotong Qiao,Kai Feng,Wei Wei
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models focus only on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Building on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which uses textual cues to capture complete pedestrian semantics while remaining robust to occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in the real world to enrich occluded samples. In addition, we design a Weighted Gated Feature Fusion (WGFF) method, which incorporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves state-of-the-art performance. The occlusion instance library is available at this https URL.

[CV-85] SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

Link: https://arxiv.org/abs/2605.19524
Authors: Kefei Tian,Yuansheng Lian,Kai Yang,Xiangdong Chen,Shen Li
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 testset, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.

[CV-86] iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment (CVPR 2026)

Link: https://arxiv.org/abs/2605.19522
Authors: Xinli Yue,JianHui Sun,Tao Shao,Liangchao Yao,Fan Xia,Yuetang Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 Workshop

Abstract:Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

[CV-87] Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

Link: https://arxiv.org/abs/2605.19511
Authors: Xiaodong Wu,Qi Li,Xiangman Li,Zelin Zhang,Lingshuang Liu,Jianbing Ni
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor’s training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.

[CV-88] Return of Frustratingly Easy Unsupervised Video Domain Adaptation (ICML 2026)

Link: https://arxiv.org/abs/2605.19510
Authors: Pengfei Wei,Yiqun Sun,Zhiqiang Xu,Yiping Ke,Lawrence B. Hsieh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ICML 2026

Abstract:Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.

[CV-89] EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

Link: https://arxiv.org/abs/2605.19506
Authors: Pengtao Ma,Ziliang Zhou,Ciyu Ruan,Haoyang Wang,Kaiyuan Li,Zihang Gong,Wenhua Ding,Chen Gao,Jingao Xu,Xinlei Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:First-person dynamic spatial reasoning requires models to track continuous motion and precise geometric structure, but the quadratic attention cost of Transformer-based Video-LLMs makes dense visual tokens computationally expensive. Existing token pruning paradigms predominantly rely on discrete static snapshots, failing to preserve the motion and geometric cues essential for reasoning. We propose Event Cascade Pruning (ECP), to our knowledge the first training-free framework that leverages the high-frequency motion cues from event cameras as a continuous event-guided motion prior to guide token selection. ECP combines three stages: Event-Triggered Causal Sampling to anchor motion-informative keyframes, Event-guided Motion Saliency Filtering to suppress event-inactive visual tokens, and Event-Attention Ranking Fusion to calibrate spatial attention with motion-salient dynamics. With 80% visual token reduction, ECP outperforms the full-token baseline (37.62% vs. 36.31%) while achieving 1.89x inference speedup and 52% GFLOPs reduction. We further introduce ESR-Real, the first real-world RGB-event benchmark for first-person spatial reasoning, where ECP improves accuracy by 2.68 percentage points over full-token baselines.

[CV-90] Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning (ICML 2026)

Link: https://arxiv.org/abs/2605.19491
Authors: Jiusong Ge,Yingkang Zhan,Wenjie Zhao,Di Zhang,Ke Wang,Jiashuai Liu,Chunze Yang,Chengzu Li,Jian Zhang,Yuxin Dong,Ni Zhang,Qidong Liu,Mireia Crispin-Ortuzar,Huazhu Fu,Chen Li,Zeyu Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026

Abstract:Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at this https URL.

[CV-91] Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation

Link: https://arxiv.org/abs/2605.19490
Authors: Kanglong Quan,Zhebing Xia,Linfeng Jiang,Hao Yu,Ziheng Qiao,Dapeng Dong,Dongyao Jia
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Comprehensive and efficient validation of connected and automated vehicles (CAVs) is critical prior to real-world deployment. While simulation-based testing offers scalability, existing approaches often lack seamless integration with real vehicles and field data, limiting their fidelity in capturing dynamic, real-world interactions. To bridge this gap, this paper proposes a novel real-time hybrid digital twin platform. Its core innovation lies in the tight coupling of a high-fidelity CARLA-SUMO co-simulation with a physical test site and vehicle via a low-latency Vehicle-to-Everything (V2X) communication link. A custom-developed middleware serves as the critical bridge, synchronizing a real CAV’s kinematic state as a shadow vehicle in the simulation and translating virtual control commands into chassis-actuating Controller Area Network (CAN) messages for closed-loop control. Detailed implementation includes using photogrammetry for full-scale asset reconstruction and a cloud-edge collaborative architecture for scalable, multi-user operation. Experimental results demonstrate stable synchronization and effective closed-loop control with low latency, confirming the platform’s practicality for multi-scenario CAV verification.

[CV-92] Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

Link: https://arxiv.org/abs/2605.19478
Authors: Zeyao Liu,Zhendong Zhao,Xiaojun Chen,Xin Zhao,Yuexin Xuan,Xiaoshuang Ji
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Abstract:Existing ViT backdoor attacks based on backbone-overwriting full-tuning are computationally expensive and inflict performance degradation. This has forced adversaries towards the Visual Parameter-Efficient Fine-Tuning (PEFT) paradigm, dominated by adapter-based (e.g., LoRA) and prompt-based (e.g., VPT) approaches. While adapter security has seen initial study, the risks of the burgeoning prompt-based ecosystem remain critically unexplored. We fill this critical gap, exposing how the evolution of VPT towards dynamic and context-aware architectures can facilitate a far more dangerous and emergent threat. This vulnerability arises even though these dynamic modules unlock superior benign performance. We propose VIPER, an attack framework built on a lightweight, dynamic Visual Prompt Generator (VPG) that demonstrates this vulnerability. Critically, this dynamic architecture enables Functional Fusion: an emergent phenomenon where malicious logic and benign task utility are tightly fused into the same sparse, high-magnitude parameter core. This fusion creates a formidable "hostage" dilemma, as pruning the attack necessarily destroys the benign performance. Comprehensive evaluations show VIPER effectively addresses the attacker’s trilemma: VIPER not only achieves state-of-the-art performance on clean data, but also maintains near-100% ASR even under 90% VPG-module pruning (where LoRA attacks collapse), while adding only an imperceptible 0.06ms (1.16%) of inference latency. VIPER’s results, driven by Functional Fusion, expose a new, paradigm-level risk in dynamic prompt architectures.

[CV-93] Targeted Downstream-Agnostic Attack

Link: https://arxiv.org/abs/2605.19446
Authors: Zhuxin Lei,Ziyuan Yang,Yi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the ‘threat image’, pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.

[CV-94] KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision

Link: https://arxiv.org/abs/2605.19435
Authors: Maya Yanko,Yoli Shavit
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Visual Place Recognition (VPR) is critical for autonomous navigation, yet state-of-the-art methods lack well-calibrated uncertainty estimation. Standard pipelines cannot reliably signal when a query is ambiguous or a match is likely incorrect, posing risks in safety-critical robotics. We propose KappaPlace, a principled framework for learning uncertainty-aware VPR representations. Our core contribution is a Prototype-Anchored supervision strategy that leverages latent class representatives as targets for a probabilistic objective. By modeling image descriptors as von Mises-Fisher (vMF) variables, we learn a lightweight module to predict the concentration parameter as a direct proxy for aleatoric uncertainty. While existing VPR uncertainty methods are typically restricted to a query-centric view, we derive a novel match-level formulation to quantify the reliability of specific query-reference pairs. Across five diverse benchmarks, KappaPlace reduces Expected Calibration Error (ECE@K) by up to 50% compared to existing methods while maintaining or improving retrieval recall. We provide both a joint-training variant and a post-training extension for frozen backbones. Our results demonstrate that KappaPlace provides a robust, stable, and well-calibrated signal that enables reliable decision-making within the VPR pipeline. Our code is available at: this https URL

[CV-95] Vision Harnessing Agent for Open Ad-hoc Segmentation

Link: https://arxiv.org/abs/2605.19410
Authors: Zilin Wang,Stella X. Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 11 figures

Abstract:Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: Programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

[CV-96] Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Link: https://arxiv.org/abs/2605.19398
Authors: Wooseok Jeon,Seungho Park,Seunghyun Shin,Sangeyl Lee,Hyeonho Jeong,Hae-Gon Jeon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
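The abstract does not spell out how the attention pathway is rebalanced, so the following is only one plausible reading, not the authors' method: multiply the unnormalized attention weight on reference-frame keys by a scalar `alpha` and renormalize. The function name, the `alpha` parameter, and the toy scores are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def rebalanced_attention(
    scores: Sequence[float], is_reference: Sequence[bool], alpha: float
) -> List[float]:
    """Softmax over attention scores with reference-key weights scaled by alpha.

    alpha = 1.0 reproduces the ordinary softmax; alpha < 1.0 shifts
    attention mass away from reference-frame keys (hypothetical mechanism).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    peak = max(scores)  # subtract the max for numerical stability
    unnormalized = [
        math.exp(s - peak) * (alpha if ref else 1.0)
        for s, ref in zip(scores, is_reference)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy query from a generated frame: key 0 belongs to the reference frame.
scores = [2.0, 0.0, 0.0]
is_reference = [True, False, False]
plain = rebalanced_attention(scores, is_reference, alpha=1.0)
damped = rebalanced_attention(scores, is_reference, alpha=0.25)
print(plain[0], damped[0])  # reference share before and after rebalancing
```

In this toy setting a single scalar moves attention mass from the reference key to the other keys without touching the input image or the weights, which matches the kind of control the abstract describes.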

[CV-97] Neuron Incidence Redistribution for Fairness in Medical Image Classification

Link: https://arxiv.org/abs/2605.19393
Authors: Abin Shoby,Lyle John Palmer,Nikhil Cherian Kurian
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure

Abstract:Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
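The regularizer described in the abstract, the variance across penultimate-layer neurons of the predicted-probability-weighted mean activation, can be sketched in a few lines. This is a plain-Python illustration of that one sentence under assumed shapes (a batch of activation rows plus one predicted probability per sample); the function name `nir_penalty` and the toy numbers are hypothetical, and the paper's exact normalization may differ.

```python
from statistics import pvariance
from typing import Sequence


def nir_penalty(
    activations: Sequence[Sequence[float]], probs: Sequence[float]
) -> float:
    """Variance, across neurons, of probability-weighted mean activations.

    activations[i][j] is neuron j's activation for sample i;
    probs[i] is the predicted probability for sample i.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("predicted probabilities must sum to a positive value")
    n_neurons = len(activations[0])
    weighted_means = [
        sum(p * row[j] for p, row in zip(probs, activations)) / total
        for j in range(n_neurons)
    ]
    return pvariance(weighted_means)


probs = [0.9, 0.8]
# One dominant channel carries all the evidence: high penalty.
concentrated = nir_penalty([[4.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]], probs)
# Evidence spread evenly across the layer: zero penalty.
spread = nir_penalty([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]], probs)
print(concentrated > spread)  # True
```

Note that nothing in the penalty uses demographic labels, consistent with the abstract's claim that none are needed at training time.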

[CV-98] LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Link: https://arxiv.org/abs/2605.19390
Authors: Chaoyue Li,Yongxue Xu,Jie Feng,Jiayu Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray–Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at this https URL.

[CV-99] MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

Link: https://arxiv.org/abs/2605.19386
Authors: Yang Yang,Yiyan Wang,Zheming Liu,Naoya Iwamoto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to SIGGRAPH Asia 2026

Abstract:Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

[CV-100] Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration, and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Link: https://arxiv.org/abs/2605.19378
Authors: Haiying Sha
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model. 
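Failure mode (5), tiny weight updates vanishing under bfloat16, can be reproduced with a small standard-library simulation. The helper `to_bfloat16` below is a hypothetical utility that rounds a value to the nearest bfloat16 (round-to-nearest-even, finite inputs only); it illustrates the general precision effect and is not the paper's proposed solution.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value, returned as a float.

    bfloat16 keeps the top 16 bits of the float32 encoding, so only
    7 explicit mantissa bits survive.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 1.0
small_update = 1e-3   # below half the bfloat16 spacing near 1.0 (2**-8)
large_update = 1e-2   # above that spacing

print(to_bfloat16(weight + small_update))  # 1.0: the update is lost
print(to_bfloat16(weight + large_update))  # 1.0078125: snapped to the grid
```

Near 1.0 the spacing between adjacent bfloat16 values is 2**-7, so any update smaller than half of that leaves the stored weight unchanged. A common mitigation in mixed-precision training generally is to accumulate updates in a float32 master copy of the weights; whether that is the solution the paper proposes is not stated in the abstract.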

[CV-101] Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings MICCAI2026

Link: https://arxiv.org/abs/2605.19374
Authors: Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Early accepted by MICCAI 2026

Abstract:Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at this https URL.
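The idea of suppressing noisy negatives can be illustrated with a generic masked InfoNCE loss. This is a minimal sketch under my own assumptions, not the paper's Concept-Aware NCE: the function name, the `ignore` argument, and the temperature value are all hypothetical. Reports flagged as false negatives are simply dropped from the contrastive denominator.

```python
import math

def masked_info_nce(sims, positive, ignore=frozenset(), temperature=0.07):
    """InfoNCE loss for one image against a batch of reports.

    sims:     cosine similarities between the image and each report
    positive: index of the report paired with the image
    ignore:   indices flagged as noisy negatives (excluded from the denominator)
    """
    logits = [s / temperature for s in sims]
    kept = [i for i in range(len(sims)) if i == positive or i not in ignore]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits[i] for i in kept)
    log_denom = top + math.log(sum(math.exp(logits[i] - top) for i in kept))
    return log_denom - logits[positive]

# Report 1 comes from another patient with the same finding, so its high
# similarity to the image should not be penalised.
sims = [0.90, 0.85, 0.10]
plain_loss = masked_info_nce(sims, positive=0)
masked_loss = masked_info_nce(sims, positive=0, ignore={1})
```

Masking the false negative lowers the loss, so the model is no longer pushed to separate two radiographs that share a finding.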

[CV-102] Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Link: https://arxiv.org/abs/2605.19371
Authors: Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, preserving better color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process faces an ill-posed challenge. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurred (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts x-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and x-prediction. HDFM outperforms most baseline methods on all datasets.

[CV-103] Scalable Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

Link: https://arxiv.org/abs/2605.19360
Authors: Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Applied Physics (physics.app-ph); Optics (physics.optics)
Comments: 30 Pages, 8 Figures

Abstract:The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness, three properties that are difficult to achieve together in purely digital systems.

[CV-104] MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Link: https://arxiv.org/abs/2605.19359
Authors: Halil Ibrahim Gulluk, Olivier Gevaert
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: this https URL

[CV-105] Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance SIGGRAPH2026

Link: https://arxiv.org/abs/2605.19355
Authors: Soojin Choi, Seokhyeon Hong, Chaelin Kim, Junghyun Nam, Junhyuk Jeon, Junyong Noh
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: SIGGRAPH 2026 / ACM TOG. Project page available at this https URL

Abstract:Retargeting motion across characters with varying body shapes while preserving interaction semantics, such as self-contact and near-body proximity, remains a challenging problem. While recent geometry-aware approaches address this by maintaining spatial relationships between predefined corresponding regions, their reliance on static correspondences often struggles when the target character exhibits exaggerated body proportions. In this paper, we present a geometry-aware motion retargeting framework that preserves interaction semantics by performing proximity matching over spatially adaptive anchors. Unlike prior methods with static anchor definitions, the proposed method dynamically repositions anchors to reachable regions on the target character. This is achieved via a Transformer-based anchor refinement strategy that predicts anchor displacements and constrains the translated anchors to remain on the target character geometry through differentiable soft projection. By incorporating pose-dependent spatial structures from the source character, the adapted anchors provide structurally coherent guidance for interaction-aware retargeting. Conditioned on these anchors, a graph-based autoencoder predicts target skeletal motion that preserves the spatial configuration of the source. To encourage task-aligned optimization between anchor adaptation and motion retargeting, we adopt an alternating training scheme in which each module is optimized in turn. Through extensive evaluations, we demonstrate that our method outperforms state-of-the-art approaches in preserving interaction fidelity across diverse character geometries.

[CV-106] Semantic-Enriched Latent Visual Reasoning

Link: https://arxiv.org/abs/2605.19342
Authors: Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

[CV-107] Selective Regularized and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation CVPR2026

Link: https://arxiv.org/abs/2605.19340
Authors: Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 11 figures, 13 tables. Accepted to CVPR 2026

Abstract:Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

[CV-108] RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Link: https://arxiv.org/abs/2605.19329
Authors: Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures, 6 tables

Abstract:Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at this https URL.

[CV-109] DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

Link: https://arxiv.org/abs/2605.19322
Authors: Minyoung Park, Taehun Kong, Sangjun Ahn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
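The temporal budgeting idea can be sketched as follows. This is a toy reading of the abstract, not DynaTok's implementation: the function name, the L2 novelty score, the `decay` and `min_tokens` parameters, and the rule for the first frame are all assumptions. Each frame is scored by its distance to an EMA memory of past frames, and the token budget is split in proportion to that score.

```python
import math

def allocate_temporal_budget(frames, total_tokens, decay=0.9, min_tokens=1):
    """Split total_tokens across frames in proportion to their novelty.

    frames: list of per-frame feature vectors (lists of floats).
    """
    spare = total_tokens - min_tokens * len(frames)
    if spare < 0:
        raise ValueError("total_tokens is too small for min_tokens per frame")

    memory = list(frames[0])
    novelty = [0.0]
    for feat in frames[1:]:
        # Novelty = distance between the frame and the running EMA memory.
        novelty.append(math.dist(feat, memory))
        memory = [decay * m + (1.0 - decay) * f for m, f in zip(memory, feat)]

    # The first frame has nothing to be compared with: treat it as fully novel.
    novelty[0] = max(novelty) if max(novelty) > 0 else 1.0

    total = sum(novelty)
    budget = [min_tokens + int(spare * n / total) for n in novelty]
    # Integer rounding leaves a few tokens over; give them to the most novel frames.
    leftover = total_tokens - sum(budget)
    for i in sorted(range(len(frames)), key=lambda i: -novelty[i])[:leftover]:
        budget[i] += 1
    return budget

# Three static frames followed by a scene change.
frames = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]]
budget = allocate_temporal_budget(frames, total_tokens=20)
```

With a slow memory (`decay` close to 1), a frame that repeats the new scene loses budget only gradually, as the memory catches up with the change.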

[CV-110] TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Link: https://arxiv.org/abs/2605.19320
Authors: Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Comments:

Abstract:Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
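To make the reward pipeline concrete, here is a minimal sketch of turning binary defect judgments into a scalar reward and then into group-relative advantages of the kind GRPO uses. The level weights, function names, and the choice of population standard deviation are assumptions for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev

# Hypothetical weights per level; not taken from the paper.
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.3, "glyph": 0.5}

def hierarchical_reward(defects):
    """defects maps a level name to a list of booleans (True = defect found)."""
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        checks = defects.get(level, [])
        pass_rate = 1.0 if not checks else 1.0 - sum(checks) / len(checks)
        reward += weight * pass_rate
    return reward

def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each reward by the mean and spread of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Three images sampled for the same prompt, judged at three levels.
group = [
    {"global": [False], "word": [False, False], "glyph": [False, False, False]},
    {"global": [False], "word": [True, False], "glyph": [True, False, False]},
    {"global": [True], "word": [True, True], "glyph": [True, True, True]},
]
rewards = [hierarchical_reward(d) for d in group]
advantages = group_relative_advantages(rewards)
```

The same scalar could feed DPO by taking the highest- and lowest-reward samples in a group as the preferred and rejected pair, which is one plausible reading of how a single signal supports both objectives.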

[CV-111] SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Link: https://arxiv.org/abs/2605.19319
Authors: Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

[CV-112] MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems

Link: https://arxiv.org/abs/2605.19307
Authors: Quanxing Xu, Yuhao Tian, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual Question Answering (VQA), as the representative multimodal task, serves as a key benchmark for evaluating the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, existing evaluations largely rely on static datasets and accuracy-based metrics, which fail to capture robustness, consistency, and generalization. Inspired by Metamorphic Testing (MT), we propose Metamorphic Robustness Assessment (MetaRA), a testing framework that employs Metamorphic Relations (MRs) to systematically probe vulnerabilities in MLLM-based VQA systems. MetaRA generates controlled variations of image-question inputs based on specific MRs and evaluates models across diverse conditions. Applying MetaRA to multiple MLLM-based VQA models across different tasks reveals nuanced failure patterns, including sensitivity to linguistic perturbations, over-reliance on superficial visual cues, and deeper weaknesses in multimodal reasoning. Experimental results demonstrate that MetaRA provides richer diagnostic insights than conventional accuracy metrics, exposing failure modes that remain hidden under standard benchmarks. Overall, this work highlights the need for systematic robustness evaluation in VQA and positions metamorphic assessment as a scalable, model-agnostic approach toward trustworthy multimodal AI.
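A metamorphic relation needs no ground-truth label: it only states how the output should relate across transformed inputs. The sketch below checks one such relation, paraphrase invariance, against a toy model. The relation, the function names, and the stand-in model are illustrative assumptions and are not taken from MetaRA.

```python
def check_paraphrase_invariance(vqa_model, image, question, paraphrases):
    """MR: rewording a question without changing its meaning
    should not change the model's answer."""
    reference = vqa_model(image, question).strip().lower()
    violations = []
    for variant in paraphrases:
        answer = vqa_model(image, variant).strip().lower()
        if answer != reference:
            violations.append((variant, answer))
    return reference, violations

def toy_model(image, question):
    """Stand-in model with a brittle shortcut: it keys on the literal word 'color'."""
    return "red" if "color" in question.lower() else "unknown"

reference, violations = check_paraphrase_invariance(
    toy_model,
    image=None,
    question="What color is the car?",
    paraphrases=[
        "Which color does the car have?",
        "What is the colour of the car?",  # British spelling breaks the shortcut
    ],
)
```

Here the spelling variant exposes a failure that an accuracy score on the original question alone would not show.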

[CV-113] Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes SIGGRAPH2026

Link: https://arxiv.org/abs/2605.19305
Authors: Tianshu Kuai, Arman Maesumi, Daniel Ritchie, Noam Aigerman
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: In ACM Transactions on Graphics (SIGGRAPH 2026). Project page: this https URL

Abstract:This paper tackles the task of learning to generate signals over triangle meshes in a triangulation-agnostic manner, meaning the trained model can be applied to different meshes and triangulations effectively. Practically, the paper adapts the flow matching (FM) paradigm to a mesh-based, triangulation-agnostic setting. Theoretically, it proposes a specific noise distribution which is triangulation agnostic, to be used inside the FM model’s denoising process. While noise distributions are usually trivial to devise for, e.g., images, devising a triangulation-agnostic distribution proves to be a much more difficult task. We formulate a mathematical definition of triangulation agnosticism of distributions, via their spectrum. We then show that a discretization of a specific Gaussian random field called a Matérn process holds these desired properties, and provides a simple and efficient sampling algorithm. We use it as our noise model, and adapt FM to the triangulation-agnostic setting by using a state-of-the-art approach for learning signals on meshes in the gradient domain – PoissonNet – as the denoiser. We conduct experiments on elaborate tasks such as sampling elastic rest states, and generating poses of humanoids. Our method is shown to be capable of producing highly realistic results for meshes of over one million triangles, significantly exceeding the state-of-the-art in quality and diversity.

[CV-114] MMGS: 10× Compressed 3DGS through Optimal Transport Aggregation based on Multi-view Ranking

Link: https://arxiv.org/abs/2605.19304
Authors: Beizhen Zhao, Sicheng Yu, Ziran Yin, Dongxu Shen, Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 19 pages

Abstract:While 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it suffers from significant overhead due to massive redundant primitives. Existing compression methods typically rely on local sampling or fixed pruning thresholds, which often struggle to balance redundancy reduction with high-fidelity rendering. To address this, we propose a novel framework that formulates Gaussian optimization as a global geometric distribution matching problem. Specifically, our approach integrates three components: (1) we introduce a multi-view 3D Gaussian contribution ranking mechanism that filters primitives using geometric consistency instead of local heuristics; (2) we propose a global Optimal Transport (OT)-based aggregation algorithm that merges redundant primitives while preserving the underlying geometry; and (3) we design an OT-based densification operator that maintains the Gaussian’s distributional properties for stable optimization. Our approach achieves state-of-the-art rendering quality with only 10% of the primitives and 10× faster training compared to vanilla 3DGS.

[CV-115] iGSP: Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models

Link: https://arxiv.org/abs/2605.19301
Authors: Xuezhi Cui, Dongbo Zhou, Wang Guo, Zeyuan Wang, Ziyu Li, Gaozhi Zhou, Xian Li, Ling Zhao, Wentao Yang, Chao Tao, Haifeng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue that alignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7% compared to current SOTA methods, and decreasing the final total parameters by 86.9% relative to counterparts. The source code is available at this https URL.
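The geometry behind projecting a gradient onto a subspace can be shown explicitly. iGSP performs this projection implicitly through a regularizer, so the sketch below is only an illustration of the target geometry, with hypothetical function names and an assumed orthonormal basis: the gradient is split into a part inside the historical subspace and a residual outside it, and a penalty on the residual pushes updates toward the subspace.

```python
def project_onto_subspace(grad, basis):
    """Project grad onto span(basis); basis vectors are assumed orthonormal."""
    projected = [0.0] * len(grad)
    for u in basis:
        coeff = sum(g * ui for g, ui in zip(grad, u))
        projected = [p + coeff * ui for p, ui in zip(projected, u)]
    return projected

def subspace_penalty(grad, basis):
    """Squared norm of the gradient component that leaves the subspace."""
    inside = project_onto_subspace(grad, basis)
    return sum((g - p) ** 2 for g, p in zip(grad, inside))

# Historical subspace spanned by the first two coordinate axes.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [3.0, 4.0, 5.0]
inside = project_onto_subspace(grad, basis)   # component reusable from old tasks
penalty = subspace_penalty(grad, basis)       # energy of the task-specific residual
```

In this toy case the residual along the third axis is what a later fine-tuning phase would fit once the regularization is removed.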

[CV-116] What Makes Synthetic Data Effective in Image Segmentation ICML2026

Link: https://arxiv.org/abs/2605.19289
Authors: Jinjin Zhang, Xiefan Guo, Yizhou Jin, Nan Zhou, Di Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2026

Abstract:Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at this https URL.

[CV-117] FPED: A Functional-Network Prior-Guided Mixture-of-Experts Framework for Interpretable Brain Decoding

Link: https://arxiv.org/abs/2605.19279
Authors: Yudan Ren, Pengcheng Shi, Zihan Ma, Xiaowei He, Xiao Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Visual image reconstruction from functional Magnetic Resonance Imaging (fMRI) is a fundamental task in brain decoding, providing a crucial pathway for understanding human perceptual mechanisms and developing advanced brain-computer interfaces (BCIs). However, most current methods simply flatten fMRI signals from localized visual cortices into one-dimensional (1D) vectors, mapping them directly into latent spaces such as that of Contrastive Language-Image Pre-training (CLIP). This paradigm not only disrupts the inherent network topology of the brain, leading to limited neuroscientific interpretability, but also overlooks the synergistic contributions of other distributed functional networks in processing high-level visual semantics. To address these limitations, we propose FPED, a Functional-Network Prior-Guided Mixture of Experts (MoE) framework for interpretable brain decoding. FPED explicitly models different functional brain networks as specialized experts and employs adaptive routing to capture their complementary contributions to visual semantic understanding. Unlike conventional homogeneous decoding paradigms, our framework incorporates neurobiologically grounded priors to enable structured and interpretable network-level representation learning. Experimental results demonstrate that FPED achieves highly competitive semantic reconstruction performance with only 0.68B parameters. The learned routing dynamics reveal biologically meaningful correspondence between functional brain networks and modality-specific semantic processing, providing transparent neuroscientific interpretability. This suggests that brain network-aware expert modeling is a promising direction for bridging neural decoding and biologically inspired artificial intelligence.

[CV-118] Distribution Matching Distillation without Fake Score Network

Link: https://arxiv.org/abs/2605.19256
Authors: Youngjoong Kim, Deokyeong Lee, Jaesik Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Distribution Matching Distillation (DMD) provides an effective distribution-level correction for few-step generation, while relying on an auxiliary fake-score network to track the evolving generative distribution. Recent work combines DMD-style objectives with flow-map generators to exploit both forward-divergence training and reverse-divergence correction. The fake-score estimator remains an additional component with memory and update overhead. In this work, we study whether this explicit tracker can be avoided when the generator itself has a flow-map structure. We propose Fake-Score-network-Free DMD (FSF-DMD), a DMD formulation for flow-map generators that replaces the auxiliary fake-score estimator with a generator-induced pseudo-velocity surrogate. The key observation is that the endpoint pseudo-velocity of a flow-map generator provides a tractable proxy for fake-velocity estimation, allowing the generator itself to supply the reverse-divergence signal. Building on this observation, we derive a practical objective, extend it with flow-map-consistent backward simulation, and introduce a self-teacher variant for training from scratch. In our ImageNet-1K 256×256 experiments, FSF-DMD improves flow-map baselines, reaches lower FID than the listed DMD2 comparisons in the flow-map-initialized setting, and remains effective under flow-matching initialization and training from scratch.

[CV-119] Structuring Open-Ended NAS: Semi-Automated Design Knowledge Structuring with LLMs for Efficient Neural Architecture Search

Link: https://arxiv.org/abs/2605.19247
Authors: Yuiko Sakuma,Masakazu Yoshimura,Marcel Gröpl,Zitang Sun,Junji Otsuka,Atsushi Irie,Takeshi Ohashi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 42 pages

Abstract:Current neural architecture search (NAS) methods are often limited by their predefined, restrictive search spaces. While recent large language model (LLM)-assisted NAS methods enable open-ended search spaces, they often suffer from inefficient exploration due to biased or low-quality design ideas. To address these issues, we propose to semi-automatically structure model design knowledge to guide the search process. Our approach first defines a high-level structural template of architectural attributes. An LLM then populates this template by analyzing papers, creating a rich and diverse search space that embodies this structured design knowledge. To efficiently explore this vast space, we introduce FairNAD, using a multi-type mutation that enables broad exploration through mutation with fair idea sampling, Pareto-aware mutation, LLM-driven iterative mutation, and a fine-grained feedback loop. We demonstrate the effectiveness of FairNAD in discovering high-performing architectures that yield 0.84, 2.17, and 2.35 points improvement on CIFAR-10, CIFAR-100, and ImageNet16-120, respectively, compared to current state-of-the-art methods.

[CV-120] PhyWorld: Physics-Faithful World Model for Video Generation

Link: https://arxiv.org/abs/2605.19242
Authors: Pu Zhao,Juyi Lin,Timothy Rupprecht,Arash Akbari,Chence Yang,Rahul Chowdhury,Elaheh Motamedi,Arman Akbari,Yumei He,Chen Wang,Geng Yuan,Weiwei Chen,Yanzhi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

[CV-121] Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

Link: https://arxiv.org/abs/2605.19230
Authors: Nikhil Cherian Kurian,Victor Caquilpan Parra,Abin Shoby,Luke Whitbread,Lyle J. Palmer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 Pages, 3 Figures

Abstract:Age-dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train-test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age-difficulty trends using robust, Huber-weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age-dependent true and false positive disparities with minimal AUC impact and remains robust to increasing train-test age distribution shifts.

[CV-122] HAVEN: Hierarchically Aligned Multimodal Benchmark for Unified Video Understanding

Link: https://arxiv.org/abs/2605.19223
Authors: Mengqi Shi,Haopeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While Multimodal Large Language Models (MLLMs) exhibit strong performance on standard video tasks, their ability to faithfully summarize and reason over complex narratives remains poorly evaluated. Existing summarization benchmarks fragment supervision across isolated granularities, such as keyframes, key shots, or disjointed text summaries, failing to capture the inherently hierarchical structure of cross-modal alignment. To address this critical gap, we introduce HAVEN, a hierarchically aligned multimodal benchmark for unified video understanding. HAVEN pioneers a fully granular (frame, shot, and video levels) and fully multimodal (video and text) dataset architecture, complete with explicit, continuous alignment between modalities. Built upon this unified annotation paradigm, we propose a comprehensive evaluation suite spanning summarization, temporal reasoning, multimodal grounding, and saliency ranking. Extensive benchmarking of state-of-the-art MLLMs exposes a persistent gap between surface-level textual fluency and grounded multimodal understanding. Ultimately, HAVEN advances the evaluation of multimodal systems beyond traditional QA formats, offering a rigorous, standardized testbed to drive future research in interpretable, hierarchical video understanding. We publicly release the dataset, benchmark suite, and evaluation protocols.

[CV-123] Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Link: https://arxiv.org/abs/2605.19218
Authors: Beomseok Kang,Dongwon Jo,Jiwon Song,Donghwee Son,Jae-Joon Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while the head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.
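
As a rough illustration of the rotation idea described in this abstract (not the authors' implementation), the NumPy sketch below rotates Queries and Keys with an orthogonal basis taken from an uncentered SVD of the Keys, then keeps only the leading channels. The helper name `rotate_and_prune`, the tensor shapes, and the use of an offline SVD in place of an online PCA are all assumptions made for this example.

```python
import numpy as np

def rotate_and_prune(Q, K, r):
    """Rotate Q and K into the principal basis of K and keep the first r channels.

    Hypothetical helper: an offline, uncentered SVD stands in for the
    online PCA-based rotation mentioned in the abstract.
    """
    # Rows of Vt are the principal directions of K, ordered by singular value.
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    R = Vt.T                       # orthogonal (d x d) when n_tokens >= d
    Q_rot, K_rot = Q @ R, K @ R    # Q_rot @ K_rot.T equals Q @ K.T for orthogonal R
    return Q_rot[:, :r], K_rot[:, :r]

rng = np.random.default_rng(0)
n_tokens, d, rank = 64, 16, 4
# Synthetic Keys that lie exactly in a 4-dimensional subspace of the 16 channels.
K = rng.normal(size=(n_tokens, rank)) @ rng.normal(size=(rank, d))
Q = rng.normal(size=(8, d))

Q_p, K_p = rotate_and_prune(Q, K, r=rank)
full_scores = Q @ K.T
pruned_scores = Q_p @ K_p.T
max_err = float(np.abs(full_scores - pruned_scores).max())
```

Because the synthetic Keys are exactly low-rank, the pruned attention scores match the full ones up to floating-point error. Real Keys are only approximately low-rank, so the error would grow as fewer channels are kept.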

[CV-124] Worst-Group Equalized Odds Regularization for Multi-Attribute Fair Medical Image Classification

Link: https://arxiv.org/abs/2605.19214
Authors: Nikhil Cherian Kurian,Victor Caquilpan Parra,Abin Shoby,Luke Whitbread,Lauren Oakden-Rayner,Robert Vandersluis,Jessica Schrouff,Lyle J. Palmer,Mark Jenkinson
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 Pages, 2 Figures

Abstract:Diagnostic performance in medical AI varies systematically across demographic groups, yet subgroup AUC can mask clinically important disparities. At a fixed inference-time operating point, some groups may exhibit over-diagnostic behaviour, characterized by elevated true and false positive rates, while others show under-diagnostic patterns with reduced true and false positive rates. These opposing tendencies can cancel in aggregate AUCs while producing meaningful inequities in clinical decision-making. Motivated by the need to assess and mitigate such disparities at the operating point and across multiple demographic attributes simultaneously, we propose a worst-group equalized-odds margin regularizer. The proposed regularizer explicitly targets subgroup-level deviations on both the true positive and false positive sides at inference. At each update, the method identifies subgroups defined by explicit demographic attributes (e.g., age, sex, and race) that exhibit the most extreme margin deviations and applies a unified penalty, enabling fairness optimization across multiple demographic axes without requiring explicit intersectional constraints. Across two medical imaging datasets in realistic multi-label settings, our method consistently reduces disparities in Equalized Odds and Equalized Opportunity with minimal impact on AUC, preserving diagnostic performance while improving fairness.
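
To make the operating-point view concrete, the sketch below computes per-group true and false positive rates at a fixed threshold and reports the group that deviates most from the overall rates. It is an evaluation-side illustration only, not the regularizer proposed in the paper; the function names and the toy data are assumptions, and each group is assumed to contain both positive and negative cases.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and binary predictions.

    Assumes every group contains at least one positive and one negative case.
    """
    rates = {}
    for g in np.unique(groups):
        in_group = groups == g
        tpr = float(y_pred[in_group & (y_true == 1)].mean())
        fpr = float(y_pred[in_group & (y_true == 0)].mean())
        rates[str(g)] = (tpr, fpr)
    return rates

def worst_group_gap(y_true, y_pred, groups):
    """Group with the largest absolute TPR or FPR deviation from the overall rate."""
    overall_tpr = float(y_pred[y_true == 1].mean())
    overall_fpr = float(y_pred[y_true == 0].mean())
    gaps = {
        g: max(abs(tpr - overall_tpr), abs(fpr - overall_fpr))
        for g, (tpr, fpr) in group_rates(y_true, y_pred, groups).items()
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]

# Toy data: group "a" is classified perfectly, group "b" is always wrong.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])
groups = np.array(["a", "a", "a", "a", "a", "a", "b", "b"])
worst, gap = worst_group_gap(y_true, y_pred, groups)
```

Here the overall TPR is 0.75 and the overall FPR is 0.25, so group "b" (TPR 0, FPR 1) is the worst group with a gap of 0.75.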

[CV-125] Smartphone-based Circular Plot Sampling for Forest Inventory

Link: https://arxiv.org/abs/2605.19213
Authors: Su Sun,Jui-Cheng Chiu,Nabin Khanal,Songlin Fei,Yingjie Victor Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor-intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large-scale environments. We present a lightweight, smartphone-based pipeline that enables complete plot-sampling-based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM-derived camera poses with segmented depth maps, with absolute real-world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%), respectively, with consistent performance across varying starting directions and positions. Cross-video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non-expert forest managers in diverse operational settings.
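
The two error figures quoted above can be computed as in the following sketch. It assumes MARE stands for mean absolute relative error, which the abstract does not spell out; the function name and the DBH values are made up for illustration.

```python
import numpy as np

def mae_and_mare(estimated_cm, reference_cm):
    """Return (MAE in cm, MARE in percent) for paired DBH values."""
    estimated = np.asarray(estimated_cm, dtype=float)
    reference = np.asarray(reference_cm, dtype=float)
    abs_err = np.abs(estimated - reference)
    mae = float(abs_err.mean())
    # Assumption: MARE = mean of |error| / reference, expressed in percent.
    mare = float((abs_err / reference).mean() * 100.0)
    return mae, mare

# Made-up DBH values in centimetres, for illustration only.
reference_dbh = [20.0, 40.0, 25.0]
estimated_dbh = [21.0, 38.0, 25.5]
mae_cm, mare_pct = mae_and_mare(estimated_dbh, reference_dbh)
```

With these toy values the absolute errors are 1, 2, and 0.5 cm, giving an MAE of about 1.17 cm and a MARE of 4%.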

[CV-126] D-Convexity: A Unified Differentiable Convex Shape Prior via Quasi-Concavity for Data-driven Image Segmentation CVPR2026

Link: https://arxiv.org/abs/2605.19210
Authors: Shengzhe Chen,Hao Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2026

Abstract:Convexity is a fundamental geometric prior that underlies many natural and man-made structures, yet remains challenging to impose effectively in end-to-end trainable segmentation networks. We revisit convexity from a functional perspective and propose a unified, threshold-free convexity prior based on the quasi-concavity of the network’s output mask function u. Instead of constraining a single binary segmentation, we require all super-level sets of u to be convex, transforming global shape constraints into local, differentiable inequalities on u and its derivatives. From this principle, we derive zero-, first-, and second-order characterizations, yielding respectively a local midpoint convexification algorithm, a gradient-based condition linked to supporting hyperplanes, and a sufficient second-order inequality expressed as a quadratic form on the tangent plane. The first- and second-order formulations produce a compact convolutional loss that can be densely applied across the image without thresholding. Our quasi-concavity losses integrate seamlessly with modern segmentation networks via the proposed convex gradient projection module (CGPM). They consistently enforce convexity and improve shape regularity across multiple datasets, outperforming networks tailored for retinal segmentation and surpassing previous shape-aware methods. Remarkably, our analysis unifies a wide spectrum of previous convex shape models, from discrete 1-0-1 line constraints and graph-cuts convexity formulations to curvature or signed distance Laplacian based level-set priors, within a single continuous and differentiable framework.
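
The quasi-concavity condition can be illustrated in one dimension, where a convex super-level set is simply a contiguous interval. The brute-force check below tests the standard inequality u[k] >= min(u[i], u[j]) for every k between i and j. It is a toy 1D sketch with a hypothetical function name, not the paper's differentiable loss or its midpoint convexification algorithm.

```python
def is_quasi_concave_1d(u):
    """True if every super-level set {i : u[i] >= t} is a contiguous interval.

    Checks u[k] >= min(u[i], u[j]) for all indices i < k < j.
    """
    n = len(u)
    for i in range(n):
        for j in range(i + 2, n):
            floor = min(u[i], u[j])
            if any(u[k] < floor for k in range(i + 1, j)):
                return False
    return True

single_peak = [0.1, 0.4, 0.9, 0.6, 0.2]   # one bump: every threshold gives one interval
two_peaks = [0.1, 0.8, 0.3, 0.8, 0.1]     # dip between two bumps: threshold 0.5 gives two pieces

single_ok = is_quasi_concave_1d(single_peak)
double_ok = is_quasi_concave_1d(two_peaks)
```

The single-peak profile passes and the two-peak profile fails, which mirrors the intuition that a mask with a gap in the middle is not convex at some threshold.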

[CV-127] Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

Link: https://arxiv.org/abs/2605.19207
Authors: Sumanth Meenan Kanneti,Aryan Shah
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
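
The headline numbers in this abstract can be checked with simple arithmetic, and the effect of a Float16 cast can be shown on a stand-in tensor. The sketch below uses only the figures quoted above plus a hypothetical weight array; it does not reproduce the paper's TensorFlow Lite pipeline.

```python
import numpy as np

# Figures quoted in the abstract.
size_full_mb, size_quant_mb = 35.34, 5.76
acc_full, acc_quant = 82.20, 82.37

compression_ratio = size_full_mb / size_quant_mb   # about 6.14
accuracy_delta_pp = acc_quant - acc_full            # about +0.17 percentage points

# Generic illustration of a Float16 cast on a hypothetical weight tensor.
weights = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
weights_fp16 = weights.astype(np.float16)
bytes_ratio = weights.nbytes / weights_fp16.nbytes   # 2.0 from the cast alone
max_abs_error = float(np.abs(weights - weights_fp16.astype(np.float32)).max())
```

Casting float32 values to float16 halves the storage of the tensor itself, so the cast alone accounts for a 2x reduction; the abstract does not say where the remainder of the reported 6.14x comes from.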

[CV-128] Efficient coding along the visual hierarchy

Link: https://arxiv.org/abs/2605.19155
Authors: Ananya Passi,Brian S. Robinson,Michael F. Bonner
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 34 pages, 6 figures

Abstract:Biological visual systems learn from limited experience, unlike deep learning models that rely on millions of training images. What learning principles make this possible? We tested whether efficient coding, the idea that neural representations capture the statistical structure of natural inputs, can build a hierarchy of human-aligned visual features from limited data. We developed an unsupervised learning procedure in which each layer of a deep network compresses its inputs onto the dominant modes of variation in natural images, using only local statistics and no labels, tasks, or backpropagation. This unsupervised procedure yields features that progress from edges and colors to textures and shapes. The features of this deep efficient coding model are readily recognized by human observers and are predictive of image-evoked fMRI responses in human visual cortex. Furthermore, a hybrid learning procedure that combines efficient coding with supervised fine-tuning yields better brain alignment in low-data settings and more rapid category learning. These findings suggest that efficient coding may shape representations across the entire visual hierarchy and help explain the data efficiency of biological vision.

[CV-129] Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models CVPR2026

Link: https://arxiv.org/abs/2605.19137
Authors: Svetlana Orlova,Niccolò Cavagnero,Gijs Dubbelman
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2026 Workshops CV4Smalls

Abstract:Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model. Code: this https URL .

[CV-130] Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening IJCAI2026

Link: https://arxiv.org/abs/2605.19133
Authors: Muskaan Chopra,Lorenz Sparrenberg,Jan H. Terheyden,Rafet Sifa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCAI 2026

Abstract:Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.
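
Confidence-based abstention, as evaluated in this abstract, amounts to deferring every case whose confidence falls below a threshold and scoring only the rest. The sketch below computes coverage and selective accuracy for a toy set of predictions; the function name, the threshold, and the data are assumptions for illustration, and the calibration step is omitted.

```python
import numpy as np

def selective_metrics(confidence, correct, threshold):
    """Coverage and selective accuracy when abstaining below `threshold`."""
    keep = confidence >= threshold
    coverage = float(keep.mean())
    # Selective accuracy is undefined if the model abstains on every case.
    selective_acc = float(correct[keep].mean()) if keep.any() else float("nan")
    return coverage, selective_acc

# Toy predictions: confidence scores and whether each prediction was correct.
confidence = np.array([0.95, 0.90, 0.80, 0.60, 0.55, 0.40])
correct = np.array([1, 1, 1, 0, 1, 0])

cov_all, acc_all = selective_metrics(confidence, correct, threshold=0.0)
cov_sel, acc_sel = selective_metrics(confidence, correct, threshold=0.7)
```

Without abstention the toy model covers all six cases at 4/6 accuracy; abstaining below 0.7 halves the coverage and raises selective accuracy to 1.0, which is the trade-off the selective metrics are meant to expose.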

[CV-131] FAGER: Factually Grounded Evaluation and Refinement of Text-to-Image Models CVPR2026

Link: https://arxiv.org/abs/2605.19111
Authors: Youngsun Lim,Cusuh Ham,Pin-Yu Chen,Deepti Ghadiyaram
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: It was accepted for an oral presentation at the 2nd Workshop on the Evaluation of Generative Foundation Models (EVGENFM2026) at CVPR 2026. Total 8 pages (1 page for references). 5 figures

Abstract:Existing text-to-image (T2I) evaluation metrics mainly assess whether generated images align with information explicitly stated in the prompt, but often fail to capture factual requirements that are implicit, externally grounded, or identity-defining. As a result, they are not well suited for evaluating factual correctness in prompts involving scientific knowledge, historical facts, products, or culture-specific concepts. We propose FActually Grounded Evaluation and Refinement (FAGER), an agentic framework that evaluates whether generated images correctly reflect visually verifiable facts grounded in or implied by the prompt, while also providing actionable feedback for improvement. FAGER first constructs a structured factual rubric by combining LLM-based fact proposal with reference-guided visual fact extraction and verification, then converts the rubric into question-answer pairs for VLM-based evaluation. To validate FAGER as a factuality metric, we introduce a Factual A/B test, which measures whether a metric prefers factual reference images over corresponding generated images. Across five datasets spanning science, history, products, culture, and knowledge-intensive concepts, FAGER consistently outperforms prior metrics on this test. We further show that FAGER can be used to refine T2I outputs in a fully training-free manner, yielding substantial factuality gains across datasets.
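
One plausible reading of the Factual A/B test is a pairwise preference rate: for each prompt, the metric scores a factual reference image and a generated image, and the test counts how often the reference wins. The sketch below implements that reading; the function name, the treatment of ties as losses, and the scores are assumptions, not details taken from the paper.

```python
def factual_ab_accuracy(reference_scores, generated_scores):
    """Fraction of pairs in which the reference image receives the higher score.

    Assumption: ties count as failures for the metric under test.
    """
    if len(reference_scores) != len(generated_scores):
        raise ValueError("score lists must be paired one-to-one")
    wins = sum(r > g for r, g in zip(reference_scores, generated_scores))
    return wins / len(reference_scores)

# Made-up scores from a hypothetical metric, for illustration only.
reference_scores = [0.9, 0.7, 0.8, 0.4]
generated_scores = [0.6, 0.75, 0.5, 0.3]
ab_accuracy = factual_ab_accuracy(reference_scores, generated_scores)
```

In this toy case the metric prefers the reference in three of four pairs, giving an A/B accuracy of 0.75.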

[CV-132] CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering ACL2026

Link: https://arxiv.org/abs/2605.19075
Authors: Mahesh Bhosale,Abdul Wasi,Vishvesh Trivedi,Pengyu Yan,Akhil Gorugantu,David Doermann
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

Abstract:Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at this https URL.

[CV-133] Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

Link: https://arxiv.org/abs/2605.19074
Authors: Sumit Laha,Ankit Sharma,Hassan Foroosh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid global expansion of solar photovoltaic (PV) capacity, which reached a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding precocious convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models’ abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.

[CV-134] LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Link: https://arxiv.org/abs/2605.19060
Authors: Xinhe Zhang,Yuyang Zhang,Pengfei Jin,Arnau Marin-Llobet,Na Li,Quanzheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional z-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at roughly 135× lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

[CV-135] Personalized Face Privacy Protection From a Single Image

Link: https://arxiv.org/abs/2605.19032
Authors: Zachary Yahn,Fatih Ilhan,Tiansheng Huang,Selim Tekin,Sihao Hu,Yichang Xu,Margaret Loper,Ling Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user’s identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at this https URL

[CV-136] MedFM-Robust: Benchmarking Robustness of Medical Foundation Models MICCAI2026

Link: https://arxiv.org/abs/2605.19027
Authors: Xiangxiang Cui,Tianjin Huang,Yifang Wang,Lijie Hu,Lu Yin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI2026

Abstract:Medical foundation models (MedFMs) have emerged as transformative tools in healthcare, demonstrating capabilities across diverse clinical applications. These models can be broadly categorized into two paradigms: Medical Vision-Language Models (Med-VLMs) and segmentation foundation models. Med-VLMs range from medical-specialized models such as LLaVA-Med and MedGemma, to general-purpose models like GPT-4o and Gemini, all capable of medical image understanding tasks including visual question answering (VQA), report generation, and visual grounding. Concurrently, the Segment Anything Model (SAM) has catalyzed a new generation of medical segmentation models, with adaptations like SAM-Med2D and MedSAM. The widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions.

[CV-137] A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection

Link: https://arxiv.org/abs/2605.19020
Authors: Rahul Anand,Siddharth Singh,Dileep A D,Mahadeva Prasanna,Raghavendra Ramachandra
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision foundation models have demonstrated strong transferability across diverse visual recognition tasks and are increasingly considered for biometric applications. Their suitability for iris Presentation Attack Detection (PAD), particularly under realistic open-set operating conditions, remains insufficiently examined. This work presents a systematic failure analysis of general-purpose vision foundation models for open-set iris PAD using periocular imagery. Five representative foundation models are evaluated under three open-set protocols that explicitly separate different sources of distribution shift: unseen Presentation Attack Instruments (PAIs), unseen datasets captured with different sensors and cross-spectral transfer from near-infrared (NIR) to visible spectrum (VIS) imagery. Both frozen feature representations and parameter-efficient task adaptation using Low-Rank Adaptation (LoRA) are assessed within a unified experimental framework. The results indicate that foundation models can transfer across datasets with similar sensing characteristics, but fail to generalise reliably to unseen attack instruments and degrade sharply under cross-spectral evaluation. While LoRA improves performance in certain cross-dataset settings, it frequently amplifies failure under attack-level and spectral shifts. Additional validation experiments using segmented iris inputs, full backbone fine-tuning, joint cross-dataset and cross-PAI shifts, and reverse VIS to NIR transfer further confirm that these failures are not simply artefacts of periocular input, weak adaptation, or one-directional spectral evaluation. These findings show that strong closed-set or cross-dataset performance should not be treated as evidence of robust open-set security, and highlight the need for PAD representations that maintain sensitivity to presentation artefacts while remaining stable under realistic deployment variation.

[CV-138] EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction

Link: https://arxiv.org/abs/2605.19004
Authors: Ahmad Yehia,Abduallah Mohamed,Tianyi Wang,Jiseop Byeon,Kun Qian,Junfeng Jiao,Christian Claudel
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 21 pages, 14 figures. Project page: this https URL

Abstract:Accurately forecasting human trajectories from an egocentric perspective plays a central role in applications such as humanoid robotics, wearable sensing systems, and assistive navigation. However, progress in this direction remains limited due to the scarcity of egocentric trajectory datasets collected in real-world environments. Addressing this need, we introduce EgoTraj, an egocentric multimodal open dataset recorded using Meta Quest Pro (MQPro). EgoTraj contains 75 sequences of human navigation collected from multiple MQPro wearers in real-world urban environments. Each recording provides synchronized RGB video along with ground-truth data, including continuous time-synchronized 6-degree-of-freedom head poses, per-frame 3D eye gaze vectors, and scene annotations. To the best of our knowledge, EgoTraj differs from typical egocentric trajectory datasets by capturing long-horizon, self-directed navigation across diverse urban routes with broad participant diversity. To demonstrate the potential of the dataset, we benchmark several state-of-the-art methods for egocentric trajectory prediction and conduct ablation studies to analyze the contributions of gaze, scene, and motion cues. The results highlight the utility of EgoTraj for AR-based perception, navigation, and assistive systems. The EgoTraj dataset, code, and EgoViz Dashboard are publicly available at this https URL.

[CV-139] Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos

Link: https://arxiv.org/abs/2605.18984
Authors: Yuqi Tang,Yang Shi,Zhuoran Zhang,Qixun Wang,Xuehai Bai,Yue Ding,Ruizhe Chen,Bohan Zeng,Xinlong Chen,Xuanyu Zhu,Bozhou Li,Yuran Wang,Yifan Dai,Chengzhuo Tong,Xinyu Liu,Yiyan Ji,Yujie Wei,Yuhao Dong,Shilin Yan,Fengxiang Wang,Yi-Fan Zhang,Haotian Wang,Yuanxing Zhang,Pengfei Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent video generative models have greatly improved the realism of AI-generated videos, yet their outputs still exhibit artifacts such as temporal inconsistencies, structural distortions, and semantic incoherence. While Multimodal Large Language Models (MLLMs) show strong visual understanding capabilities, their ability to perceive and reason about such artifacts remains unclear. Existing benchmarks often lack systematic evaluation of artifact-aware perception and fine-grained diagnostic reasoning, especially across diverse AI-generated video domains beyond photorealistic content. To address this gap, we introduce Artifact-Bench, a comprehensive benchmark for evaluating MLLMs on AI-generated video artifact detection and analysis. We first establish a three-level hierarchical taxonomy of realism artifacts, covering photorealistic, animated, and CG-style videos. Based on this taxonomy, Artifact-Bench defines three complementary tasks: real vs. AI-generated video classification, pairwise realism comparison, and fine-grained artifact identification. Experiments on 19 leading MLLMs reveal substantial limitations in artifact perception and reasoning, with many models approaching random or even below-random performance in challenging settings. We further observe significant misalignment between MLLM judgments and human perceptual preferences, highlighting their limited reliability as general evaluators for AI-generated video realism.

[CV-140] Harnessing Self-Supervised Features for Art Classification

Link: https://arxiv.org/abs/2605.18974
Authors: Federico Melis,Davide Bilardello,Emanuele Prato,Evelyn Turri,Lorenzo Baraldi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: IRCDL 2026

Abstract:Classifying artworks presents a significant challenge due to the complex interplay of fine-grained details and abstract features that condition the style or genre of an artwork. This paper presents a systematic investigation of the effectiveness of supervised and self-supervised backbones as feature extractors for both artwork classification and retrieval, with a particular focus on paintings. We conduct an extensive experimental evaluation using the DINO family and CLIP models, assessing multiple classification strategies and feature representations. Our results demonstrate that employing a self-supervised backbone leads to consistent improvements in artwork classification performance. Moreover, our work provides insights into the applicability of classification and retrieval modules in real-world applications, such as virtual reality (VR) applications that support museum navigation.

[CV-141] MotionMERGE: A Multi-granular Framework for Human Motion Editing Reasoning Generation and Explanation

Link: https://arxiv.org/abs/2605.18956
Authors: Bizhu Wu,Jinheng Xie,Wenting Chen,Zhe Kong,Jianfeng Ren,Linlin Shen,Ruibin Bai,Rong Qu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent motion-language models unify tasks like comprehension and generation but operate at a coarse granularity, lacking fine-grained understanding and nuanced control over body parts needed for animation or interaction. This stems from fundamental issues in both the model and the data, in which the model can’t focus on motion’s localized pattern, and the training data lacks fine-grained supervision. To tackle this, we propose MotionMERGE, a unified framework that bridges the granularity gap. First, we pioneer the study of fine-grained language-guided motion control, including detailed understanding and localized editing, by explicitly modeling motion at part and temporal levels within a single LLM, thereby endowing the model with robust priors for precise control. Second, we design Reasoning-Aware Granularity-Synergy pre-training, a novel strategy that employs joint supervision for cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. This equips the model with fine-grained motion-language alignment, cross-granularity synergy, and explicit reasoning ability. Third, we curate MotionFineEdit, a large-scale dataset (837K atomic + 144K complex triplets) with the first fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations, establishing a new benchmark for fine-grained text-driven motion editing and motion-grounded reasoning. Extensive experiments demonstrate the capability of MotionMERGE for more precise motion generation, understanding, and editing, and compelling zero-shot generalization to other complex motion tasks. This work represents a significant step toward models that interact with motion in finer granularity and human-like reasoning.

[CV-142] CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation CVPR2026

Link: https://arxiv.org/abs/2605.18916
Authors: Gyubin Lee,Junwon Lee,Juhan Nam
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: accepted to CVPR 2026 Workshop on Sight and Sound

Abstract:We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing VideoText-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at this https URL

[CV-143] Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Link: https://arxiv.org/abs/2605.18903
Authors: Qiuhe Hong,Yuyang Liu,Shuo Yang,Tiantian Peng,Fei Zhu,Yonghong Tian
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal tasks while retaining prior knowledge. The emerging paradigm that couples Multimodal Large Language Models (MLLMs) with Reinforcement Learning with Verifiable Rewards (RLVR) calls for a new pattern to guide continual adaptation. Advances in reasoning capability now make it feasible to impose constraints at the reasoning level. We formalize portability, a sample-level measure of how reusable the previous policy’s behavior is on a new task, and empirically show that reasoning-level signals remain reliable on out-of-distribution samples while answer-level signals do not. We instantiate this as Reasoning Portability (RP) and propose Reasoning-based Dynamic Balance Continual Learning (RDB-CL), which modulates the per-sample Kullback-Leibler regularization in RLVR according to RP: a tight anchor preserves reusable reasoning on high-RP samples, while a relaxed anchor on low-RP samples permits exploration of new reasoning pathways. Experiments show that RDB-CL consistently outperforms baselines, improving Last accuracy by +12.0% over the vanilla RLVR baseline.
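
As a rough illustration of the RP-modulated regularization described in this abstract, the following minimal sketch maps a per-sample RP score to a KL coefficient: high RP gives a tight anchor to the previous policy, low RP a relaxed one. The assumption that RP lies in [0, 1], the linear interpolation, the function names, and the constants `beta_min` and `beta_max` are all hypothetical and are not taken from the paper.

```python
def kl_coefficient(rp, beta_min=0.01, beta_max=0.1):
    """Map a per-sample RP score to a KL weight (hypothetical linear rule).

    rp is assumed to lie in [0, 1]; higher RP means the previous policy's
    reasoning is more reusable, so the anchor is tighter (larger weight).
    """
    if not 0.0 <= rp <= 1.0:
        raise ValueError("rp is assumed to lie in [0, 1]")
    return beta_min + rp * (beta_max - beta_min)


def regularized_objective(policy_term, kl_to_previous, rp):
    """Per-sample objective: policy term minus the RP-weighted KL penalty."""
    return policy_term - kl_coefficient(rp) * kl_to_previous
```

Under these assumptions, a sample with RP near 1 is penalized about ten times more strongly for drifting from the previous policy than a sample with RP near 0.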

[CV-144] Navigating the Emotion Tree: Hierarchical Hyperbolic RAG for Multimodal Emotion Recognition

Link: https://arxiv.org/abs/2605.18884
Authors: Zeheng Wang,Bo Zhao,Yijie Zhu,Zhishu Liu,Hui Ma,Ruixin Zhang,Shouhong Ding,Qianyu Xie,Zitong Yu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal emotion recognition aims to integrate text, audio, and video sources to understand human affective states. Although multimodal large language models excel at multimodal reasoning, they typically treat emotion categories as independent labels, ignoring the rich hierarchical taxonomy of human psychology. Moreover, lacking external contextual knowledge makes them highly susceptible to over-interpreting noisy cues, further complicating fine-grained emotion classification. To address these issues, we propose HyperEmo-RAG, a retrieval-augmented generation framework that leverages a structured emotional knowledge base. Our framework introduces two key innovations. 1) Hierarchical hyperbolic grounding. Recognizing the inherent hierarchical tree structure of emotion taxonomies, we jointly embed hierarchical emotion labels and multimodal samples into a continuous hyperbolic space (Poincaré ball) and design a hierarchical beam-search deliberation process that progressively retrieves samples from coarse to fine-grained levels. 2) Structured evidence injection. Based on the retrieved evidence, we construct an evidence graph and inject the structured knowledge as explicit cognitive context into the LLM through a Tree-Aware Attention mechanism and an EmotionGraphFormer, preserving the integrity of graph-structured information. Experiments on multiple datasets demonstrate that HyperEmo-RAG significantly outperforms existing methods.
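
For readers unfamiliar with the Poincaré ball mentioned in this abstract, the sketch below computes the standard geodesic distance between two points inside the unit ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). This is the textbook formula only; how HyperEmo-RAG actually embeds labels and retrieves samples is not specified here, and the function name is chosen for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)
```

Distances grow rapidly as points approach the boundary, which is why tree-like hierarchies are often embedded with coarse nodes near the origin and fine-grained nodes near the edge of the ball.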

[CV-145] A Multi-Dimensional Clustering Approach for Identifying Inborn Errors of Immunity

Link: https://arxiv.org/abs/2605.18880
Authors: Nishad Kulkarni,Alexandra K. Martinson,Nicholas L. Rider,Michael Keller,Syed Muhammad Anwar
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: Accepted at EMBC 2026

Abstract:Rare diseases such as inborn errors of immunity (IEI) require early diagnosis to prevent end organ damage and improve quality of life. Hurdles in accessing and curating large-scale electronic health record (EHR) data limit the routine data-driven analyses needed to stay at the forefront of IEI and other rare disease trends. Both the development of machine learning (ML) algorithms for pattern recognition in IEI and published methodology on how to systematically process and integrate complex medical data are limited. Our proposed pipeline, including data curation and ML clustering algorithms, is designed to recognize novel rare disease patterns and extract IEI-associated features from a national data registry. Our methodology for EHR data formatting and processing presents the pipeline that transforms raw immunologic lab data into vectors. This is further combined with hyperparameter tuning for disease pattern recognition via clustering. This study refines IEI feature awareness, develops data tool kits for rare disease population analysis, and expands on transforming complex medical records into data structures interpretable by unsupervised ML.

[CV-146] DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

Link: https://arxiv.org/abs/2605.18868
Authors: Ye Sun,Xin Wang,Jiaming Zhang,Yifeng Gao,Yixu Wang,Yifan Ding,Qixian Zhang,Henghui Ding,Xingjun Ma,Yu-Gang Jiang
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 23 pages, 13 figures

Abstract:While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

[CV-147] From Llama to Cria: Scaling Down Neural Networks via Neuron-Level Spectral Structural Importance Evaluation

Link: https://arxiv.org/abs/2605.18860
Authors: Yongyu Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper proposes a neuron pruning framework based on neuron-level spectral structural importance evaluation. Given a trained neural network, we record the hidden states of each hidden layer during inference and model neurons as graph nodes, with hidden states treated as graph signals. Using ideas from graph signal processing, we infer layer-wise input and output graphs that characterize the structural relationships among neurons before and after each layer transformation. We then evaluate the spectral structural importance of neurons by analyzing the transformation between these graphs based on spectral graph theory. Neurons with high spectral structural importance are regarded as strongly involved in the internal representation transformation and are therefore preserved, while neurons with low importance scores are selected as pruning candidates. The pruning process is conducted iteratively until a predefined effective parameter reduction target is reached. Instead of fine-tuning after every pruning step, the proposed strategy first removes low-importance neurons to obtain a compact architecture and then applies a final recovery fine-tuning stage to restore task performance. By connecting neuron pruning with graph signal processing and spectral structural analysis, the proposed framework offers a principled way to reduce neural network size while maintaining solution quality. Experimental results on CIFAR-10 image classification and SST-2 sentiment classification show that our method can effectively remove low-importance neurons and achieve compact networks with competitive performance after recovery fine-tuning.

[CV-148] Delta Attention Residuals

Link: https://arxiv.org/abs/2605.18855
Authors: Cheng Luo,Zefan Cai,Junjie Hu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈ 0.2), limiting the model’s ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, i.e., the change introduced by each sublayer ($\mathbf{v}_i = \mathbf{h}_{i+1} - \mathbf{h}_i$), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M–7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7–8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at this https URL.
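
To make the delta idea in this abstract concrete, the sketch below computes per-layer deltas v_i = h_{i+1} − h_i from a list of hidden states and mixes them with softmax attention weights. Only the delta definition comes from the abstract; the dot-product scoring against a single query vector, the function names, and the toy inputs are assumptions for illustration and may differ from the paper's parameterization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deltas(hidden_states):
    """v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [[b - a for a, b in zip(h_prev, h_next)]
            for h_prev, h_next in zip(hidden_states, hidden_states[1:])]


def route(values, query):
    """Softmax-attention mix of `values`, scored by dot product with `query`.

    The dot-product scoring rule is an assumption made for this sketch.
    """
    scores = [sum(q * x for q, x in zip(query, v)) for v in values]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return mixed, weights
```

Passing `deltas(hidden_states)` to `route` attends over per-layer changes, whereas passing `hidden_states[1:]` directly would correspond to attending over cumulative states.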

[CV-149] INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference

Link: https://arxiv.org/abs/2605.18853
Authors: Ahmed Šabanović,Paul Joe Maliakel,Ivona Brandić
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 8 pages, 3 figures

Abstract:Edge deployment of Vision-Language Models (VLMs) faces a trade-off between latency and accuracy: cloud execution provides high-quality predictions but incurs communication delay and energy cost, while edge-only execution is faster but less accurate due to limited model capacity. This trade-off is further complicated by heterogeneity in image quality and reasoning complexity, making static placement suboptimal. We present INAR-VL, a lightweight edge-cloud routing system for multimodal inference in a two-tier deployment. INAR-VL maintains complementary VLMs across edge and cloud and uses lightweight image and text complexity signals to guide routing and model selection, executing simple queries locally while offloading complex ones when beneficial. Evaluation on visual question answering shows that INAR-VL executes 36% of requests on the edge, reduces latency by 24%, lowers energy by 26%, and preserves 97% of cloud-level accuracy.
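
The routing idea in this abstract can be pictured with a deliberately simple threshold rule: queries whose complexity signals are low stay on the edge, and the rest are offloaded. The `Query` class, the [0, 1] complexity scores, the max-based combination, and the threshold value are all hypothetical; the abstract does not describe how INAR-VL computes or combines its signals.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """A request with hypothetical complexity scores, each assumed to lie in [0, 1]."""
    image_complexity: float
    text_complexity: float


def route(query, threshold=0.5):
    """Return "edge" for simple queries and "cloud" otherwise (illustrative rule only)."""
    score = max(query.image_complexity, query.text_complexity)
    return "edge" if score < threshold else "cloud"
```

For example, `route(Query(0.1, 0.2))` keeps the request on the edge, while `route(Query(0.1, 0.9))` offloads it because the text signal exceeds the threshold.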

[CV-150] Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation

Link: https://arxiv.org/abs/2605.18836
Authors: Minyoung Oh,Najeong Chae,Jae-Young Sim
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages

Abstract:Dataset Distillation (DD) synthesizes a compact synthetic dataset that preserves the training utility of a full dataset. However, its standard formulation assumes that test data follow the same distribution as training data, an assumption that rarely holds in practice. A straightforward extension, applying post-hoc Domain Generalization (DG) techniques to distilled data, is ill-suited because existing DG methods rely on the natural diversity of real datasets, which compact synthetic sets inherently lack, while also incurring substantial augmentation overhead that conflicts with the efficiency objective of dataset distillation. To address this limitation, we introduce Domain Generalizable Dataset Distillation (DGDD), a new problem setting that explicitly targets out-of-distribution (OOD) generalization of distilled datasets. We study this problem through a widely adopted DD baseline of Distribution Matching (DM). We attribute the OOD vulnerability of DM to the entanglement of class-discriminative and domain-specific information within the compressed synthetic set, and propose Spectral Gradient Surgery (SGS) to disentangle the two. The key insight of SGS is that cross-domain agreement among domain-wise gradients in the spectral domain reveals which gradient components are shared across source domains (and are therefore class-discriminative) and which are domain-specific. Based on this observation, SGS augments the standard DM update with two complementary gradients: one that reinforces cross-domain shared components and another that explicitly promotes diversity within the distilled dataset. Extensive experiments on diverse-scale benchmarks demonstrate that SGS substantially improves OOD generalization while remaining plug-and-play compatible with existing DM methods.

[CV-151] XFlowMap: Cross-Scale Generalization and Mapping of Massive Origin-Destination Data

Link: https://arxiv.org/abs/2605.18777
Authors: Diansheng Guo,Hai Jin
Categories: Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Mapping large origin-destination (OD) datasets remains challenging because flow maps become cluttered, meaningful patterns occur at multiple spatial scales, and existing flow-mapping approaches frequently rely on predefined aggregation units or manual generalization. This paper presents XFlowMap, a framework for the cross-scale generalization and mapping of massive OD data. Specifically, the framework integrates cross-scale flow pattern (cluster) detection, automated flow map generalization, and a new cartographic representation for analyzing and visualizing complex origin-destination flow structures. The approach detects salient flow patterns at their appropriate origin and destination scales, extracts high-level structures, and generates a new flow map representation that supports holistic interpretation of complex origin-destination flow patterns. A scan-statistic-based procedure is developed to evaluate and generalize cross-scale flow clusters. The detected clusters are then visualized using a novel flow symbol that integrates location, direction, strength, and OD scales in a single representation. The framework supports both area-based and point-based OD data, is robust to sparse and noisy datasets, and enables comparative mapping of stratified flow data. Experiments with synthetic data and U.S. migration data demonstrate that the method effectively extracts meaningful cross-scale flow patterns and produces clear, information-rich flow maps for large mobility datasets, supporting both static presentation and interactive exploration.

[CV-152] FGSVQA: Frequency-Guided Short-form Video Quality Assessment

Link: https://arxiv.org/abs/2605.20016
Authors: Xinyi Wang,Angeliki Katsenou,Junxiao Shen,David Bull
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 1 figure

Abstract:Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: this https URL.
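
The SRCC and PLCC figures quoted in this abstract are the Spearman rank-order and Pearson linear correlation coefficients between predicted quality scores and subjective scores. The sketch below shows one plain-Python way to compute both; the function names are chosen for illustration, and this is the standard textbook definition rather than the paper's evaluation code.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

SRCC rewards getting the ordering of videos right, while PLCC also penalizes nonlinear distortion of the score scale, which is why the two values usually differ.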

[CV-153] Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction

Link: https://arxiv.org/abs/2605.19354
Authors: Yilmaz Korkmaz,Vishal M. Patel
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:MRI reconstruction is an inherently ill-posed inverse problem, since incomplete measurements admit many plausible solutions. This ambiguity becomes more severe under high acceleration, where pixel-domain continuous predictors tend to average over feasible reconstructions and suppress high-frequency anatomy. We address this limitation by moving reconstruction to a discrete multi-scale latent space and posing it as autoregressive next-acceleration-scale prediction. Leveraging discrete priors proven effective in visual autoregressive modeling, our method restricts the solution to compact sequences of codebook tokens, enabling sharp reconstructions even from extremely sparse measurements. This discrete autoregressive formulation also aligns naturally with modern large language model post-training techniques. Building on this observation, we introduce on-policy privileged information distillation for visual autoregressive modeling, where a teacher receives training-only privileged context that is unavailable at inference (in our case, fully sampled acquisitions) and supervises a student trained on its own rollouts, leading to consistent reconstruction gains. Through extensive experiments on the fastMRI benchmark, we show that our approach delivers improved reconstruction performance across diverse sampling patterns under extreme undersampling. The project website is available at this https URL.

[CV-154] From Division to Decision: Leveraging Temporal Cell-Stage Segmentation for Embryo Transferability Prediction

Link: https://arxiv.org/abs/2605.18923
Authors: Yasmine Hachani(MALT),Patrick Bouthemy(MALT),Elisa Fromont(MALT),Véronique Duranthon(BREED, ENVA),Ludivine Laffont(BREED),Alline de Paula Reis(BREED, ENVA)
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Accurate selection of bovine embryos is a challenging task, as current practice relies on a single expert assessment on the seventh day after insemination, resulting in high rates of pregnancy loss. Time-lapse videomicroscopy provides detailed information on early development, but is difficult to exploit because of complex motion patterns and time-consuming analysis. We propose TransFACT, a transformer-based framework for modeling early developmental stages and embryo transferability using 2D time-lapse videos from the first four days of development. TransFACT combines frame-level temporal features with stage-level representations, using developmental stages as auxiliary supervision to predict transferability on day four. Our experiments demonstrate that TransFACT, by leveraging an existing method designed for action recognition, outperforms its competitor in predicting embryo transferability.

[CV-155] Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis

Link: https://arxiv.org/abs/2605.18878
Authors: Jana Armouti,Laura Hutchins,Jacob Duplantis,Thomas Deiss,Thales Nogueira Gomes,Keyur H. Patel,Seema Walvekar,Shane Guillory,Thomas H. Fox,Amita Krishnan,Ricardo Rodriguez,Bennett DeBoisblanc,Deva Ramanan,John Galeotti,Gautam Gare
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:Hospital readmission within 30 days of discharge is a leading driver of morbidity, mortality, and avoidable healthcare expenditure in congestive heart failure (CHF). Current clinical risk stratification tools rely primarily on non-imaging data and exhibit limited predictive performance. Point-of-care lung ultrasound (LUS) offers a sensitive, noninvasive window into the pulmonary congestion that characterizes CHF decompensation, yet its prognostic utility for readmission prediction remains largely unexplored. We present a pilot feasibility study, the first systematic machine learning study using B-mode LUS acquired during hospitalization to predict 30-day CHF readmission. Quantitative spatiotemporal embeddings are extracted from a pretrained Temporal Shift Module (TSM) ResNet-18 encoder, and interpretable biomarker features are separately evaluated. Through structured ablations over lung view, temporal representation, multi-view fusion, and cross-lung augmentation, we identify the key imaging factors driving readmission risk. Our findings reveal that (1) dependent lower-lung regions (Left-3, Right-3) carry the strongest prognostic signal, consistent with their greater susceptibility to hydrostatic congestion; (2) temporal difference features between sequential examinations substantially outperform single-timepoint representations, highlighting the importance of capturing disease trajectory; and (3) multi-view feature concatenation yields the best overall performance, with our top MLP model achieving an F1 score of 0.80 (95% CI: 0.62-0.96). Biomarker analysis further reveals that pleural-line abnormalities, including breaks and indentations, are as informative as the canonical A-line and B-line markers. These results support POCUS-derived biomarkers as practical, interpretable tools for noninvasive CHF risk stratification. 

[CV-156] SpecX: A Large-Scale Benchmark for Multi-Modal Spectroscopy and Cross-Paradigm Evaluation

Link: https://arxiv.org/abs/2605.18791
Authors: Chengrui Xiang,Tengfei Ma,Yujie Chen,Tong Wang,Haowen Chen,Xiangxiang Zeng
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT)
Comments: 9 pages, 1 figure

Abstract:Existing spectral benchmarks are limited in scale, modality alignment, and evaluation scope, and typically focus on either specialized models or multimodal language models (MLLMs). We introduce SpecX, a large-scale benchmark for multi-modal spectroscopy with cross-paradigm evaluation. SpecX contains 1.7M molecules with diverse spectral modalities, including NMR (1H, 13C, HSQC), IR, MS, UV, Raman, and FL, and is organized into three tiers: a large-scale dataset for pretraining, an aligned multi-spectral subset for benchmarking, and a high-quality experimental subset for evaluation. SpecX supports a range of tasks such as molecular elucidation, spectrum simulation, and spectral understanding, and enables unified evaluation across both specialized spectral models and MLLMs. Experiments show that specialized models excel at signal-level modeling, while MLLMs exhibit strengths in high-level reasoning but lack precise spectral grounding. SpecX establishes a unified benchmark for spectral intelligence and highlights the need for spectrum-native foundation models.

Artificial Intelligence

[AI-0] Atoms of Thought: Universal EEG Representation Learning with Microstates

Link: https://arxiv.org/abs/2605.20182
Authors: Xinyang Tian,Ruitao Liu,Ziyi Ye,Siyang Xue,Xin Wang,Xuesong Chen
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables

Abstract:Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.

[AI-1] A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

Link: https://arxiv.org/abs/2605.20173
Authors: Vasundra Srinivasan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 25 pages, 2 figures, 6 tables. Companion repo at this https URL

Abstract:Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.
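
The four-part contract described in the abstract (proposer, verifier, commit step, reject signal) can be pictured with a small sketch. This is only an illustration under assumptions, not the paper's reference implementation: the names `Outcome` and `run_boundary`, the retry loop, and the use of a string as the reject signal are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    committed: bool
    value: Optional[str] = None
    reject_reason: Optional[str] = None


def run_boundary(
    propose: Callable[[Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    commit: Callable[[str], None],
    max_attempts: int = 3,
) -> Outcome:
    """Turn a stochastic proposal into a system action, or report why not.

    Hypothetical sketch: `propose` stands in for the LLM call, `verify` is a
    deterministic check that returns None on success or a reject reason.
    """
    reason: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose(reason)   # proposer sees the last reject signal
        reason = verify(candidate)    # None means the candidate is accepted
        if reason is None:
            commit(candidate)         # the only place a side effect happens
            return Outcome(committed=True, value=candidate)
    return Outcome(committed=False, reject_reason=reason)
```

The design choice worth noting is that the side effect lives only in `commit`, so a rejected proposal never touches system state; whether the paper's patterns enforce this in the same way is not stated in the abstract.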

[AI-2] Long-term Power Grid Planning via Answer Set Programming

Link: https://arxiv.org/abs/2605.20172
Authors: Antonio Ielo,Francesco Doria,Sandra Castellanos-Paez,Marco Maratea,Francesco Percassi,Mauro Vallati
Affiliation: unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures

Abstract:The power grid is a critical infrastructure underpinning all aspects of modern society and its services. Maintaining its effectiveness requires continuous adaptations. In particular, addressing sustainability targets, demand patterns, and urbanisation trends requires implementing changes to the network. Actual developments can potentially span over a decade, with supply continuity and service quality that must be preserved throughout by ensuring conformance to several topological and combinatorial invariants. Long-term power grid planning deals with the above process, and although planning languages could be a natural choice, the kind of properties and invariants needed are cumbersome to express in such languages; by contrast, they can be elegantly and succinctly encoded in Answer Set Programming (ASP). In this paper, we propose the first approach to automate and optimise the long-term power grid planning process using ASP. Experimental evaluations conducted on synthetic and real-world grid data confirm the expressive power of the proposed ASP-based approach and demonstrate its effectiveness.

[AI-3] HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

Link: https://arxiv.org/abs/2605.20167
Authors: Salma Hoque Talukdar Koli,Fahima Haque Talukder Jely,Md. Samiul Alim,Md. Zakir Hossen
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 9 figures. To be submitted to this http URL

Abstract:Flash floods in Bangladesh’s haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km²). Temperature was acting as a seasonal cheat code: it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.
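
One way to read "RF 0.5625 + XGBoost 0.4375" is as soft-voting weights on the two models' predicted flood probabilities. The sketch below assumes that reading; the abstract does not confirm it. The tier names and the thresholds 0.3 and 0.6 are hypothetical placeholders, since the abstract only says that the alert pipeline has three tiers.

```python
RF_WEIGHT = 0.5625    # weight quoted in the abstract for the random forest
XGB_WEIGHT = 0.4375   # weight quoted in the abstract for XGBoost


def ensemble_probability(p_rf: float, p_xgb: float) -> float:
    """Weighted average of the two models' flood probabilities (assumed)."""
    return RF_WEIGHT * p_rf + XGB_WEIGHT * p_xgb


def alert_tier(p: float, medium: float = 0.3, high: float = 0.6) -> str:
    """Map a probability to one of three tiers; thresholds are hypothetical."""
    if p >= high:
        return "high"
    if p >= medium:
        return "medium"
    return "low"
```

For example, model outputs of 0.8 (RF) and 0.4 (XGBoost) combine to 0.5625 × 0.8 + 0.4375 × 0.4 = 0.625, which falls in the top tier under these placeholder thresholds.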

[AI-4] Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Link: https://arxiv.org/abs/2605.20164
Authors: Utkarsh Tyagi,Xingang Guo,MohammadHossein Rezaei,Daniel George,Anas Mahmoud,Jackson Lee,Bing Liu,Yunzhong He
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 7 figures, 6 tables

Abstract:Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion’s human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy’s outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins 24 of 30 base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in 2.5–4× fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.
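
The strict-completion metric is defined precisely enough in the abstract to sketch directly. The second function is only an illustration of why a saturated or unreachable criterion carries no signal across a group of rollouts; it uses the variance of a pass/fail outcome as a stand-in and is not POW3R's actual weighting rule, which the abstract does not specify. The data layout and function names are assumptions.

```python
from typing import List, Sequence, Tuple

# One prompt's grading: a list of (is_required, is_satisfied) pairs.
Grading = List[Tuple[bool, bool]]


def strict_completion(gradings: Sequence[Grading]) -> float:
    """Fraction of prompts whose response meets every required criterion."""
    if not gradings:
        return 0.0
    passed = sum(
        all(satisfied for required, satisfied in grading if required)
        for grading in gradings
    )
    return passed / len(gradings)


def rollout_contrast(satisfied: Sequence[bool]) -> float:
    """Variance of one criterion's pass/fail outcome across rollouts.

    Illustrative proxy only: it is zero when every rollout passes
    (saturated) or every rollout fails (currently unreachable).
    """
    p = sum(satisfied) / len(satisfied)
    return p * (1.0 - p)
```

A criterion that all rollouts satisfy scores zero contrast no matter how large its human-assigned weight is, which is the mismatch the abstract points to.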

[AI-5] Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

Link: https://arxiv.org/abs/2605.20120
Authors: Gabriel Rongyang Lau
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.

[AI-6] Toto 2.0: Time Series Forecasting Enters the Scaling Era

Link: https://arxiv.org/abs/2605.20119
Authors: Emaad Khwaja,Chris Lettieri,Gerald Woo,Eden Belouadah,Marc Cenac,Guillaume Jarry,Enguerrand Paquin,Xunyi Zhao,Viktoriya Zhukov,Othmane Abou-Amal,Chenghao Liu,Ameet Talwalkar,David Asker
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Code: this https URL Weights: this https URL

Abstract:We show that time series foundation models scale: a single training recipe produces reliable forecast-quality improvements from 4M to 2.5B parameters. We release Toto 2.0, a family of five open-weights forecasting models trained under this recipe. The Toto 2.0 family sets a new state of the art on three forecasting benchmarks: BOOM, our observability benchmark; GIFT-Eval, the standard general-purpose benchmark; and the recent contamination-resistant TIME benchmark. This report describes our experimental results and details the design decisions behind Toto 2.0: its architecture and training recipe, training data, and the u-muP hyperparameter transfer pipeline. All five base checkpoints are released under Apache 2.0.

[AI-7] k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics

Link: https://arxiv.org/abs/2605.20108
Authors: Ben Wooding,Hongchao Zhang,Taylor T. Johnson,Abolfazl Lavaei
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 18 pages, 5 figures, 3rd International Conference on Neuro-Symbolic Systems (NeuS)

Abstract:While conventional (k=1) discrete-time barrier certificate conditions impose strict safety constraints by requiring the function to be non-increasing at every step, k-inductive barrier certificates relax this by allowing a temporary increase – up to k-1 times, each within a threshold $\epsilon$ – while maintaining overall safety and improving flexibility. This paper leverages neural networks and constructs k-inductive neural barrier certificates (k-NBCs) for (partially) unknown nonlinear systems. While neural networks offer scalability in the design process, they lack formal guarantees, requiring additional approaches such as counterexample-guided inductive synthesis (CEGIS) with satisfiability modulo theories (SMT) for verification. However, the CEGIS-SMT framework requires knowledge of system dynamics, which is unavailable in practical settings. To address this, we leverage the generalization of Willems et al.'s fundamental lemma, using a single state trajectory, to construct a data-driven representation of (partially) unknown models for SMT verification without sacrificing accuracy. Additionally, CEGIS-SMT further removes the constraint of restricting barrier certificates to specific function classes, such as sum-of-squares, enabling greater flexibility in their design. We validate our approach on three nonlinear case studies with (partially) unknown dynamics.
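
The relaxation described above can be checked empirically on a sampled trajectory: each single step may raise the barrier value by at most ε, while over any window of k steps the value must not increase. The sketch below assumes exactly that reading of the abstract; it omits the initial-set and unsafe-set conditions, and it is a sanity check on samples rather than the SMT-based verification the paper uses. The function name is hypothetical.

```python
from typing import Sequence


def satisfies_k_inductive_decrease(
    barrier_values: Sequence[float], k: int, eps: float
) -> bool:
    """Check the two decrease conditions along one sampled trajectory.

    barrier_values[t] is the barrier function evaluated at the state at
    time t. With k=1 this reduces to requiring a non-increasing sequence.
    """
    b = barrier_values
    one_step_ok = all(b[t + 1] - b[t] <= eps for t in range(len(b) - 1))
    k_step_ok = all(b[t + k] - b[t] <= 0.0 for t in range(len(b) - k))
    return one_step_ok and k_step_ok
```

A trajectory whose barrier values go 0.0, 0.1, -0.1, 0.0, -0.2 fails the conventional k=1 condition because of the small upward steps, but passes with k=2 and ε=0.15.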

[AI-8] Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

Link: https://arxiv.org/abs/2605.20107
Authors: Robert Jenkinson Alvarez
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H \succ 0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with HamJEPA, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by +4.89 kNN@20 and +3.52 linear-probe points at 30 epochs, and by +6.45 kNN@20 and +10.64 linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-q improves by +4.82 kNN@20 and +7.52 linear-probe points at 45 epochs.
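
The covariance claim can be illustrated numerically under one assumption that the abstract does not spell out: that the energy budget is the constraint tr(HΣ) = c on a d-dimensional covariance Σ. Under that constraint, Σ = (c/d)H⁻¹ maximizes the Gaussian entropy term log det Σ, and an isotropic covariance with the same budget has a strictly smaller value whenever H is not a multiple of the identity. The sketch works in the eigenbasis of H, so both matrices are represented by their eigenvalues; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def budget_matched_covariances(
    h_eigs: Sequence[float], c: float
) -> Tuple[List[float], List[float]]:
    """Eigenvalues of (c/d) H^{-1} and of the isotropic (c / tr H) I."""
    d = len(h_eigs)
    structured = [c / (d * h) for h in h_eigs]
    isotropic = [c / sum(h_eigs)] * d
    return structured, isotropic


def energy(h_eigs: Sequence[float], cov_eigs: Sequence[float]) -> float:
    """tr(H Sigma) when H and Sigma share an eigenbasis (assumed budget)."""
    return sum(h * s for h, s in zip(h_eigs, cov_eigs))


def log_det(cov_eigs: Sequence[float]) -> float:
    """log det Sigma, the covariance-dependent part of Gaussian entropy."""
    return sum(math.log(s) for s in cov_eigs)
```

With H having eigenvalues 1 and 4 and budget c = 2, the structured covariance has eigenvalues 1 and 0.25 and the isotropic one has 0.4 and 0.4. Both spend the same energy, but log det is log 0.25 for the former versus 2 log 0.4 for the latter, so the gap is one concrete instance of a cost paid for isotropy.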

[AI-9] Draft Less Retrieve More: Hybrid Tree Construction for Speculative Decoding

Link: https://arxiv.org/abs/2605.20104
Authors: Yuhao Shen,Tianyu Liu,Xinyi Hu,Quan Kong,Baolin Zhang,Jun Dai,Jun Zhang,Shuang Ge,Lei Chen,Yue Li,Mingcheng Wan,Cong Wang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval compensates for pruning-induced coverage loss and recovers accepted length. By employing a sequential 'prune-then-graft' mechanism, Graft attaches highly predictive retrieved tokens into positions opened by pruning, filling the topological gaps with near-zero overhead. Graft is entirely training-free and lossless. Comprehensive evaluations show that Graft establishes a new Pareto frontier across practical deployment settings, including short-context generation, long-context generation, and large-scale models. On short-context benchmarks, it achieves up to 5.41× speedup and improves average speedup over EAGLE-3 by up to 21.8% on the large-scale Qwen3-235B. We also provide a preliminary exploration of applying Graft to the DFlash-style block drafting paradigm, offering initial evidence and insights for extending grafting beyond autoregressive draft trees.

[AI-10] Neurosymbolic Learning for Inference-Time Argumentation

Link: https://arxiv.org/abs/2605.20098
Authors: Gabriel Freedman,Adam Dejl,Adam Gould,Mansi,Lihu Chen,Jianqi Jiang,Francesca Toni
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores (representing intrinsic strengths) and (ii) to compute ternary (true/false/uncertain) predictions from generated, scored arguments. As a result, at training time, argument generation and scoring can be optimised according to the quality of the induced argumentative predictions. Moreover, at inference time, the final prediction is faithful, by construction, to the arguments and scores determining the verdict, rather than being justified by a potentially unfaithful post-hoc reasoning trace as in conventional reasoning models. We finally show that, on two datasets for ternary claim verification, ITA improves upon argumentative baselines and can perform competitively against non-argumentative direct-prediction baselines, while providing verdicts that are computed deterministically from explicit, inspectable argumentative structures.

[AI-11] INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification IJCAI2026

Link: https://arxiv.org/abs/2605.20088
Authors: Seongjun Lee,Seokhyun Lee,Changhee Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to IJCAI 2026. 25 pages

Abstract:Discovering shapelets – i.e., discriminative temporal patterns within time series – has been widely studied to address the inherent complexity of time-series classification (TSC) and to make model decision-making processes more transparent. However, existing methods primarily focus on population-level shapelets optimized across the entire dataset, which leads to two fundamental limitations: (i) population-level patterns often misalign with instance-specific features, resulting in suboptimal performance and potentially misleading interpretations, and (ii) most methods treat shapelets as independent entities, overlooking important temporal dependencies and interactions among multiple patterns. To address these limitations, we propose INSHAPE, an interpretable TSC framework that discovers variable-length, discriminative temporal patterns specific to each time series. INSHAPE identifies these patterns as non-overlapping segments and models their temporal dependencies, thereby providing clear instance-level interpretations while achieving strong predictive performance. Furthermore, INSHAPE bridges local and global interpretability through a bottom-up approach, aggregating instance-level shapelets into prototypical (population-level) shapelets. Extensive experiments on 128 UCR and 30 UEA benchmark datasets show that INSHAPE consistently outperforms state-of-the-art shapelet-based methods while providing more intuitive and interpretable insights.

[AI-12] What Do Evolutionary Coding Agents Evolve?

Link: https://arxiv.org/abs/2605.20086
Authors: Nico Pelleriti,Sree Harsha Nelaturu,Zhanke Zhou,Zongze Li,Max Zimmer,Bo Han,Sebastian Pokutta
Affiliation: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 12 figures, 12 tables

Abstract:Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model’s internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.
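
The cycling statistic (added lines that are byte-identical to lines deleted earlier in the same run) can be approximated from a sequence of program versions with a plain line diff. The sketch below is one plausible way to count it, not the EvoReplay pipeline: the use of `difflib`, the choice to skip blank lines, and the function name are all assumptions.

```python
import difflib
from typing import Sequence


def reintroduction_rate(versions: Sequence[str]) -> float:
    """Share of added lines that exactly match a previously deleted line.

    `versions` is the ordered list of program texts along one lineage.
    Lines deleted in a step only count as "previously deleted" for
    later steps, so a same-step rewrite is not treated as a cycle.
    """
    deleted_before = set()
    added = reintroduced = 0
    for old, new in zip(versions, versions[1:]):
        a, b = old.splitlines(), new.splitlines()
        step_deleted = []
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                step_deleted.extend(a[i1:i2])
            if tag in ("replace", "insert"):
                for line in b[j1:j2]:
                    if not line.strip():
                        continue  # ignore blank lines (assumed convention)
                    added += 1
                    reintroduced += line in deleted_before
        deleted_before.update(step_deleted)
    return reintroduced / added if added else 0.0
```

For three versions with lines `a, b, c`, then `a, c`, then `a, b, c, d`, the last step adds `b` and `d`; only `b` was deleted earlier, so the rate is 0.5.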

[AI-13] Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Link: https://arxiv.org/abs/2605.20072
Authors: Oussama Zenkri,Oliver Brock
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Submitted to From Animals to Animats: The 18th International Conference on the Simulation of Adaptive Behavior (SAB)

Abstract:Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.
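
The simulation probe described above, randomly flipping the perceived outcome of an action, amounts to a small noise wrapper between the environment and the agent. The sketch below shows that wrapper only; the function name and the boolean success/failure encoding are assumptions, and the Lockbox environment itself is not modelled.

```python
import random


def perceive(outcome: bool, flip_prob: float, rng: random.Random) -> bool:
    """Return the action outcome as the agent sees it.

    With probability `flip_prob` the true success/failure signal is
    inverted before it reaches the agent; otherwise it passes unchanged.
    """
    if rng.random() < flip_prob:
        return not outcome
    return outcome
```

Setting `flip_prob=0.4` corresponds to the noise level at which the abstract reports peak performance; passing a seeded `random.Random` keeps such a probe reproducible.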

[AI-14] Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction

Link: https://arxiv.org/abs/2605.20055
Authors: Dominique Briechle,Raj Chanchad,Tobias Geger,Ruidi He,Dhruv Jajadiya,Dhruv Kapadiya,Andreas Rausch,Meng Zhang
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Explicit software architecture models are essential artifacts for communicating, analyzing, and evolving complex software-intensive systems. In ROS 2-based robotic systems, however, structural (de-)composition and integration semantics are often only implicitly encoded across distributed artifacts such as source code and launch files, making recovery of hierarchical architecture particularly difficult. Existing approaches mainly focus on node-level entities and communication wiring, while providing limited support for recovering hierarchical structural (de-)composition across multiple abstraction levels. In this paper, we extend our previously proposed blueprint-guided LLM-assisted architecture recovery pipeline for ROS 2 systems through two major enhancements: (1) refined prompting to improve the consistency and controllability of architecture synthesis, and (2) a staged recovery strategy based on multi-level intermediate architectural representations that incorporate the atomic ROS node list and launch file dependencies, thereby enabling structurally constrained reconstruction across multiple abstraction levels. The approach is evaluated on a real-world automated product disassembly system based on cooperative robotic arms and heterogeneous ROS 2 artifacts. Compared to our previous work, the considered case study exhibits substantially higher integration complexity and richer functionality. The results demonstrate improved structural consistency, scalability, and robustness of architecture recovery, while also revealing remaining challenges related to dynamic integration semantics in large-scale ROS 2 systems.

[AI-15] Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

Link: https://arxiv.org/abs/2605.20049
Authors: Priyansh Trivedi,Olivier Schmitt(SonarSource)
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or "cleanliness", of the underlying code affect an agent’s ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application’s public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent’s pass rate. However, it substantially alters the agent’s operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

[AI-16] When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System

Link: https://arxiv.org/abs/2605.20037
Authors: Deemah H. Tashman,Soumaya Cherkaoui
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards particularly when the SAC dual critics exhibit substantial disagreement (especially in high-leverage, high-uncertainty states), resulting in distorted value estimations and guiding the policy towards suboptimal actions. Our findings demonstrate that DGRP substantially diminishes the performance improvements typically provided by RIS and degrades transmission quality. We further investigate key attack parameters and determine their impact on learning. In comparison to periodic-timing and exploration-triggered baselines, DGRP consistently causes greater damage, highlighting the necessity of considering disagreement-aware threats when evaluating the robustness of Deep Reinforcement Learning (DRL) in RIS-assisted networks.

[AI-17] AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Link: https://arxiv.org/abs/2605.20025
Authors: Jiaqi Liu,Shi Qiu,Mairui Li,Bingzhou Li,Haonian Ji,Siwei Han,Xinyu Ye,Peng Xia,Zihan Dong,Congyu Zhang,Letian Zhang,Guiming Chen,Haoqin Tu,Xinyu Yang,Lu Feng,Xujiang Zhao,Haifeng Chen,Jiawei Zhou,Xiao Wang,Weitong Zhang,Hongtu Zhu,Yun Li,Jieru Mei,Hongliang Fei,Jiaheng Zhang,Linjie Li,Linjun Zhang,Yuyin Zhou,Sheng Wang,Caiming Xiong,James Zou,Zeyu Zheng,Cihang Xie,Mingyu Ding,Huaxiu Yao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from multiple perspectives, experiments fail and inform the next attempt, and lessons accumulate across cycles. Existing autonomous research systems often model this process as a linear pipeline: they rely on single-agent reasoning, stop when execution fails, and do not carry experience across runs. We present AutoResearchClaw, a multi-agent autonomous research pipeline built on five mechanisms: structured multi-agent debate for hypothesis generation and result analysis, a self-healing executor with a Pivot/Refine decision loop that transforms failures into information, verifiable result reporting that prevents fabricated numbers and hallucinated citations, human-in-the-loop collaboration with seven intervention modes spanning full autonomy to step-by-step oversight, and cross-run evolution that converts past mistakes into future safeguards. On ARC-Bench, a 25-topic experiment-stage benchmark, AutoResearchClaw outperforms AI Scientist v2 by 54.7%. A human-in-the-loop ablation across seven intervention modes reveals that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. We position AutoResearchClaw as a research amplifier that augments rather than replaces human scientific judgment. Code is available at this https URL.

[AI-18] Training Neural Networks with Optimal Double-Bayesian Learning

Link: https://arxiv.org/abs/2605.20009
Authors: Vy Bui,Hang Yu,Karthik Kantipudi,Ziv Yaniv,Stefan Jaeger
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: 13 pages, 4 figures; see also arXiv:2410.12984 [cs.LG]

Abstract:Backpropagation with gradient descent is a common optimization strategy employed by most neural network architectures in machine learning. However, finding optimal hyperparameters to guide training has proven challenging. While it is widely acknowledged that selecting appropriate parameters is crucial for avoiding overfitting and achieving unbiased outcomes, this choice remains largely based on empirical experiments and experience. This paper presents a new probabilistic framework for the learning rate, a key parameter in stochastic gradient descent. The framework develops classic Bayesian statistics into a double-Bayesian decision mechanism involving two antagonistic Bayesian processes. A theoretically optimal learning rate can be derived from these two processes and used for stochastic gradient descent. Experiments across various classification, segmentation, and detection tasks corroborate the practical significance of the theoretically derived learning rate. The paper also discusses the ramifications of the proposed double-Bayesian framework for network training and model performance.

[AI-19] GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

Link: https://arxiv.org/abs/2605.20006
Authors: Kyeongjin Ahn,Seungeon Lee,Krishna P. Gummadi,Meeyoung Cha
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 26 pages, 12 figures, 9 tables

Abstract:Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program to obtain a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated data. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.

[AI-20] LLM Benchmark Datasets Should Be Contamination-Resistant ICML2026

Link: https://arxiv.org/abs/2605.19999
Authors: Ali Al-Lawati,Jason Lucas,Dongwon Lee,Suhang Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Notes: Accepted to ICML 2026 Position Paper Track

Abstract:Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be contamination-resistant, i.e., unlearnable, but support inference. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.

[AI-21] A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

Link: https://arxiv.org/abs/2605.19988
Authors: Hongyu Lin,Mingyu Li,Weichen Zhang,Yihang Lou,Mingjie Xing,Yanjun Wu,Haibo Chen
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Databases (cs.DB); Performance (cs.PF)
Notes:

Abstract:Documentation has long guided computer system tuning by distilling expert knowledge into per-parameter recommendations. Yet such guides capture only what experts conclude, discarding how they reason. This fundamental gap manifests in three concrete deficiencies: documentation grows stale as software evolves, fails under heterogeneous workloads, and ignores inter-parameter dependencies. We propose shifting from static documentation to dynamic action for system tuning. We introduce PerfEvolve, which translates expert tuning methodologies into executable skills that equip LLM-based agents to perform version-consistency verification, workload-specific profiling, and multi-parameter joint optimization. Evaluated on PostgreSQL under TPC-C and TPC-H benchmarks, PerfEvolve outperforms state-of-the-art documentation-driven tuning baselines by up to 35.2%. The tool is available at this https URL.

[AI-22] Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction

Link: https://arxiv.org/abs/2605.19975
Authors: Xia Jiang,Yaoxin Wu,Yew-Soon Ong,Yingqian Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted by the 35th International Joint Conference on Artificial Intelligence

Abstract:Neural policies have shown promise in solving vehicle routing problems due to their reduced reliance on handcrafted heuristics. However, current training paradigms suffer from a fundamental limitation: they primarily focus on next-node prediction for solution construction, resulting in myopic decision-making that undermines long-horizon planning capacity. To this end, we introduce Multi-node Lookahead Prediction (MnLP), a novel training strategy that extends the supervised learning paradigm to predict multiple future nodes simultaneously. We incorporate causal and discardable MnLP modules that operate exclusively during training, facilitating models to anticipate multi-step decisions while preserving inference-time efficiency. By incorporating multi-depth auxiliary supervision into the loss function, MnLP equips neural policies with the ability of long-range contextual understanding. Experimentally, MnLP outperforms existing training methods, improving the generalization capability of neural policies across various problem sizes, distributions, and real-world benchmarks. Moreover, MnLP can be seamlessly integrated into diverse neural architectures without introducing additional inference overhead.

[AI-23] Block-Sphere Vector Quantization

Link: https://arxiv.org/abs/2605.19972
Authors: Heesang Ann,Joongkyu Lee,Min-hwan Oh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
Notes:

Abstract:Vector quantization is a fundamental primitive for scalable machine learning systems, enabling memory-efficient storage, fast retrieval, and compressed inference. Recent rotation-based quantizers such as EDEN, RabitQ, and TurboQuant have introduced strong guarantees and empirical performance, but the surrounding comparisons have been difficult to interpret because they rely on different distortion criteria, probability regimes, and implementation assumptions. As our first contribution, we provide a unified theoretical comparison of these methods and show that their relative advantages are criterion-dependent rather than absolute: EDEN and TurboQuant are favorable for MSE distortion, EDEN is also effective for expected inner-product distortion, and RabitQ provides strong high-probability control. This comparison further clarifies that EDEN provides particularly strong guarantees for expected distortion measures. As our second contribution, we introduce Block-Sphere Quantization (BlockQuant), a new rotation-based block quantization algorithm designed around the spherical geometry of randomly rotated vectors. Unlike coordinate-wise quantizers, BlockQuant quantizes blocks on the sphere, preserving the geometry of rotated embeddings more faithfully. We prove that this block-spherical design theoretically improves over the baselines considered in this paper for both reconstruction MSE and expected inner-product distortion. Our experiments on real embedding datasets and long-context LLM inference tasks show practical gains that are consistent with our theoretical improvements.

[AI-24] Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes ICML2026

Link: https://arxiv.org/abs/2605.19966
Authors: Mohammed Alshaalan,Miguel R. D. Rodrigues
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: Accepted at ICML 2026; 20 pages, including 9 pages main text, references, and appendix

Abstract:Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting (k = 0), CPD reaches AUROC 0.88 and F1 0.82. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
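
The one-sided CUSUM step named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cusum_alarm`, the alarm `threshold`, and the use of a plain mean and standard deviation for the baseline are all assumptions made here for illustration (the abstract only says a "robust baseline" is estimated from the system prompt). Only the recursion `s = max(0, s + z - k)` is the standard one-sided CUSUM form.

```python
from statistics import mean, stdev


def cusum_alarm(baseline, stream, k=0.0, threshold=5.0):
    """Return the index of the first CUSUM alarm in `stream`, or None.

    baseline  : entropy values used to estimate the reference level
                (assumption: plain mean/stdev, not a robust estimator)
    stream    : entropy values to monitor, in order
    k         : CUSUM slack (the abstract reports results at k = 0)
    threshold : hypothetical alarm level, chosen only for this sketch
    """
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # guard against a zero-variance baseline
    s = 0.0
    for i, x in enumerate(stream):
        z = (x - mu) / sigma          # standardize against the baseline
        s = max(0.0, s + z - k)       # one-sided CUSUM: only upward drift accumulates
        if s > threshold:
            return i                  # first position where the statistic crosses
    return None


baseline = [2.0, 2.2, 1.8, 2.1, 1.9]
print(cusum_alarm(baseline, [2.0, 2.1, 1.9, 4.0, 4.2, 4.1]))
print(cusum_alarm(baseline, [2.0, 2.1, 1.9]))
```

Because the statistic resets to zero whenever the stream falls back to the baseline, the index returned also serves as a rough estimate of where the shift begins, which is the localization property the abstract highlights.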

[AI-25] Probabilistic Tiny Recursive Model

Link: https://arxiv.org/abs/2605.19943
Authors: Amin Sghaier,Ali Parviz,Alexia Jolicoeur-Martineau
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and final answer. While powerful, their deterministic recursion can lead to convergence at suboptimal solutions, without an escape mechanism. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head (used for early stopping in the original TRM). Without requiring retraining or task-specific augmentations, PTRM enables substantial accuracy gains across benchmarks, including Sudoku-Extreme (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
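
The mechanism described here (perturb the state at every recursion step, run several trajectories, keep the one a scoring head prefers) can be sketched on a toy scalar problem. Everything below is hypothetical: `noisy_best_of_n`, `refine`, and `score` are stand-ins invented for illustration, with `refine` playing the role of one recursion step and `score` the role of the Q head. The actual model operates on learned latent states, not scalars.

```python
import random


def noisy_best_of_n(refine, score, state, steps=8, n=16, sigma=0.1, seed=0):
    """Run n noisy refinement trajectories and return (best_state, best_score).

    refine : hypothetical stand-in for one recursion step
    score  : hypothetical stand-in for the selection head (higher is better)
    sigma  : standard deviation of the Gaussian noise added before each step
    """
    rng = random.Random(seed)
    best_state, best_score = None, float("-inf")
    for _ in range(n):
        s = state
        for _ in range(steps):
            # Perturb the current state, then apply one refinement step.
            s = refine(s + rng.gauss(0.0, sigma))
        sc = score(s)
        if sc > best_score:
            best_state, best_score = s, sc
    return best_state, best_score


# Toy usage: halve the state each step; prefer final states close to zero.
state, value = noisy_best_of_n(lambda x: x / 2, lambda x: -abs(x), 8.0)
print(state, value)
```

With `sigma=0.0` every trajectory collapses to the same deterministic recursion, which corresponds to the baseline behaviour the abstract says can get stuck.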

[AI-26] Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Link: https://arxiv.org/abs/2605.19940
Authors: Rebecca Ramnauth,Drazen Brscic,Brian Scassellati
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes: Under review at Journal of Artificial Intelligence Research (JAIR)

Abstract:Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches – ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation – primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

[AI-27] Real-Time Parallel Counterfactual Regret Minimization

Link: https://arxiv.org/abs/2605.19928
Authors: Boning Li,Longbo Huang
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: 13 pages, 3 figures

Abstract:Counterfactual Regret Minimization (CFR) is the dominant algorithmic family for solving large imperfect-information games, underpinning breakthroughs such as Libratus and Pluribus in No-Limit Texas Hold'em poker. In real-time game-playing systems, the solver must compute a near-equilibrium strategy within a strict time budget of only a few seconds per decision, and the number of CFR iterations completed in this window directly determines play strength. We present Parallel CFR, the first parallelization framework for real-time depth-limited CFR solving that seamlessly integrates pruning, abstraction, and advanced CFR variants. We decompose each CFR iteration into a pipeline of seven stages and identify two orthogonal dimensions of parallelism: by information set and by tree node. Leaf node evaluation is offloaded to GPUs via batched neural network inference, creating a heterogeneous CPU–GPU pipeline. Experiments on Heads-Up No-Limit Texas Hold'em demonstrate that Parallel CFR achieves a 3.3x to 3.4x speedup over the single-threaded baseline on postflop streets, with a per-iteration time of about 47 to 54 ms on a depth-limited game tree with over 1 billion histories. All experiments run on a single desktop-class device (NVIDIA DGX Spark), enabling hundreds of CFR iterations within a typical real-time decision budget without requiring datacenter-scale infrastructure.
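
As background for readers new to CFR, the per-information-set strategy update in vanilla CFR is regret matching, sketched below. This is the textbook rule, not the paper's seven-stage pipeline, and the function name `regret_matching` is chosen here for illustration. Each information set's update depends only on that set's own cumulative regrets, which is one reason parallelizing by information set is natural.

```python
def regret_matching(cumulative_regrets):
    """Map cumulative regrets for one information set to a strategy.

    Actions get probability proportional to their positive regret;
    if no action has positive regret, fall back to a uniform strategy.
    """
    positives = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    n = len(cumulative_regrets)
    return [1.0 / n] * n


print(regret_matching([2.0, -1.0, 2.0]))
print(regret_matching([-1.0, -2.0]))
```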

[AI-28] Fast and Featureless Node Representation Learning with Partial Pairwise Supervision

Link: https://arxiv.org/abs/2605.19916
Authors: Sujan Chakraborty,Saptarshi Bej
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation, which preserves the structure-seeking behavior of modularity while reducing the computational cost significantly. This yields an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast iterative updates even on million-edge graphs. Extensive experiments on benchmark citation networks, large co-purchase graphs, and OGB datasets show that Contrastive FUSE achieves competitive or superior contrastive classification performance without relying on node features, while offering substantial runtime gains over existing baselines. These results highlight the effectiveness of coupling modularity-inspired structural learning with contrastive supervision for efficient and scalable contrastive node representation learning.

[AI-29] Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

Link: https://arxiv.org/abs/2605.19895
Authors: Patrick Spracklen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Constraint programming practitioners accelerate hard problems through a layered set of techniques applied in order of risk. Standard hardening (symmetry-breaking and implied constraints) is applied first and preserves satisfiability. Streamliner constraints, which restrict search to a structural sub-family of solutions, do not preserve satisfiability and are reserved as a final lever. Existing automated streamliner-synthesis approaches either search a constraint grammar or prompt a Large Language Model directly on the problem model. We propose a different approach: enumerate feasible solutions, train a Convolutional Neural Network contrastively against perturbed non-solutions to detect structural patterns, and translate the CNN’s discriminative signal into candidate MiniZinc streamliners through LLM-driven synthesis. The CNN grounds the LLM’s constraint generation in observed solution structure rather than model text alone. We evaluate on hardened benchmark models where streamliner discovery is the residual performance lever. Our pipeline achieves 98.8% portfolio time reduction on hardened Vessel Loading, 98.6% on hardened Social Golfers, and 89.4% on Black Hole, with best-single streamliners reaching geometric-mean speedups of 932x, 356x, and 1103x respectively. Discovered streamliners include class-based packing constraints on Vessel Loading, beyond-hardening canonicalisations on Social Golfers, and layout-coordinate bounds on Black Hole.

[AI-30] Deep Tech to Space: Space Data Centers and AI Revolution at the Edge

Link: https://arxiv.org/abs/2605.19892
Authors: Jonas Weiss,Patricia Sagmeister,Gabriel Maiolini Capez,Dinesh Verma,Roberto Garello,Alberto Perotti,Dawid Lazaj,Alicja Musial,Jakub Nalepa,Thomas Morf,Martin Schmatz,Marek Krawczyk,Mateusz Przeliorz,Kevin Roche,Sagar Tayal,Mahalakshmi Lakshminarayanan,Nicolas Longépé,Pierre-Philippe Mathieu,Agata Wijata
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Notes: 7 pages, 4 figures, 2 tables

Abstract:Dramatic cost reductions driven by private sector innovations have led to a rapid increase in the number of satellites in orbit and a corresponding surge in space-generated data. As this trend continues, transmitting large volumes of data to Earth for processing may become increasingly costly and challenging due to potential space-to-Earth link congestion and increased latency. Moreover, traditional ground station networks may face difficulties accommodating growing data flows and workloads because of capacity constraints, complex scheduling logistics, and restricted visibility windows, which can limit scalability. Space Data Centers (SDCs) – software-driven, multi-tenant artificial intelligence-based service platforms capable of processing data in orbit to generate actionable insights for client satellites and ground users – represent a promising approach to address these challenges. This article presents the architecture of a Low Earth Orbit SDC satellite constellation, considering orbital design, inter-satellite links and network topology, computational resource organization, and software service orchestration. We analyze the potential technical feasibility and economic viability of SDCs using forecasting models informed by technology roadmaps and illustrate the concept through Earth observation and lunar exploration use cases.

[AI-31] StableGrad: Backward Scale Control without Batch Normalization

Link: https://arxiv.org/abs/2605.19856
Authors: Jose I. Mestre,Alberto Fernández-Hernández,Cristian Pérez-Corral,Manuel F. Dolz,Enrique S. Quintana-Ortí
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.

[AI-32] A Closed-loop State-centric Multi-agent Framework for Passenger Load Estimation from Heterogeneous Data Streams ITSC

Link: https://arxiv.org/abs/2605.19834
Authors: Yiyao Xu,Hao Zhou,Yuhang Wang,Jingran Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Notes: Preprint version of a paper accepted by the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC). 7 pages, 4 figures

Abstract:To support operations and passenger-facing services, transit agencies need reliable passenger load trajectories. Currently, load estimates are typically inferred from imperfect sensing systems rather than fully observed, and the accuracy of modern automatic passenger counting (APC) systems still varies with station layout, flow intensity, and operating conditions. To address the challenges of robust passenger load estimation from heterogeneous data streams, including incremental count errors, evidence conflicts, and context-dependent sensor reliability, we propose a closed-loop, state-centric, multi-agent framework. This method enforces physical feasibility at every step, allocates trust dynamically among evidence sources, and feeds physics-derived violation residuals back into training for robustness improvement. The architecture consists of a unified stop-event backbone, a coupled Perception–Physical–Fusion loop for stop-by-stop inference, and optional trip-level macro-correction and closed-loop calibration modules.

[AI-33] Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

Link: https://arxiv.org/abs/2605.19826
Authors: Gary Simethy,Daniel Ortiz Arroyo,Petar Durdevic
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 17 pages, 7 figures, 6 tables, 2 algorithms. Supplementary material (7 pages) included as ancillary file

Abstract:Operators of safety-critical industrial processes increasingly rely on digital twins to screen control interventions, but such simulators rarely carry certified safety guarantees. Wastewater treatment plants exemplify the gap: operators face a daily safety-efficiency trade-off where aerating too little risks effluent violations and nitrous-oxide (N2O) spikes, and aerating too much wastes energy. We develop an explainable digital twin for aeration and dosing setpoints. CCSS-IX, the simulator, is a bank of interpretable locally linear state-space “experts” adaptively mixed by a context-aware gating network, building on a continuous-time regime-switching scaffold. A runtime decision layer applies conformal risk control to abstain, reopen, or return a falsifying temporal witness for any operator-proposed action that cannot be statistically certified. The artificial-intelligence contribution is twofold: an identifiable, context-conditioned structured surrogate that retains operator-readable dynamics, and a self-falsifying decision rule with finite-sample coverage guarantees. The engineering contribution is a validated, end-to-end decision-support pipeline, tested on a 1000-step slice of the Avedøre full-scale plant (42.6% sensor missingness, 2-minute sampling), the Agtrup/BlueKolding full-scale plant in Denmark, and the Benchmark Simulation Model No. 2 (BSM2) international benchmark, under a matched ten-seed protocol. The static structured ensemble lies within 0.78% root-mean-square error of an unconstrained black-box reference, and the adaptive variant within 1.08%. The calibrated reopen rule cuts aggregate two-plant regret by 43.6% at an unsafe-action cost weight of 4 and eliminates unsafe chosen actions on the BSM2 main slice. Event-aligned temporal witnesses prevent 93 of 187 false-safe N2O approvals, about 4.65x the dyadic baseline (paired McNemar p 1e-21).

[AI-34] Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions

Link: https://arxiv.org/abs/2605.19823
Authors: Ha Dang,Sebastian Schmidt,Juergen Hesser
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP); Dynamical Systems (math.DS); Machine Learning (stat.ML)
Notes:

Abstract:Neural operators have achieved strong performance in learning solution operators of partial differential equations (PDEs), but their inherently continuous representations struggle to capture discontinuities and sharp transitions. Existing approaches typically approximate such features within continuous function spaces, often requiring increased model capacity and high-resolution data. In this work, we propose Cut-DeepONet, a two-stage training framework that explicitly models discontinuities while reducing learning complexity. Our approach reformulates the problem via a lifting strategy, partitioning the domain into smooth subregions while representing discontinuities as boundaries in a higher-dimensional space. This separation aligns the operator learning task with the inductive bias of neural networks and avoids directly approximating discontinuities. An additional network predicts input-dependent discontinuity locations for unseen inputs, which are then used to guide the neural operator in generating smooth components within each region. Experiments on benchmark PDEs show that Cut-DeepONet outperforms state-of-the-art methods, even when trained on low-resolution datasets. The method excels on problems with discontinuities and sharp transitions, while using fewer trainable parameters. Our results highlight the benefits of changing the representation of operator learning rather than increasing model complexity.

[AI-35] ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability

Link: https://arxiv.org/abs/2605.19822
Authors: Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Temporal graph neural networks (TGNNs) have gained significant traction for solving real-world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first-time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST-TGExplainer, a self-explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST-TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label-conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST-TGExplainer achieves strong predictive performance and yields more faithful explanations. Code is available at this https URL.

[AI-36] FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes

Link: https://arxiv.org/abs/2605.19812
Authors: Anya Fries, Jacob A Nelson, Martin Jung, Markus Reichstein, Jonas Peters
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Comments:

Abstract:We introduce FLUXtrapolation, a benchmark for extrapolating ecosystem fluxes under progressively harder distribution shifts. Ecosystem fluxes are central to understanding the carbon, water, and energy cycles, yet they can only be measured directly at sparsely located measurement towers. Producing global flux estimates therefore requires training models on observed sites using globally available covariates and predicting in unobserved regions, that is, upscaling. Flux upscaling is a challenging domain generalization problem that is affected by a shift in covariate distribution across climates, ecosystem types, and environmental conditions, as well as by conditional shift: important drivers remain unobserved at global scale. We provide a quantitative analysis of both these shifts in $P_X$ and $P_{Y \mid X}$. FLUXtrapolation is designed based on domain expertise on flux upscaling: it defines temporal, spatial, and temperature-based extrapolation scenarios and evaluates performance across held-out domains, temporal aggregations, and tail errors. In a pilot study, we find that baselines perform similarly under median hourly RMSE, but separate under the proposed tail-focused and multi-scale evaluation. FLUXtrapolation therefore poses a realistic and thus relevant challenge for machine learning methods under distribution shift; at the same time, progress on this benchmark would directly support the scientific goal of improving flux upscaling.

[AI-37] Latent Laplace Diffusion for Irregular Multivariate Time Series ICML2026

Link: https://arxiv.org/abs/2605.19805
Authors: Zinuo You, Jin Zheng, John Cartlidge
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Camera-ready Spotlight paper at ICML 2026. 27 pages, 5 figures. Code: this https URL

Abstract:Irregular multivariate time series impose a trade-off for long-horizon forecasting: discrete methods can distort temporal structure via re-gridding, while continuous-time models often require sequential solvers prone to drift. To bridge this gap, we present Latent Laplace Diffusion (LLapDiff), a generative framework that models the target as a low-dimensional latent trajectory, enabling horizon-wide generation without step-by-step integration over physical time. We guide the reverse process utilizing a stable modal parameterization motivated by stochastic port-Hamiltonian dynamics, and parameterize its mean evolution in the Laplace domain via learnable complex-conjugate poles, enabling direct evaluation over irregular timestamps. We also link continuous dynamics to irregular observations through renewal-averaging analysis, which maps sampling gaps to effective event-domain poles and motivates a gap-aware history summarizer. Extensive experiments show that LLapDiff improves over baselines in long-horizon forecasting, and its continuous-time generative nature supports missing-value imputation by querying the same model at historical timestamps. Code is available at this https URL.

[AI-38] Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

Link: https://arxiv.org/abs/2605.19782
Authors: Dmitry Redko(1), Albert Fazlyev(2), Konstantin Sozykin(1), Maria Ivanova(3 and 1), Evgeny Burnaev(1), Egor Shvetsov(1) ((1) Applied AI Institute, (2) AI Talent Hub, ITMO University, (3) YSDA)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on feedback received from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect: models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with low-density language. Our results indicate that LLMs in code optimization tasks depend heavily on pretrained priors rather than on provided feedback or agentic structure.

[AI-39] From SGD to Muon: Adaptive Optimization via Schatten-p Norms

Link: https://arxiv.org/abs/2605.19781
Authors: Thomas Massena(IRIT, DTIPG - SNCF, UT3), Corentin Friedrich, Mathieu Serrurier(IRIT)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by design or empirically, which are not necessarily optimal according to the problem’s geometry. We introduce a novel, efficient data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim 3\%$ runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the better-performing optimizer between Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.
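
Editor's illustration (not taken from the paper): the abstract does not give the authors' parameterization, but the standard linear minimization oracle over a Schatten-p unit ball already shows how a single family can interpolate between an SGD-like and a Muon-like update. The symbols $G$, $U$, $s$, $V$, $p$, $q$ below are notation assumed for this sketch.

```latex
% G = U \operatorname{diag}(s) V^\top is the thin SVD of a layer's gradient,
% and q is the Hoelder conjugate of p.
\[
\operatorname{LMO}_p(G)
  \;=\; \arg\min_{\|X\|_{S_p}\le 1} \langle G, X\rangle
  \;=\; -\,U \operatorname{diag}\!\left(\frac{s^{\,q-1}}{\|s\|_q^{\,q-1}}\right) V^\top,
\qquad \frac{1}{p}+\frac{1}{q}=1 .
\]
% p = 2      (q = 2):  LMO_2(G)      = -G / \|G\|_F   (normalized SGD direction)
% p = \infty (q = 1):  LMO_\infty(G) = -U V^\top      (orthogonalized, Muon-style direction)
```

Intermediate values of $p$ flatten the singular-value spectrum of the update progressively, which is one way to read "a design space interpolating from SGD to Muon updates".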

[AI-40] Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation ICML2026

Link: https://arxiv.org/abs/2605.19779
Authors: Yuxuan Gao, Megan Wang, Yi Ling Yu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 6 pages, 7 figures, 2 tables. Accepted at the ICML 2026 Workshop on Agentic Uncertainty Quantification (AgenticUQ) - Poster

Abstract:We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases, then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations $\rho \in [-0.5, 0.9]$), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross-source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity-controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.
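
Editor's illustration (not the authors' released code): a minimal sketch of generic split conformal prediction with absolute-residual scores, the textbook construction the abstract builds on. The function name, the toy calibration scores, and the choice of score function are assumptions made for this example.

```python
import math


def split_conformal_interval(cal_preds, cal_truths, new_pred, alpha=0.1):
    """Return (lo, hi) around new_pred with nominal coverage 1 - alpha.

    Uses absolute residuals on a held-out calibration set as
    nonconformity scores (an assumption of this sketch).
    """
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_truths))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this level: interval is unbounded.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (new_pred - q, new_pred + q)


# Hypothetical forecasted vs. observed quality scores for one agent.
cal_preds = [50, 62, 71, 40, 55, 80, 65, 58, 47, 69]
cal_truths = [55, 60, 65, 45, 50, 90, 66, 61, 44, 73]
lo, hi = split_conformal_interval(cal_preds, cal_truths, 60, alpha=0.2)
print(lo, hi)
```

With only ten calibration points, a 95% level cannot be certified and the sketch returns an unbounded interval, which is the usual finite-sample behaviour of this construction.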

[AI-41] OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Link: https://arxiv.org/abs/2605.19769
Authors: Jinbiao Wei, Qianran Ma, Yilun Zhao, Xiao Zhou, Kangqi Ni, Guo Gan, Arman Cohan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer’s hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.

[AI-42] Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

Link: https://arxiv.org/abs/2605.19768
Authors: Pierre Boudart(SIERRA), Pierre Gaillard(Thoth), Alessandro Rudi(PSL, DI-ENS, Inria)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $\tilde{O}(dH^2\sqrt{T})$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $\bar{\sigma}_T \leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner’s trajectory. We propose an algorithm achieving a regret of $\tilde{O}(dH^2\bar{\sigma}_T\sqrt{T})$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $\bar{\sigma}_T = O(H^{-1})$, reducing the horizon dependence by a factor $H$. We further establish a matching $\Omega(dH^2\bar{\sigma}_T\sqrt{T})$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

[AI-43] AR1-ZO: Topology-Aware Rank-1 Zeroth-Order Queries for High-Rank LoRA Fine-Tuning

Link: https://arxiv.org/abs/2605.19767
Authors: Ziye Chen, Hongbin Lin, Chenyu Zhang, Xiangda Yan, Yongjie Yang, Yao Shu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Zeroth-order (ZO) optimization enables large-language-model fine-tuning without storing backpropagation activations, while LoRA supplies compact trainable adapters. Combining them creates a rank paradox: increasing LoRA rank improves adapter capacity, but standard two-point ZO either perturbs a rank-dependent number of coordinates or, under atomwise updates, can make the finite-difference signal unobservable. This paper shows that the bottleneck is a measurement-topology problem rather than a need for an external subspace. LoRA already decomposes into matched rank-1 atoms, each a complete factor-coordinate block of dimension $d_{\text{out}}+d_{\text{in}}$. Querying one atom per step keeps the stored adapter rank $r$ while removing $r$ from the single-query perturbation dimension. The naive atomwise query is still miscalibrated: if it inherits canonical LoRA scaling $\alpha/r$, the active finite-difference signal shrinks as $1/r$ and the active finite-difference signal-to-noise ratio (FD-SNR) as $1/r^2$, producing directional collapse under a fixed residual evaluation-noise floor. AR1-ZO pairs alternating rank-1 atom queries with topology-aware scaling $\gamma=\alpha r$, restoring rank-invariant active signal without auxiliary bases, activation hooks, curvature estimates, or extra forward queries. Theory proves atom minimality, rank-independent active query dimension, directional collapse and restoration, and the remaining rank dependence as an amortized coverage cost. Experiments on OPT and Qwen3 models validate the signal mechanism and show that AR1-ZO makes high-rank LoRA effective among matched-budget ZO methods under the standard two-forward-pass query budget.

[AI-44] GroupAffect-4: A Multimodal Dataset of Four-Person Collaborative Interaction

Link: https://arxiv.org/abs/2605.19765
Authors: Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Shan Ahmed Shaffi, Fabricio Batista Narcizo, Tanya Ignatenko, Ted Vucurevich, Karim Haddad, Daniel Barratt, Daniel Overholt, Jesper Bunsow Boldt, Paolo Burelli, Andrew Burke Dittberner
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Abstract:Existing affective-computing, social-signal-processing, and meeting corpora capture important parts of human interaction, but they rarely support analysis of affect in co-located groups as a coupled individual, interpersonal, and group-level process. The required signals (per-participant physiology, eye movement, audio, self-report, task outcomes, and personality) are usually fragmented across separate dataset traditions. We introduce GroupAffect-4, a multimodal corpus of 40 participants in 10 four-person groups, each completing four ecologically varied collaborative tasks spanning information pooling, negotiation, idea generation, and a public-goods game. Each participant is instrumented with a wrist-worn physiology sensor, eye-tracking glasses, and a close-talk microphone; sessions include continuous affect self-reports, post-task questionnaires, task outcomes, and Big-Five personality scores, all time-aligned to a shared clock. The dataset covers over 91% of expected physiology windows and 98% of eye-tracking windows, with strong task validity confirmed by a clear affective manipulation check across the negotiation block. We define fifteen benchmarkable targets spanning three analysis levels – within-person state, between-person traits, and group dynamics – and report leave-one-group-out feasibility baselines establishing the dataset’s evaluative scope. GroupAffect-4 is released with a BIDS-inspired structure, Croissant metadata, a datasheet, per-session quality reports, and open processing scripts. Code and processing scripts are available at this https URL; the dataset is publicly archived at this https URL.

[AI-45] CogScale: Scalable Benchmark for Sequence Processing

Link: https://arxiv.org/abs/2605.19758
Authors: Yannis Bendi-Ouis(Mnemosyne), Romain de Coudenhove(ENS-PSL), Xavier Hinaut(Mnemosyne)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (stat.ML)
Comments:

Abstract:The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.

[AI-46] Measuring Safety Alignment Effects in Autonomous Security Agents

Link: https://arxiv.org/abs/2605.19722
Authors: Isaac David, Arthur Gervais
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.

[AI-47] Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

Link: https://arxiv.org/abs/2605.19721
Authors: Franco Terranova(UL, LORIA, Inria), Guillermo Bernardez(UC Santa Barbara), Albert Cabellos-Aparicio(UPC), Nina Miolane(UC Santa Barbara), Abdelkader Lahmadi(LORIA, UL, Inria)
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Preprint

Abstract:Graph combinatorial optimization (GCO) has attracted growing interest, as many NP-hard problems naturally admit graph formulations, yet their combinatorial explosion renders exact methods computationally intractable. Recent advances in Reinforcement Learning (RL) combined with Graph Neural Networks (GNNs) have significantly improved learning-based GCO solvers. However, existing approaches face limitations in both generalization across diverse graph instances and computational scalability as action spaces grow. To address both challenges, we introduce projection agents, a novel RL-GCO approach that operates directly in a continuous GNN-based action embedding space, predicting a desired latent action in a single forward pass and subsequently decoding it into a valid discrete action. Additionally, we enable fair comparison across RL methods through a shared embedding space for both observations and actions. Across diverse benchmarks, our approach achieves up to 16.2x faster inference and up to 40% better generalization than existing solutions using only simple nearest-neighbor decoding, while opening the door to strong RL performance in super-linear decision spaces with multiple interdependent variables. Finally, we release LaGCO-RL, a Python library that automates latent action-space construction and supports existing RL-GCO solutions, promoting reproducibility and adaptation to new GCO benchmarks.
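
Editor's illustration (not the LaGCO-RL API): the abstract describes predicting a latent action and then decoding it with "simple nearest-neighbor decoding". A minimal sketch of that decoding step follows; the function name, the dictionary-of-embeddings layout, and the Euclidean metric are all assumptions of this example.

```python
import math


def decode_latent_action(latent, action_embeddings, valid_ids):
    """Return the id of the valid action whose embedding is closest to `latent`.

    `action_embeddings` maps action id -> embedding vector (hypothetical layout).
    Only ids in `valid_ids` are considered, so the result is always feasible.
    """
    best_id, best_dist = None, math.inf
    for action_id in valid_ids:
        dist = math.dist(latent, action_embeddings[action_id])
        if dist < best_dist:
            best_id, best_dist = action_id, dist
    return best_id


# Toy 2-D embedding space with four discrete actions.
embeddings = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 1.0)}
predicted = (0.9, 0.8)  # latent action produced by a single forward pass

print(decode_latent_action(predicted, embeddings, valid_ids=[0, 1, 2, 3]))
print(decode_latent_action(predicted, embeddings, valid_ids=[0, 1, 2]))
```

The policy's output size is fixed by the embedding dimension rather than by the number of actions, while decoding is a search over the currently valid candidates; this sketch scans them linearly, and a spatial index could replace the loop for large action sets.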

[AI-48] Beyond Rational Illusion: Behaviorally Realistic Strategic Classification ICML2026

Link: https://arxiv.org/abs/2605.19674
Authors: Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Yang Shi, Jinxuan Yang, Zhouchen Lin, Yuanlong Chen, Yuanxing Zhang, Shaowu Yang, Wenjing Yang, Haotian Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2026

Abstract:Strategic classification (SC) studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that real-world decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents’ strategic manipulations deviate from full rationality due to psychological biases. To address this problem, we propose the Prospect-Guided Strategic Framework (Pro-SF), a principled framework grounded in prospect theory to model and learn under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory: the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.
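
Editor's illustration (not the Pro-SF implementation): the three mechanisms named in the abstract correspond to standard prospect-theory ingredients. The sketch below uses the Tversky-Kahneman (1992) functional forms and their published parameter estimates; the decision rule `manipulates` and its arguments are hypothetical and serve only to show how a prospect-theoretic agent can differ from a rational one.

```python
def pt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Prospect-theory value of outcome x relative to the reference point.

    Gains are concave; losses are convex and amplified by loss aversion lam.
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)


def pt_weight(p, gamma=0.61):
    """Probability weighting: overweights small p, underweights large p."""
    num = p ** gamma
    return num / ((num + (1 - p) ** gamma) ** (1 / gamma))


def manipulates(benefit, cost, p_success, reference=0.0):
    """Hypothetical rule: manipulate if the distorted gain outweighs the felt cost."""
    gain = pt_weight(p_success) * pt_value(benefit - reference)
    loss = pt_value(-cost)  # the manipulation cost is experienced as a loss
    return gain + loss > 0


# A rational agent would manipulate here (benefit 1.0 > cost 0.5),
# but a loss-averse agent under these parameters does not.
print(manipulates(benefit=1.0, cost=0.5, p_success=1.0))
print(manipulates(benefit=1.0, cost=0.1, p_success=1.0))
```

Raising `reference` models an agent who already expects part of the benefit, which shrinks the perceived gain and makes manipulation less attractive in this toy rule.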

[AI-49] Transforming Constraint Programs to Input for Local Search

Link: https://arxiv.org/abs/2605.19671
Authors: Jo Devriendt, Patrick De Causmaecker, Marc Denecker
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Unpublished paper accepted and presented at the Fourteenth International Workshop on Constraint Modelling and Reformulation (ModRef) in 2015

Abstract:Applying local search algorithms to combinatorial optimization problems is not an easy feat. Typically, human intervention is required to compile the constraints to input data for some metaheuristic algorithm. In this paper, we establish a link between symmetry properties of constraint optimization problems and local search neighborhoods, and we use this link to automatically generate neighborhoods from a constraint specification in the context of the IDP system. We evaluate the obtained neighborhoods for six classical optimization problems. The resulting observations support the viability of this technique.

[AI-50] CriterAlign: Criterion-Centric Rationale Alignment for Code Preference Judging

Link: https://arxiv.org/abs/2605.19665
Authors: Zhenyu Li, Aleksandar Cvejic, Zehui Chen, Peter Wonka
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pairwise human preference prediction is central to evaluating code-generation systems, where quality often depends on task-specific trade-offs beyond functional correctness. While rubric-based LLM judges improve interpretability by decomposing evaluation into explicit criteria, most existing pipelines remain pointwise: they score each response independently and derive preferences by comparing aggregated scores. We show that this design is poorly matched to pairwise code preference prediction and can underperform a strong monolithic judge. We propose CriterAlign, a criterion-centric framework that adapts rubric-based judging to pairwise preference evaluation through direct criterion-level pairwise judgments, tie-driven criterion refinement, swap-consistency filtering, and final pairwise synthesis. We further introduce Human-Preference-Aligned Guidance (HPAG), synthesized offline from training examples by extracting recurring rationale gaps between human preferences and monolithic judge predictions, and injected into the criterion generator, criterion judge, and final judge. On BigCodeReward, CriterAlign improves a Qwen2.5-VL-32B monolithic judge from 60.4% to 66.3% accuracy, with ablations confirming the contributions of pairwise criterion design and HPAG.

[AI-51] Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

Link: https://arxiv.org/abs/2605.19663
Authors: Weicong Ni, Tianbao Jiang, Linlin Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) are becoming the cornerstone of high-level reasoning for robotic automation, enabling robots to parse natural language commands and perceive their environments. However, their susceptibility to hallucinations introduces critical failures in decision-making, posing significant safety and reliability risks in physical deployments. This challenge is exacerbated by the open-ended nature of real-world tasks, where questions vary vastly in difficulty and modality, demanding robust and adaptable reasoning strategies. To tackle this, we propose the Pseudocode-guided Structured Reasoning framework (PStar), which adaptively selects structured pseudocode reasoning paths to help VLMs perform flexible and step-by-step reasoning. We first design a set of abstract reasoning functions and formulate a structured pseudocode library to represent modular reasoning strategies. Crucially, we design a Difficulty Feature Vector (DFV) that allows the model to assess question complexity and adaptively choose appropriate reasoning strategies, enhancing robustness and interpretability. Extensive experiments demonstrate that PStar significantly reduces hallucination rates, achieving state-of-the-art scores of 87.1% on POPE and 68.0% on MMStar, outperforming even GPT-4V. By providing a validated mechanism to reduce visual-language errors, PStar offers a critical step toward deploying more trustworthy and deterministic VLMs for real-world automated systems, where such errors can lead to catastrophic outcomes.

[AI-52] When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach ICML2026

Link: https://arxiv.org/abs/2605.19662
Authors: Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Jinxuan Yang, Kun Kuang, Yuanlong Chen, Mingyang Geng, Wanrong Huang, Shixuan Liu, Shaowu Yang, Wenjing Yang, Zhouchen Lin, Haotian Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ICML2026

Abstract:Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers. In many real-world decision scenarios, however, individuals may strategically modify their features after deployment to obtain favorable outcomes, inducing a post-deployment distribution shift. This paper studies whether PFN-style tabular foundation models can generalize to such strategic tabular data. We show that strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias. To address this issue, we propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining. SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution. Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

[AI-53] EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection CVPR

Link: https://arxiv.org/abs/2605.19630
Authors: Aritra Marik, Marcel Klemt, Anna Rohrbach
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at SAFE@CVPRW 2026

Abstract:With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in Emo-Boost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

[AI-54] MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

Link: https://arxiv.org/abs/2605.19619
Authors: Feihu Huang, Yuning Luo, Songcan Chen
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes: 25 pages

Abstract:Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer was designed for the matrix parameters of large-scale models, and it shows markedly faster convergence than vector-wise algorithms. Although some works have begun to study the convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) are still not established. Thus, in this paper, we study the generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that Muon has a generalization error of O\big(\frac{1}{N\kappa^{T}}\big) , where N is the training sample size, T denotes the iteration number, and \kappa > 0 denotes the minimum difference between singular values of the gradient estimate. To enhance the generalization of Muon, we propose an effective mixed Muon (MiMuon) optimizer that cautiously uses orthogonalization of the gradient and is a hybrid of the Muon and momentum-based SGD optimizers. We then prove that our MiMuon optimizer has a generalization error of O\big(\frac{1}{N}\big) , which is lower than the O\big(\frac{1}{N\kappa^{T}}\big) of the Muon optimizer, since \kappa is generally very small. Meanwhile, we also study the convergence properties of our MiMuon algorithm, and prove that it has the same convergence rate of O\big(\frac{1}{T^{1/4}}\big) as the Muon algorithm. Numerical experiments on training large models, including Qwen3-0.6B and YOLO26m, demonstrate the efficiency of the MiMuon optimizer.
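
To get a feel for why a small \kappa matters in the two bounds quoted above, the following minimal sketch simply evaluates 1/(N \kappa^T) and 1/N with all hidden constants set to 1. The function names and the sample values are illustrative assumptions, not taken from the paper.

```python
def muon_bound(n_samples: int, kappa: float, n_iters: int) -> float:
    """Evaluate 1 / (N * kappa^T), ignoring the constants hidden by O(.)."""
    return 1.0 / (n_samples * kappa ** n_iters)


def mimuon_bound(n_samples: int) -> float:
    """Evaluate 1 / N, ignoring the constants hidden by O(.)."""
    return 1.0 / n_samples


if __name__ == "__main__":
    # Hypothetical values chosen only to show the growth with T when kappa < 1.
    n, kappa = 1000, 0.5
    for t in (1, 10, 20):
        print(t, muon_bound(n, kappa, t), mimuon_bound(n))
```

With \kappa below 1, the first expression grows exponentially in T while the second stays fixed, which is the gap the abstract refers to.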

[AI-55] Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Link: https://arxiv.org/abs/2605.19604
Authors: Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

[AI-56] Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Link: https://arxiv.org/abs/2605.19593
Authors: Mert Yildiz, Pietro Spadaccino, Alexey Rolich, Francesca Cuomo, Andrea Baiocchi
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Notes: The 2026 Mediterranean Artificial Intelligence and Networking Conference (MAIN 2026)

Abstract:Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.

[AI-57] Implicit Action Chunking for Smooth Continuous Control

Link: https://arxiv.org/abs/2605.19592
Authors: Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes:

Abstract:Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.
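
The abstract mentions an actor-side temporal regularizer based on first-order action differences. One plausible form, sketched here under the assumption that it is a mean squared difference between consecutive actions, is shown below; the function name and the exact penalty are hypothetical and may differ from what DWS uses.

```python
from typing import Sequence


def action_smoothness_penalty(actions: Sequence[Sequence[float]]) -> float:
    """Mean squared first-order difference between consecutive actions.

    `actions` is a trajectory of action vectors; a jittery trajectory
    yields a larger penalty than a smooth one.
    """
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


if __name__ == "__main__":
    smooth = [[0.0], [0.1], [0.2]]
    jittery = [[0.0], [1.0], [0.0]]
    print(action_smoothness_penalty(smooth), action_smoothness_penalty(jittery))
```

In an actor loss, a term like this would typically be added with a small weight so that it discourages oscillation without overriding the task reward.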

[AI-58] SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Link: https://arxiv.org/abs/2605.19587
Authors: Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner–designer–critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: this https URL.

[AI-59] TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

Link: https://arxiv.org/abs/2605.19561
Authors: Zukang Xu, Xing Hu, Dawei Yang
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes: 17 pages, 4 figures, 13 tables

Abstract:As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to reshape the geometric properties of the activation space through optimal coordinate transformations. At the macroscopic level, TORQ leverages the Schur-Horn theorem to redistribute activation energy via inter-block orthogonal rotation, preventing high-variance blocks from driving up shared scaling factors and thereby preserving the precision of small-magnitude elements. At the microscopic level, TORQ employs maximum-entropy-guided intra-block rotation to alleviate codebook collapse and maximize the MXFP4 codebook’s information capacity. Experiments on mainstream LLMs such as LLaMA3 and Qwen3 show that TORQ significantly improves the accuracy of MXFP4 activation quantization compared to existing methods: on Qwen3-32B, the perplexity on WikiText is reduced to 8.43 (vs. 7.61 for BF16), and the average accuracy increases from 38.40% with direct RTN to 73.63% (vs. 74.82% for BF16), substantially narrowing the gap between 4-bit floating-point quantization and full-precision inference.
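
The inter-block imbalance described above can be illustrated with a toy example: an orthogonal rotation preserves total energy but can spread a single outlier across blocks, so no block's shared scale is dominated by it. The sketch below uses a normalized Walsh-Hadamard transform as a stand-in rotation; this is an assumption for illustration only and is not TORQ's Schur-Horn or maximum-entropy construction. The block size of 4 is likewise illustrative.

```python
import math
from typing import List, Sequence


def hadamard_rotate(x: Sequence[float]) -> List[float]:
    """Apply a normalized Walsh-Hadamard transform (an orthogonal rotation)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def block_energies(x: Sequence[float], block_size: int) -> List[float]:
    """Sum of squares per contiguous block (each block shares one scale)."""
    return [
        sum(v * v for v in x[i:i + block_size])
        for i in range(0, len(x), block_size)
    ]


if __name__ == "__main__":
    # One large outlier in the first block, small values elsewhere.
    acts = [8.0, 0.1, -0.2, 0.1, 0.1, -0.1, 0.2, 0.1]
    print(block_energies(acts, 4))
    print(block_energies(hadamard_rotate(acts), 4))
```

After rotation the two block energies are much closer to each other, which is the kind of rebalancing the macroscopic level of the method targets.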

[AI-60] Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

Link: https://arxiv.org/abs/2605.19529
Authors: Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: BEA 2026

Abstract:When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM’s scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance ( r = 0.698 ) with systematic positive bias. GEA is strong ( r > 0.7 ) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
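
The "roughly half the intended variance" reading is consistent with squaring the reported correlation, assuming that the share of variance recovered is measured by r^2 (an assumption; the abstract does not spell this out). A small sketch with hypothetical helper names:

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between intended and recovered skill levels."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(r: float) -> float:
    """Share of variance accounted for by a linear relationship, r squared."""
    return r * r


if __name__ == "__main__":
    print(variance_explained(0.698))  # a little under one half
```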

[AI-61] Efficient Elicitation of Collective Disagreements

Link: https://arxiv.org/abs/2605.19521
Authors: Mohamed Ouaguenouni, Felipe Garrido-Lucero, Umberto Grandi, César Hidalgo, Magdalena Tydrichova
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Notes:

Abstract:We analyze the structure of the disagreement among a population of voters over a set of alternatives. Surveys typically ask either for pairwise comparisons, which are simple and intuitive for participants, or for full rankings over alternatives, which elicit voters’ entire preferences. Building on the observation that pairwise comparisons cannot distinguish structural disagreement from noise, we propose a stratified framework to identify the minimal aggregated preference information needed to compute a number of disagreement measures from the literature. Specifically, we introduce the plurality matrix, a generalization of pairwise comparisons that records, for every subset S of alternatives, the probability that each a \in S ranks first in S . We define the level of a disagreement measure as the smallest subset size needed to express it, showing that many existing notions, including rank-variance and divisiveness, sit at level 3 , which proves that pairwise comparisons are not enough. In addition, we demonstrate the interest of going beyond level 3 both theoretically and experimentally. To make these results actionable, we design two elicitation protocols to estimate the plurality matrix, exploring the trade-off between the number of required participants and the cognitive load required of each of them.
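
The plurality matrix defined above can be computed directly when full rankings are available. The sketch below does this empirically, using observed frequencies in place of probabilities; the function name, the dictionary layout, and the choice to start at subsets of size 2 are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Optional, Sequence, Tuple


def plurality_matrix(
    rankings: Sequence[Sequence[str]],
    max_size: Optional[int] = None,
) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """For every subset S (|S| >= 2), the share of voters ranking each a in S first within S."""
    alternatives = sorted(rankings[0])
    if max_size is None:
        max_size = len(alternatives)
    result: Dict[Tuple[str, ...], Dict[str, float]] = {}
    for size in range(2, max_size + 1):
        for subset in combinations(alternatives, size):
            members = set(subset)
            # The top element of S for a voter is the first member of S in their ranking.
            tops = Counter(
                next(a for a in ranking if a in members) for ranking in rankings
            )
            result[subset] = {a: tops[a] / len(rankings) for a in subset}
    return result


if __name__ == "__main__":
    # Three voters forming a cycle over a, b, c.
    votes = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]
    for subset, row in plurality_matrix(votes).items():
        print(subset, row)
```

Restricting `max_size` to 2 recovers ordinary pairwise comparisons, while `max_size=3` gives the level at which the abstract says measures such as rank-variance and divisiveness become expressible.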

[AI-62] BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

Link: https://arxiv.org/abs/2605.19518
Authors: Carla Castedo, Enrique Iglesias, Manuel Lama, Alberto Bugarin-Diz, Maria-Esther Vidal, David Chaves-Fraga
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Generating Knowledge Graphs (KGs) remains one of the most time-consuming and labor-intensive tasks for knowledge engineers, as they need to identify semantic equivalences between input data sources and ontology terms. While declarative solutions (e.g., RML, SPARQL-Anything) have helped to generalize this process, aligning input schema elements with ontology terms still involves intricate transformations and requires considerable manual effort. With the advent of Large Language Models (LLMs), there is growing interest in leveraging their capabilities to assist KG engineers. Although some studies have explored using LLMs to automate KG construction, there is still no standardized framework for assessing how effectively they establish correspondences between data schemas and ontology concepts. Therefore, in this paper, we propose BLINKG, a benchmark designed to evaluate the mapping capabilities of LLMs in constructing KGs from heterogeneous data sources. The benchmark includes a set of scenarios with increasing complexity, based on real-world use cases. We conduct an extensive experimental evaluation of several state-of-the-art LLMs using BLINKG and observe that they already offer promising solutions. However, their performance remains limited in complex scenarios. Thanks to this benchmark, we can already assess the current capabilities of LLMs for KG construction. Additionally, we define a set of requirements for achieving (semi)automated (LLM-driven) KG construction, opening new research lines in this area.

[AI-63] ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Link: https://arxiv.org/abs/2605.19503
Authors: Carlo Romeo, Andrew D. Bagdanov
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes:

Abstract:Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground’s morphological diversity and animation-style stylistic constraints.

[AI-64] CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

Link: https://arxiv.org/abs/2605.19501
Authors: Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Notes: Accepted to RSS 2026

Abstract:Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner’s proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE’s effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: this https URL

[AI-65] Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Link: https://arxiv.org/abs/2605.19485
Authors: Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model’s internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs’ attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.

[AI-66] Sampling-Based Safe Reinforcement Learning

Link: https://arxiv.org/abs/2605.19469
Authors: Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Notes:

Abstract:Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.
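
The idea of enforcing a constraint jointly across a finite set of dynamics samples can be sketched in a toy setting. The example below assumes scalar linear dynamics x' = a*x + b*u and a bound on |x'|; the dynamics form, function names, and candidate-action search are all hypothetical simplifications, not the SBSRL algorithm itself.

```python
from typing import Callable, Optional, Sequence, Tuple

Dynamics = Tuple[float, float]  # (a, b) in the toy model x' = a * x + b * u


def is_jointly_safe(
    state: float, action: float, samples: Sequence[Dynamics], limit: float
) -> bool:
    """An action counts as safe only if every sampled model keeps |x'| within the limit."""
    return all(abs(a * state + b * action) <= limit for a, b in samples)


def best_safe_action(
    state: float,
    candidates: Sequence[float],
    samples: Sequence[Dynamics],
    limit: float,
    reward: Callable[[float], float],
) -> Optional[float]:
    """Pick the highest-reward candidate that is safe under all sampled dynamics."""
    safe = [u for u in candidates if is_jointly_safe(state, u, samples, limit)]
    return max(safe, key=reward) if safe else None


if __name__ == "__main__":
    models = [(1.0, 1.0), (1.0, 1.5)]  # two plausible dynamics samples
    print(best_safe_action(1.0, [0.0, 0.5, 1.0], models, 2.0, reward=lambda u: u))
```

Here the largest action is rejected because one of the sampled models would violate the bound, even though the other would not.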

[AI-67] Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

Link: https://arxiv.org/abs/2605.19462
Authors: Noam Major, Kathy Razmadze, Yoli Shavit
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Notes:

Abstract:The success of self-supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the “pre-training dividend”: the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre-training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non-universal, governed by a precision-invariance trade-off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: this https URL
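
A DWT-based augmentation that perturbs only local fluctuations might look like the following sketch. It uses a single-level Haar transform and rescales the detail coefficients before reconstructing; the wavelet choice, the single level, and the `detail_scale` parameter are assumptions for illustration and may not match the paper's augmentations.

```python
import math
from typing import List, Sequence, Tuple


def haar_dwt(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Single-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx: Sequence[float], detail: Sequence[float]) -> List[float]:
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def dwt_augment(signal: Sequence[float], detail_scale: float) -> List[float]:
    """Rescale the fine-scale (detail) coefficients while keeping the coarse trend."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, [d * detail_scale for d in detail])


if __name__ == "__main__":
    x = [1.0, 3.0, 2.0, 6.0]
    print(dwt_augment(x, 0.0))  # local fluctuations removed, pairwise means kept
```

Two views produced with different `detail_scale` values share the same coarse structure, which is the kind of pair an invariance objective could be trained on.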

[AI-68] Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Link: https://arxiv.org/abs/2605.19461
Authors: Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization’s mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO’s 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.
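
The group-level target described above can be sketched as follows: normalize the rewards of a sampled group into a target distribution, normalize the policy's sequence log-probabilities over the same group, and measure the forward KL from target to policy. Non-negative rewards, the group-normalized policy distribution, and all function names are assumptions made for this illustration; the paper's exact objective may differ.

```python
import math
from typing import List, Sequence


def group_target(rewards: Sequence[float]) -> List[float]:
    """Target distribution over a group of trajectories, proportional to (non-negative) rewards."""
    total = sum(rewards)
    if total <= 0:
        return [1.0 / len(rewards)] * len(rewards)
    return [r / total for r in rewards]


def group_policy(logprobs: Sequence[float]) -> List[float]:
    """Policy probabilities renormalized over the sampled group (softmax of log-probs)."""
    m = max(logprobs)
    weights = [math.exp(lp - m) for lp in logprobs]
    z = sum(weights)
    return [w / z for w in weights]


def forward_kl(target: Sequence[float], policy: Sequence[float]) -> float:
    """KL(target || policy); large when the policy ignores a rewarded trajectory."""
    return sum(q * math.log(q / p) for q, p in zip(target, policy) if q > 0)


if __name__ == "__main__":
    q = group_target([1.0, 1.0, 2.0])
    collapsed = group_policy([0.0, -5.0, -5.0])   # mass on one trajectory
    spread = group_policy([-1.0, -1.0, -0.3])     # mass on all rewarded ones
    print(forward_kl(q, collapsed), forward_kl(q, spread))
```

Because the forward KL penalizes assigning low probability to any trajectory with non-zero target mass, minimizing it pushes toward mode-covering rather than mode-seeking behavior.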

[AI-69] Generative Auto-Bidding with Unified Modeling and Exploration SIGIR2026

Link: https://arxiv.org/abs/2605.19457
Authors: Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes: 11 pages, SIGIR 2026

Abstract:Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT’s exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated “explore-safeguard-select” pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability. 
Related DOI: https://doi.org/10.1145/3805712.3809661
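
The final "select" step of the explore-safeguard-select pipeline can be pictured as a comparison of Q-values for the two candidate actions. The sketch below is a hypothetical simplification: the `q_value` callable, the scalar actions, and the tie-breaking toward the fallback are assumptions, not details given in the abstract.

```python
from typing import Callable


def select_action(
    q_value: Callable[[float, float], float],
    state: float,
    exploratory_action: float,
    fallback_action: float,
) -> float:
    """Return the exploratory action only if its Q-value strictly beats the safe fallback."""
    q_explore = q_value(state, exploratory_action)
    q_fallback = q_value(state, fallback_action)
    return exploratory_action if q_explore > q_fallback else fallback_action


if __name__ == "__main__":
    # Toy critic that prefers bids close to the current state value.
    toy_q = lambda s, a: -((a - s) ** 2)
    print(select_action(toy_q, 1.0, exploratory_action=1.2, fallback_action=2.0))
```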

[AI-70] Resilient Byzantine Agreement with Predictions

Link: https://arxiv.org/abs/2605.19452
Authors: Julien Dallot, Darya Melnyk, Tijana Milentijevic, Stefan Schmid, Patrik Welters
Institution: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Notes:

Abstract:This paper studies the Byzantine Agreement problem where the nodes have access to a predictor that flags nodes for suspicion of faulty (Byzantine) behavior. We focus on algorithmic resilience – the maximum number of faulty nodes an algorithm can tolerate – and present algorithms and impossibility results whose resilience depends on the accuracy of the predictor. As our first main result, we give a complete characterization of the consistency–robustness trade-offs in both the non-authenticated and authenticated settings: for n nodes and a parameter \alpha \in [0, 1] , we present algorithms that tolerate up to \alpha \cdot n faulty nodes when the predictor is correct (consistency), and up to \frac{1-\alpha}{2} \cdot n - 1 faulty nodes when the predictor is arbitrarily wrong (robustness); in the authenticated setting the robustness bound improves to (1-\alpha) \cdot n - 1 . These trade-offs are exactly tight, as we show that one additional faulty node renders the problem impossible. Our second main result characterizes smoothness: the rate at which resilience degrades as the predictor becomes less accurate. We show that resilience decreases linearly in the number of wrong predictions as long as that number stays within a constant fraction of n . Concretely, in the non-authenticated setting each additional wrong prediction loses one unit of resilience, whereas in the authenticated setting the decline is halved since two wrong predictions are needed to lose one unit of resilience.
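
The consistency and robustness expressions quoted above are easy to evaluate for concrete n and \alpha. The sketch below only plugs numbers into those two formulas; rounding down to an integer and clamping at zero are assumptions added here, and the function name is hypothetical.

```python
import math
from fractions import Fraction
from typing import Tuple, Union


def resilience_bounds(
    n: int, alpha: Union[Fraction, float, str], authenticated: bool = False
) -> Tuple[int, int]:
    """Return (consistency, robustness) fault counts from the stated trade-off.

    consistency = alpha * n
    robustness  = (1 - alpha) / 2 * n - 1   (non-authenticated)
                = (1 - alpha) * n - 1       (authenticated)
    Values are floored and clamped at 0 (an assumption of this sketch).
    """
    a = Fraction(alpha)
    if not 0 <= a <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    consistency = math.floor(a * n)
    if authenticated:
        robustness = math.floor((1 - a) * n - 1)
    else:
        robustness = math.floor((1 - a) / 2 * n - 1)
    return consistency, max(robustness, 0)


if __name__ == "__main__":
    print(resilience_bounds(90, Fraction(1, 3)))
    print(resilience_bounds(90, Fraction(1, 3), authenticated=True))
```

Raising \alpha buys more tolerance when the predictor is right at the cost of less tolerance when it is wrong, which is the trade-off the first main result characterizes.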

[AI-71] What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Link: https://arxiv.org/abs/2605.19447
Authors: Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Notes:

Abstract:Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

[AI-72] When the Majority Votes Wrong the Intervention Timing for Test-Time Reinforcement Learning Hides in the Extinction Window

Link: https://arxiv.org/abs/2605.19444
Authors: Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B and improves over TTRL by a relative 54% on AIME 2025. Our code and implementation details are available at this https URL.
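The lock-in effect described here follows from how a majority-vote pseudo-label assigns reward. The sketch below illustrates only that generic mechanism, under the assumption that each sampled answer gets reward 1 when it matches the most common answer and 0 otherwise; the function name is hypothetical, and this is not the TTRL-Guard implementation.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Use the most common sampled answer as the pseudo-label and reward agreement."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical rollout: suppose the true answer is "7", but most samples say "12".
label, rewards = majority_vote_rewards(["12", "12", "7", "12", "7"])
print(label, rewards)
```

In this toy rollout the two samples that answered "7" receive zero reward, so an update driven by these rewards pushes the policy further toward the wrong majority answer.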

[AI-73] When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR

Link: https://arxiv.org/abs/2605.19425
Authors: Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 10 figures

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i) harmful gradients concentrate at the lm_head while intermediate layers are structurally attenuated, and (ii) the lm_head gradient norm lower-bounds the policy divergence. These results establish the lm_head gradient norm as a principled, real-time signal of catastrophic policy shift. Guided by this insight, we propose Dynamic Gradient Gating (DGG), a lightweight intervention that monitors the lm_head gradient norm in real time and intercepts harmful gradients before they corrupt the optimizer. DGG consistently matches or exceeds the standard single-use baseline, achieving up to 2.93× sample efficiency and 2.14× wall-clock speedup across math, ALFWorld, WebShop, and search-augmented QA tasks.
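One way to picture gating on a monitored gradient norm is a small check that runs before each optimizer step. The sketch below is a hypothetical illustration only: the abstract does not state the gating rule, so the surge test (norm above a fixed multiple of a running average), the class name `GradNormGate`, and its parameters are all assumptions, and the caller is assumed to pass in the lm_head gradient norm as a plain float.

```python
from typing import Optional


class GradNormGate:
    """Block an update when the monitored gradient norm surges past a running baseline."""

    def __init__(self, ratio: float = 3.0, momentum: float = 0.9) -> None:
        self.ratio = ratio            # hypothetical surge threshold (multiple of baseline)
        self.momentum = momentum      # smoothing factor for the running baseline
        self.baseline: Optional[float] = None

    def allow(self, grad_norm: float) -> bool:
        """Return True if the update should be applied, False if it should be skipped."""
        if self.baseline is None:
            self.baseline = grad_norm
            return True
        if grad_norm > self.ratio * self.baseline:
            # Surge detected: skip the step and leave the baseline untouched.
            return False
        self.baseline = self.momentum * self.baseline + (1 - self.momentum) * grad_norm
        return True
```

In a training loop, the optimizer step on a reused batch would be taken only when `allow` returns True, so a sudden spike in the monitored norm ends reuse of that batch.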

[AI-74] Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

Link: https://arxiv.org/abs/2605.19418
Authors: Longgang He, Longzhu He, Daojing He, Chaozhuo Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM-based multi-agent systems (MAS) have demonstrated strong reasoning and decision-making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph-based MAS frameworks (1) propagate errors when conflicting signals arise without control, and (2) lack explicit modeling of conflicting inter-agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph-informed Multi-Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence-weighted edges. Reasoning proceeds through conflict-aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure- and conflict-aware weighted aggregation to yield globally consistent and conflict-resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi-agent configurations, demonstrate that SIGMA consistently outperforms state-of-the-art baselines, achieving notable gains in both accuracy and conflict-resilient performance.

[AI-75] Unlocking the Potential of Continual Model Merging: An ODE Perspective

Link: https://arxiv.org/abs/2605.19409
Authors: Lihong Lin, Haidong Kang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 8 figures

Abstract:Continual Model Merging (CMM) enables rapid customization of foundation models across sequentially arriving tasks, offering a scalable alternative to repeated retraining. However, existing merging rules lack explicit controllability over the allocation of learning capacity between previously learned capabilities and newly merged models. Consequently, as tasks are merged sequentially, this deficiency accumulates into severe forgetting, particularly in scenarios with heterogeneous task importance, where performance allocation becomes highly inconsistent. The root cause is that previous methods treat each task model as an isolated parameter point and apply fixed algebraic combinations, rather than explicitly constructing a transition that respects how independently trained models can be connected in parameter space. Motivated by mode connectivity, we assume that desirable merged models lie on low-loss connecting paths, and that continual merging should follow such paths without crossing loss barriers that induce forgetting. Grounded in these insights, we propose a novel ODE-driven Merging (ODE-M) method tailored for CMM that traces such a path by integrating a time-dependent velocity field and enforcing barrier constraints to prevent loss-increasing steps. Extensive experiments demonstrate that ODE-M achieves state-of-the-art performance compared to its competitors across mainstream CMM benchmarks.

[AI-76] A Bitter Lesson for Data Filtering

Link: https://arxiv.org/abs/2605.19407
Authors: Christopher Mohri, John Duchi, Tatsunori Hashimoto
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We investigate data filtering for large model pretraining via new scaling studies that target the high-compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large-parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally “poor” data.

[AI-77] PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Link: https://arxiv.org/abs/2605.19382
Authors: Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu, Jingru Fan, Shu Yao, Jie Zhang, Tianle Zhou, Huatao Li, Ruijie Shi, Yihan Li, Chen Qian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.

[AI-78] The Evaluation Game: Beyond Static LLM Benchmarking

Link: https://arxiv.org/abs/2605.19377
Authors: Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 36 pages

Abstract:As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer’s generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator’s group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.

[AI-79] Generative Recursive Reasoning

Link: https://arxiv.org/abs/2605.19376
Authors: Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via p_\theta(y \mid x) and, with fixed or absent inputs, unconditional generation via p_\theta(x) . Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability.

[AI-80] Conflict-Free Replicated Data Types for Neural Network Model Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging Across 26 Strategies

Link: https://arxiv.org/abs/2605.19373
Authors: Ryan Gillespie
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:All 26 neural network merge strategies we tested, including weight averaging, SLERP, TIES, DARE, Fisher merging, and evolutionary approaches, fail the algebraic properties (commutativity, associativity, idempotency) required for conflict-free distributed operation. We prove that this failure is structural: normalisation-based merges cannot simultaneously satisfy all three properties. To resolve this, we present a two-layer architecture – CRDTMergeState – that wraps any merge strategy in a CRDT-compliant (Conflict-Free Replicated Data Type) layer. Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union – trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically-ordered contribution set, with randomness seeded from the Merkle root. We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering. Empirical validation spans three tiers: controlled 4x4 tensors (104/104 tests pass), production-scale models up to 7.24B parameters (208 strategy-level tests, 43,368 layer-level property checks at capped tensor resolution), and multi-node convergence under gossip and partition healing (100 nodes, 20 orderings), with CRDT overhead below 0.5 ms. Because the wrapper is transparent, downstream performance is identical by construction, confirmed via byte-identical output verification. The reference implementation is available as crdt-merge v0.9.4.
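The two-layer separation can be illustrated with a toy example. The sketch below is not the crdt-merge API: it assumes an add-only set in place of a full OR-Set, uses a SHA-256 hash of the sorted contributions as a stand-in for the Merkle root, and applies a made-up stochastic strategy (average, then randomly drop and rescale coordinates). All names are hypothetical, and `resolve` assumes a non-empty state.

```python
import hashlib
import random

# A contribution is (contributor id, flattened weights); both fields are illustrative.
Contribution = tuple[str, tuple[float, ...]]


def merge_state(a: frozenset, b: frozenset) -> frozenset:
    """Layer 1: replica states merge by set union (commutative, associative, idempotent)."""
    return a | b


def resolve(state: frozenset, drop_p: float = 0.5) -> list[float]:
    """Layer 2: a deterministic function of the canonically ordered contribution set."""
    ordered = sorted(state)  # canonical order, independent of arrival order
    digest = hashlib.sha256(repr(ordered).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # seed derived from content only
    dim = len(ordered[0][1])
    mean = [sum(w[i] for _, w in ordered) / len(ordered) for i in range(dim)]
    # Toy stochastic strategy: drop each coordinate with probability drop_p, rescale the rest.
    return [0.0 if rng.random() < drop_p else m / (1 - drop_p) for m in mean]
```

Because the seed and the ordering depend only on the set's contents, two replicas that received the same contributions in different orders call `resolve` on equal sets and obtain the same output.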

[AI-81] Agentic Trading: When LLM Agents Meet Financial Markets

Link: https://arxiv.org/abs/2605.19337
Authors: Yihan Xia, Panpan You, Taotao Wang, Fang Liu, Han Qi, Xiaoxiao Wu, Shengli Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 59 pages, 15 figures, 27 tables

Abstract:A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field’s immediate bottlenecks.

[AI-82] MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Link: https://arxiv.org/abs/2605.19330
Authors: Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel, Sunav Choudhury
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Preprint. 25 pages, 14 figures, 5 tables

Abstract:LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many Pareto-optimal skill variants.
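The contrast between a weighted sum and Chebyshev scalarization can be seen on three toy candidates. The sketch below uses the standard weighted Chebyshev distance to an ideal point; the exponentially decaying temperature and its parameters are assumptions for illustration, since the abstract does not give MOCHA's actual schedule, and all function names are hypothetical.

```python
def chebyshev(scores, weights, ideal):
    """Weighted Chebyshev distance to the ideal point (lower is better, objectives maximized)."""
    return max(w * (z - s) for s, w, z in zip(scores, weights, ideal))


def weighted_sum(scores, weights):
    """Linear scalarization (higher is better)."""
    return sum(w * s for s, w in zip(scores, weights))


def temperature(step, t0=1.0, decay=0.99):
    """Hypothetical exponential annealing schedule: high early (explore), low late (exploit)."""
    return t0 * decay ** step


# Three Pareto-optimal candidates; C sits in a non-convex region of the front.
points = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
weights, ideal = (0.5, 0.5), (1.0, 1.0)

best_by_sum = max(points, key=lambda k: weighted_sum(points[k], weights))
best_by_chebyshev = min(points, key=lambda k: chebyshev(points[k], weights, ideal))
print(best_by_sum, best_by_chebyshev)
```

Here no choice of non-negative weights lets the weighted sum prefer C over both A and B, whereas the Chebyshev distance with equal weights selects C, which is the sense in which a weighted sum can miss variants in non-convex regions.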

[AI-83] Exploring and Developing a Pre-Model Safeguard with Draft Models

Link: https://arxiv.org/abs/2605.19321
Authors: Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi, Z. Berkay Celik
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM) alignment remains vulnerable to jailbreak attacks that elicit unsafe responses, motivating pre-model and post-model guards. Pre-model guards audit the safety of prompts before invoking target models. However, relying solely on the prompt often leads to high false-negative rates (i.e., jailbreak attacks go undetected). Post-model guards address this issue by auditing both the user prompt and the target model’s response. However, they incur a high computational cost, including increased token usage and processing time, because they operate after target model inference. In this paper, we introduce a safeguard design that leverages the transferability of jailbreak attacks to enforce prompt safety before target model inference. We first conduct a systematic study of jailbreak transferability, particularly from LLMs to small language models (SLMs). Through these experiments, we identify key factors influencing transferability. Building on these insights, we observe that responses from smaller draft models reflect the safety implications of those from large target models; i.e., given a jailbreak prompt constructed for an LLM, an SLM is likely to be triggered to generate an unaligned response. Based on this observation, our safeguard design leverages speculative inference with SLMs to generate a set of draft responses. It then feeds the original prompt and these drafts into existing guards to predict their safety. We demonstrate that this design reduces the false-negative rate of pre-model guards and offers a more efficient alternative to post-model guards. Notice: This paper contains examples of harmful language.
Journal reference: ACM Conference on AI and Agentic Systems (ACM CAIS 2026)

[AI-84] Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement ICLR2026

Link: https://arxiv.org/abs/2605.19317
Authors: Taegu Kang, Jaesik Yoon, Sungjin Ahn
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the ICLR 2026 Workshop on AI with Recursive Self-Improvement

Abstract:Inference-time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference-time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region-wise, mixed-noise conditioning, inference-time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion that requires no external verifier. Starting from an already-generated sample, IPR re-noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference-time scaling strategy for diffusion models in sequential, mixed-noise settings. Code is available at: this https URL

[AI-85] ContextFlow: Hierarchical Task-State Alignment for Long-Horizon Embodied Agents

Link: https://arxiv.org/abs/2605.19314
Authors: Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner’s active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

[AI-86] DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies

Link: https://arxiv.org/abs/2605.19294
Authors: Yixiang Zhu, Yonghao Chen, Rui Meng, Jingyu Guo, Jiaxiang Zou, Zijie Yang, Taowen Wang, Xinyu Chen
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language-Action (VLA) policies are typically deployed with asynchronous inference: the robot executes a previously predicted action chunk while the model computes the next one. This creates a prediction-execution misalignment: the chunk is conditioned on the observation taken before inference began, but executes in a physical state that has already drifted forward by several control steps; naive asynchronous rollover collapses from 89% to under 1% on Kinetix as the inference cycle covers up to seven control steps. We introduce DEFLECT, a fully offline post-training refinement that applies as a near drop-in upgrade to existing async-VLA stacks by converting latency itself into a label-free preference signal: counterfactual fresh/stale action pairs are constructed from a frozen reference policy and scored under the deployment-time conditioning via an implicit flow-matching likelihood-ratio surrogate, with no human labels, reward models, or online rollouts. DEFLECT substantially extends the usable delay envelope of async VLA control, with +6.4 success-rate gain in the high-latency regime (5-7 control steps), +4.6 when transferred to a real-scale VLA at the longest delay, and consistent improvements on two real-robot tasks (a bimanual conveyor pick-and-place and a reactive whack-a-mole).

[AI-87] EviTrack: Selection over Sampling for Delayed Disambiguation

Link: https://arxiv.org/abs/2605.19283
Authors: Omer Haq
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: this https URL

Abstract:Sequential prediction is challenging in regimes of delayed disambiguation, where early observations are ambiguous and multiple latent explanations remain plausible until sufficient evidence accumulates. Standard approaches based on marginal inference struggle in this setting, either collapsing uncertainty prematurely or failing to recover once informative evidence arrives. We introduce EviTrack, a test-time inference framework that operates over latent trajectories rather than marginal states. EviTrack maintains a set of competing trajectory hypotheses and applies evidence- and likelihood-ratio-based selection to delay commitment until supported by data, drawing inspiration from hypothesis management in multiple hypothesis tracking and track-before-detect. To evaluate this setting, we construct a controlled synthetic benchmark with known latent ground truth that explicitly exhibits delayed disambiguation. At matched inference budget, EviTrack substantially outperforms sampling-based baselines, achieving faster post-disambiguation recovery. These results show that, in delayed disambiguation regimes, moderate trajectory-level selection is more effective than increasing sampling coverage, highlighting selection over sampling as a key principle for reliable sequential inference.
Submission history: [v1] Tue, 19 May 2026 03:01:42 UTC (666 KB)
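The idea of delaying commitment until a likelihood ratio supports it can be shown with two competing trajectory hypotheses that agree at first and diverge later. The sketch below is a generic illustration, not EviTrack itself: the Gaussian observation model, the fixed log-likelihood-ratio threshold, and the names `track`, `flat`, and `ramp` are all assumptions.

```python
import math


def gaussian_loglik(x: float, mean: float, sigma: float = 1.0) -> float:
    """Log-density of x under a normal distribution with the given mean and sigma."""
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def track(observations, hypotheses, threshold: float = 3.0):
    """Accumulate evidence per trajectory hypothesis; commit only when the leader's
    log-likelihood margin over the runner-up reaches the threshold."""
    scores = {name: 0.0 for name in hypotheses}
    for t, x in enumerate(observations):
        for name, mean_at in hypotheses.items():
            scores[name] += gaussian_loglik(x, mean_at(t))
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) > 1 and scores[ranked[0]] - scores[ranked[1]] >= threshold:
            return ranked[0], t  # committed hypothesis and the step of commitment
    return None, len(observations) - 1  # evidence never became decisive


# Two hypothetical latent trajectories that are identical for the first three steps.
hypotheses = {
    "flat": lambda t: 0.0,
    "ramp": lambda t: 0.0 if t < 3 else 2.0 * (t - 2),
}
decision = track([0.1, -0.1, 0.0, 2.0, 4.0, 6.0], hypotheses)
print(decision)
```

During the ambiguous prefix both hypotheses score identically, so nothing is committed; the decision is made only after the observations start to separate the two trajectories.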

[AI-88] ExECG: An Explainable AI Framework for ECG models

Link: https://arxiv.org/abs/2605.19258
Authors: Jong-Hwan Jang, Yong-yeon Jo
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning has enabled ECG diagnostic models with strong performance in tasks such as arrhythmia classification and abnormality detection. However, accuracy alone is insufficient for clinical deployment because it does not explain why a specific output was produced, limiting justification, error analysis, and trust. Although ECG XAI has been extensively investigated and steadily improved, practical pipelines and reporting conventions vary across studies, hindering reuse and reproducibility. To address these issues, we present ExECG (Explainable AI framework for ECG models), a Python framework that provides a three-stage pipeline: Wrapper standardizes access across heterogeneous ECG formats and intermediate representations, Explainer unifies diverse XAI methods under a shared execution protocol, and Visualizer supports consistent cross-method comparison within a unified interface. We demonstrate end-to-end usage with concise examples and two case studies, highlighting interoperable and reproducible ECG explainability.

[AI-89] Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

Link: https://arxiv.org/abs/2605.19250
Authors: Jinrui Jiang, Zhangtai Wu, Zhen Wu, Xinyu Dai
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.

[AI-90] Euclidean Embedding of Data Using Local Distances

Link: https://arxiv.org/abs/2605.19243
Authors: Dimitris Arabadjis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
Comments:

Abstract:We study the problem of recovering a globally consistent Euclidean embedding of data, given only a local distance graph, and propose a method that optimally represents these distances. The method operates solely on a neighborhood graph weighted by pairwise distances, without requiring any prior vector representation of the data. The embedding is obtained by solving a variational problem that matches local, on-graph distances to the Euclidean metric, induced by the differentials of the embedding functions. The resulting Euler-Lagrange equations are derived in a coordinate-free form, enabling direct evaluation of all operators from the distance graph alone. Though non-linear and missing an explicit expression for their non-linearity, these equations are shown to be resolved as an iteratively updated sparse linear problem. The main contributions of the proposed approach are (a) the derivation of the functional equations governing the optimal Euclidean embedding in the continuum, (b) a representation-free formulation that requires only a neighborhood distance graph and no feature vectors and (c) an estimation procedure based exclusively on local graph operations. We experimentally evaluate the resulting non-parametric algorithm on synthetic manifolds and real datasets, demonstrating consistent preservation of local metric structure and neighboring relations, while approximating the global isometric embedding.

[AI-91] Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

Link: https://arxiv.org/abs/2605.19229
Authors: Yan Wang,Ziyi Guo,Christopher McCarty
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.
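
The remark that near-zero aggregate bias can mask opposing subgroup errors can be illustrated with a minimal sketch. The toy data, the group labels, and the function names below are hypothetical and are not taken from the paper; the sketch only shows why a subgroup-stratified audit reports more than the overall signed bias.

```python
from statistics import mean

def signed_bias(imputed, observed):
    """Mean signed error: positive means over-imputation on average."""
    return mean(i - o for i, o in zip(imputed, observed))

def subgroup_bias(imputed, observed, groups):
    """Signed bias computed separately for each subgroup label."""
    by_group = {}
    for i, o, g in zip(imputed, observed, groups):
        by_group.setdefault(g, []).append(i - o)
    return {g: mean(errs) for g, errs in by_group.items()}

# Hypothetical toy data: group "a" is over-imputed, group "b" is under-imputed.
observed = [3.0, 3.0, 3.0, 3.0]
imputed = [4.0, 4.0, 2.0, 2.0]
groups = ["a", "a", "b", "b"]

overall = signed_bias(imputed, observed)             # errors cancel out
per_group = subgroup_bias(imputed, observed, groups)  # errors are visible
```

Here the overall bias is zero while each subgroup is off by one unit in opposite directions, which is the pattern a stratified report is meant to surface.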

[AI-92] Token by Token Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

Link: https://arxiv.org/abs/2605.19227
Authors: Tobias Braun,Jonas Henry Grebe,Hossein Shakibania,Anna Rohrbach,Marcus Rohrbach
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unified autoregressive models (UAMs) are transformer models that generate text as well as image tokens within a single autoregressive pass. Shared parameters and a multimodal vocabulary simplify the training pipeline and facilitate flexible multimodal generation, yet might introduce new vulnerabilities. In particular, we are the first to show that this unified architecture enables multimodal backdoor attacks, where a trigger can propagate malicious effects across multiple output modalities. Specifically, we present the Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting UAMs, exploring both data-based and model-based poisoning strategies. We demonstrate that innocuous characters or even common words can be transformed into triggers that elicit harmful behavior in autoregressive image generation. ToBAC can jointly manipulate visual outputs and accompanying text, increasing the perceived authenticity of fabricated content. With model access, ToBAC enables attacks on the unified Liquid model in which a subtle word (e.g., "cool") induces modality-aligned brand promotion or ideological influence in 55% of generations. Without model access, ToBAC can be induced through data poisoning, achieving an average success rate of 63.1% against JanusPro.

[AI-93] SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

Link: https://arxiv.org/abs/2605.19219
Authors: Han Li,Vibhor Malik,Zahra Zanjani Foumani,Alberto Castelo,Shuang Xie,Ailin Fan,Keat Yang Koay,Yuanzheng Zhu,Meysam Feghhi,Ronie Uliana,Zhaoyu Zhang,Angelo Ocana Martins,Mingyu Zhao,Francis Pelland,Jonathan Faerman,Nikolas LeBlanc,Aaron Glazer,Andrew McNamara,Zhong Wu,Lingyun Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.
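
The abstract does not define "directional alignment". One plausible reading, assumed here, is the fraction of A/B tests in which the simulated add-to-cart shift has the same sign as the observed one. The function name and the numbers below are hypothetical.

```python
def directional_alignment(simulated_shifts, observed_shifts):
    """Fraction of A/B tests where simulated and observed shifts share a sign."""
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(zip(simulated_shifts, observed_shifts))
    if not pairs:
        raise ValueError("need at least one A/B test")
    agree = sum(1 for s, o in pairs if sign(s) == sign(o))
    return agree / len(pairs)

# Hypothetical add-to-cart rate shifts (treatment minus control) for four tests.
simulated = [0.02, -0.01, 0.03, -0.04]
observed = [0.01, -0.03, -0.02, -0.01]
score = directional_alignment(simulated, observed)
```

A sign-based score ignores effect magnitude, so it suits a screening step that decides which variants deserve a real experiment.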

[AI-94] Not all uncertainty is alike: volatility, stochasticity, and exploration

Link: https://arxiv.org/abs/2605.19215
Authors: Payam Piray
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces reversed rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.
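
The learning side of this asymmetry can be seen in a textbook scalar Kalman filter. The sketch below is not the paper's CAUSE bonus or its Gittins-index extension. It assumes a random-walk latent reward, and the function and variable names are illustrative.

```python
def kalman_gain(posterior_var, volatility, stochasticity):
    """Learning rate of a scalar Kalman filter for a random-walk latent state.

    volatility: variance of the latent drift per step (process noise).
    stochasticity: variance of the observation noise.
    """
    prior_var = posterior_var + volatility  # uncertainty grows as the state drifts
    return prior_var / (prior_var + stochasticity)

base = kalman_gain(1.0, volatility=0.5, stochasticity=1.0)
more_volatile = kalman_gain(1.0, volatility=2.0, stochasticity=1.0)
more_noisy = kalman_gain(1.0, volatility=0.5, stochasticity=4.0)
```

Raising volatility pushes the gain up, because new outcomes carry more information about a moving state. Raising stochasticity pushes it down, because each outcome is less reliable.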

[AI-95] Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

Link: https://arxiv.org/abs/2605.19202
Authors: Fausto Mauricio Lagos Suarez,Akshit Saradagi,Vidya Sumathy,Viswa Narayanan Sankaranarayanan,George Nikolakopoulos
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

Abstract:This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level Quadrotor controller within an autonomous Quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the article presents an end-to-end (mapping states to RPMs) Quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, this article utilizes a higher navigation guidance layer comprising a Traveling Salesman Problem (TSP) planner and a Rapidly-exploring Random Tree Star (RRT*) planner. Over a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, collision-free paths that respect the tracking limitations of the lower end-to-end RL policy are generated by an RRT* planner. Through five target inspection scenarios, this article demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.

[AI-96] On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis

Link: https://arxiv.org/abs/2605.19201
Authors: Danu Kim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented at 32nd Samsung Humantech Paper Awards

Abstract:Deep learning models detect pneumonia from chest X-rays with high accuracy, but the performance declines under domain shifts caused by differences in devices, patients, or institutions. We present PneumoNet, a domain-incremental learning method for point-of-care pneumonia diagnosis in resource-limited settings. PneumoNet combines a lightweight CNN for on-device prediction, a dual-stage balanced buffer for class-balanced replay, and a dynamic class-weighted loss to correct training-batch imbalances. Evaluated on a domain-shifted PneumoniaMNIST dataset simulating five realistic domain change scenarios, PneumoNet achieves 86.6% accuracy with 1.4% forgetting while being smaller and faster than existing baselines. These results highlight PneumoNet’s potential to enable adaptive, privacy-preserving diagnostic AI directly on point-of-care medical devices in real-world and pandemic-ready healthcare.
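
The abstract does not say how the dynamic class-weighted loss is computed. A common scheme, assumed here purely for illustration, recomputes inverse-frequency weights from the labels of each training batch. It may differ from PneumoNet's actual formulation, and all names below are hypothetical.

```python
import math
from collections import Counter

def batch_class_weights(labels):
    """Inverse-frequency weights recomputed from the labels in one batch."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_bce(probs, labels):
    """Class-weighted binary cross-entropy averaged over the batch."""
    w = batch_class_weights(labels)
    losses = [
        -w[y] * (math.log(p) if y == 1 else math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)

# Hypothetical imbalanced batch: one pneumonia case (1), three normal cases (0).
labels = [1, 0, 0, 0]
weights = batch_class_weights(labels)
loss = weighted_bce([0.9, 0.1, 0.1, 0.1], labels)
```

Because the weights depend on the batch, a replay batch that happens to be skewed toward one class does not dominate the gradient.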

[AI-97] Hallucination as Exploit: Evidence-Carrying Multimodal Agents

Link: https://arxiv.org/abs/2605.19192
Authors: Guijia Zhang,Hao Zheng,Harry Yang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 21 pages, 6 figures, 13 tables

Abstract:Multimodal agents use screenshots, documents, and webpages to choose tool calls. When a false visual claim triggers a click, email, extraction, or transfer, hallucination becomes an authorization failure rather than an answer-quality error. We formalize this failure mode as hallucination-to-action conversion: an unsupported perceptual claim supplies the precondition that makes a privileged action appear permitted. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence. ECA decomposes each tool call into action-critical predicates, obtains typed certificates from constrained DOM/OCR/AX verifiers, and lets a deterministic gate grant only the privileges those certificates support. The architecture does not hide perception error; it converts opaque model belief into named verifier, schema, and implementation residuals. Verifier red-teaming over 1,900 attacks exposes this residual directly: four targeted hardening steps reduce gate bypass from 15% to 1.3%. With content-derived certificates, ECA obtains 0% unsafe-action rate on a 200-task end-to-end pipeline (Wilson 95% upper bound 2.67%) and a 120-task browser proof-of-concept (upper bound 4.3%). A direct HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defense (49.6%), but not for ECA. Oracle-certificate replay on 7,488 GPT-5.4 benchmark traces serves as a gate-correctness sanity check, and neural judge baselines remain bypassable under the same threat model. The resulting principle is simple: model language may propose actions, but external evidence must authorize them.
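
The idea of a deterministic gate that grants only what certificates support can be sketched as follows. The predicate names, the `REQUIRED` policy table, and the `Certificate` fields are hypothetical. The paper's verifiers, schemas, and privilege model are not specified in the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    predicate: str  # action-critical claim, e.g. "recipient_visible"
    verifier: str   # which constrained verifier produced it: "dom", "ocr", "ax"
    holds: bool     # whether the verifier confirmed the claim

# Hypothetical policy: predicates each tool call needs before it may run.
REQUIRED = {
    "send_email": {"recipient_visible", "user_requested_send"},
    "click": {"target_element_exists"},
}

def gate(action, certificates):
    """Allow an action only if every required predicate has a passing certificate."""
    if action not in REQUIRED:
        return False  # unknown actions are denied by default
    proven = {c.predicate for c in certificates if c.holds}
    return REQUIRED[action] <= proven

certs = [
    Certificate("recipient_visible", "dom", True),
    Certificate("user_requested_send", "ocr", False),
]
email_allowed = gate("send_email", certs)
click_allowed = gate("click", [Certificate("target_element_exists", "ax", True)])
```

The gate never reads free-form model text. A model's claim only matters once a verifier has turned it into a typed certificate.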

[AI-98] Discoverable Agent Knowledge – A Formal Framework for Agentic KG Affordances (Extended Version)

Link: https://arxiv.org/abs/2605.19186
Authors: Terry R. Payne,Valentina Tamma,Enrico Daga
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current Knowledge Graph (KG) metadata standards such as VoID and DCAT describe what a KG contains yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent’s task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. A five-point research agenda identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale.

[AI-99] Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

Link: https://arxiv.org/abs/2605.19185
Authors: Shiheng Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.
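
The action-gap certificate can be checked at a single state with a small sketch. If every surrogate value is within half the true gap of its true value, the best action stays strictly below the others, so the argmin is unchanged. The function names and numbers are illustrative. The paper's certificate applies along the whole rollout.

```python
def greedy_action(q_values):
    """Index of the action with the smallest cost-to-go (argmin-Q planner)."""
    return min(range(len(q_values)), key=lambda a: q_values[a])

def action_gap(q_values):
    """Difference between the best and second-best true action values."""
    ordered = sorted(q_values)
    return ordered[1] - ordered[0]

def certificate_holds(true_q, surrogate_q):
    """True if the max surrogate error is below half the true action gap."""
    max_err = max(abs(t - s) for t, s in zip(true_q, surrogate_q))
    return max_err < action_gap(true_q) / 2

true_q = [4.0, 5.0, 7.0]          # true gap between best and runner-up: 1.0
good_surrogate = [4.4, 4.6, 7.3]  # max error about 0.4, below 0.5
bad_surrogate = [4.8, 4.3, 7.0]   # max error about 0.8, certificate fails
```

With `good_surrogate` the greedy choice matches the true one. With `bad_surrogate` the certificate fails and the greedy choice flips to the second action.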

[AI-100] Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand

Link: https://arxiv.org/abs/2605.19172
Authors: Yihong Tang,Tong Nie,Junlin He,Qianjun Huang,Dingyi Zhuang,Lijun Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Forecasting urban delivery demand becomes substantially more challenging when newly added service regions lack historical records. Existing spatiotemporal forecasters effectively model spatial dependence once sufficient node histories are available. Still, they remain parametric and therefore struggle to recover short-term operational dynamics in cold-start regions. Geospatial embeddings help identify where a region is and what function it serves, yet they do not directly reveal how a similar region behaves under a comparable temporal context. We propose Bridge, a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory of region-time windows. For each target region, Bridge retrieves future demand patterns from the memory using both regional context and recent dynamics, and refines the backbone forecast through a gated fusion mechanism. To align retrieval with forecasting utility, we further train the retriever with a future-aware objective that favors entries whose future trajectories best match the target. Experiments on four real-world delivery datasets show that Bridge consistently improves over competitive spatiotemporal baselines in both within-city cold-start and cross-city transfer with partial observations. The results show that retrieval augmentation provides a useful operational memory for cold-start urban demand forecasting when parametric graph generalization alone is insufficient.

[AI-101] Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

Link: https://arxiv.org/abs/2605.19150
Authors: Aleksandar Terzić,Francesco Carzaniga,Nicolas Menet,Yannick Biehl,Michael Hersche,Thomas Hofmann,Abbas Rahimi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:State-space models (SSMs) face a fundamental trade-off between efficiency and expressivity that is mainly dictated by the structure of the model’s transition matrix. Unstructured transition matrices enable maximal expressivity, as measured by their ability to model finite-state automaton (FSA) transitions, but come at a prohibitively high compute and memory cost. In contrast, most structured transition matrix forms are highly efficient both in runtime and memory consumption, but suffer from limited expressivity. Building on recent work on structured sparse SSMs, we propose Flash PD-SSM, a novel SSM that achieves comparable throughput to widely-used structured SSMs with significantly better expressivity guarantees. Flash PD-SSM maintains a trainable set of structured sparse matrices, a single one of which is discretely selected at each time-step, enabling FSA expressiveness at the level of unstructured matrices while maintaining the efficiency required for training models at scale. First, we validate Flash PD-SSM against a suite of alternative models on synthetic mechanistic and state-tracking tasks, finding that its theoretical expressivity is achieved in practice. Second, on multivariate time-series tasks involving sequences of length over 17,000, we find that Flash PD-SSM defines a new state-of-the-art (SoTA) accuracy among competing SSM methods. Finally, we demonstrate that Flash PD-SSM is an effective drop-in replacement for hybrid LLMs, yielding improvements both in natural language state-tracking and in common language modeling scenarios. The model exhibits increased throughput and decreased memory consumption compared to SSMs widely used in frontier language models.

[AI-102] Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

Link: https://arxiv.org/abs/2605.19147
Authors: John T. Halloran,Noopur S. Bhatt
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 2 Figures, 5 Tables

Abstract:Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger based data poisoning attacks.

[AI-103] Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Link: https://arxiv.org/abs/2605.19140
Authors: Jiayu Li,Enpei Zhang,Dawei Zhou,Elynn Chen,Yujun Yan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories – the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-Q, an asynchronous decentralized Q-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-Q that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural Q-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-Q matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
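
For orientation, the generic tabular SMDP Q-learning update, with the discount raised to the random option duration, can be sketched as below. This is the standard textbook rule, not the paper's neural, decentralized IC-Q. Treating each agent as an option is an assumption, and the state and option names are hypothetical.

```python
from collections import defaultdict

def smdp_q_update(q, state, option, reward, duration, next_state,
                  options, alpha=0.1, gamma=0.9):
    """One tabular SMDP Q-learning step at a handoff.

    reward: discounted reward accumulated while the option ran.
    duration: number of primitive steps the option took (random in general).
    """
    best_next = max(q[(next_state, o)] for o in options)
    target = reward + (gamma ** duration) * best_next
    q[(state, option)] += alpha * (target - q[(state, option)])
    return q[(state, option)]

q = defaultdict(float)
options = ["agent_a", "agent_b"]
q[("s1", "agent_b")] = 1.0
new_value = smdp_q_update(q, "s0", "agent_a", reward=0.5, duration=2,
                          next_state="s1", options=options)
```

The only quantity the update needs from the next decision point is the bootstrap value `best_next`. Whether that corresponds to the single scalar exchanged at each handoff in IC-Q is not stated in the abstract.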

[AI-104] COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Link: https://arxiv.org/abs/2605.19138
Authors: Ayush Agarwal,Ansh Gandhi,Jeremy A. Collins,Omar Rayyan,Aryan Sarswat,Ranjani Koushik,Masoud Moghani,Ajay Mandlekar,Animesh Garg
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system’s ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset’s quality by training state-of-the-art imitation learning algorithms.

[AI-105] POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

Link: https://arxiv.org/abs/2605.19127
Authors: Qiaoyuan Zheng,Yiqu Yang,Qi Gao,Imanol Schlag
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:LLM agents increasingly have access to private user data and act on the user’s behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5 × 5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1–30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model’s intent-following breaks down, providing a foothold for privacy alignment where it matters most.
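
Scoring by deterministic set membership might look like the sketch below. The exact formulas are an assumption: utility is taken as the share of task-required attributes disclosed, and privacy as the share of protected attributes withheld. The attribute names are hypothetical.

```python
def score_conversation(disclosed, required, protected):
    """Score one conversation by set membership of disclosed attributes.

    utility: share of task-required attributes that were disclosed.
    privacy: share of protected attributes that were withheld.
    """
    disclosed, required, protected = set(disclosed), set(required), set(protected)
    utility = len(disclosed & required) / len(required) if required else 1.0
    leaked = len(disclosed & protected)
    privacy = 1 - leaked / len(protected) if protected else 1.0
    return utility, privacy

# Hypothetical sample: the agent completes the task but leaks one protected field.
utility, privacy = score_conversation(
    disclosed={"name", "address", "ssn"},
    required={"name", "address"},
    protected={"ssn", "salary"},
)
```

Because both scores are plain set operations, no judge model is needed, which keeps the evaluation reproducible.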

[AI-106] GOAL: Graph-based Objective-Aligned Diffusion Solvers for Dynamic Multi-Objective Optimization

Link: https://arxiv.org/abs/2605.19119
Authors: Xingyu Li
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Existing neural combinatorial optimization solvers frame solution search as imitation of optimal decisions, inherently limiting their utility to single-objective minimization and static constraints. We propose GOAL, a conditioned diffusion solver over relational graph representations that enables controllable decision generation by conditioning on human-specified objectives. We introduce a heterogeneous graph encoding in which distinct edge types, corresponding to different classes of constraints, define the message passing structure of the graph neural network, which allows information to propagate selectively according to the ontology of each constraint. GOAL is instantiated and evaluated on three canonical scheduling benchmarks of varying constraint complexity: the Flow Shop Problem (FSP), the Job Shop Scheduling Problem (JSP), and the Flexible Job Shop Scheduling Problem (FJSP). Generalization is demonstrated across structurally distinct constraint regimes and problem types without architectural modification. On all three benchmarks, GOAL achieves 100% solution feasibility and near-zero MAPE (below 0.20%) on multiple objectives for problem sizes up to 20 jobs and 60 operations, outperforming NSGA-II and MOEA/D in both solution quality and inference speed by up to 25x.

[AI-107] Neural Operators for Design-Space Surrogate Modeling of Tendon-Actuated Continuum Robots ICRA2026

Link: https://arxiv.org/abs/2605.19104
Authors: Branden Frieden,James M. Ferguson,Alan Kuntz,Varun Shankar
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to ICRA 2026

Abstract:Continuum robots enable dexterous manipulation in constrained environments, but require accurate and efficient models for real-time manipulation and control. Traditional physics-based models can be computationally expensive and may suffer from inaccuracies due to unmodeled effects, while current learning-based methods often generalize poorly beyond the specific robot on which they are trained. We present a formulation of surrogate modeling for tendon-driven continuum robots as an operator learning problem that maps robot design parameters and tendon actuation inputs to resulting configurations. This formulation enables a single trained model to generalize across a large class of robot designs. We develop four novel neural operator architectures (two based on Deep Operator Networks (DeepONets) and two based on Fourier Neural Operators (FNOs)) and train them on simulation data to predict robot configurations. All architectures achieve good accuracy while allowing for fast and accurate generalization across designs. Our results demonstrate that operator learning provides an effective and generalizable surrogate for continuum robot mechanics in the design space, enabling fast modeling for control, planning, and design optimization in surgical and industrial applications.

[AI-108] ScheduleFree+: Scaling Learning-Rate-Free Schedule-Free Learning to Large Language Models

Link: https://arxiv.org/abs/2605.19095
Authors: Aaron Defazio
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

[AI-109] Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

Link: https://arxiv.org/abs/2605.19093
Authors: Zhiyuan Jerry Lin,Benjamin Letham,Samuel Dooley,Maximilian Balandat,Eytan Bakshy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

[AI-110] MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

Link: https://arxiv.org/abs/2605.19080
Authors: Ankita Awasthi,Marco Apolinario,Kaushik Roy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In Online Continual Learning (OCL), a neural network sequentially learns from a non-stationary data stream in a single pass with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training relies on multiple epochs over large datasets. The main challenge faced by OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient gating and meta-learned regularization. Gradient gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients by evaluating the effect of parameter updates on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluated our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain-incremental learning on CLEAR-10 and class-incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves the highest accuracy among all baselines and achieves positive Backward Transfer, overcoming forgetting on CLEAR-10.

[AI-111] Riemannian Networks over Full-Rank Correlation Matrices ICML2026

Link: https://arxiv.org/abs/2605.19073
Authors: Ziheng Chen,Xiaojun Wu,Bernhard Schölkopf,Nicu Sebe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2026

Abstract:Representations on the Symmetric Positive Definite (SPD) manifold have garnered significant attention across different applications. In contrast, the manifold of full-rank correlation matrices, a normalized alternative to SPD matrices, remains largely underexplored. This paper introduces Riemannian networks over the correlation manifold, leveraging five recently developed correlation geometries. We systematically extend basic layers, including Multinomial Logistic Regression (MLR), Fully Connected (FC), and convolutional layers, to these geometries. Besides, we present methods for accurate backpropagation for two correlation geometries. Experiments comparing our approach against existing SPD and Grassmannian networks demonstrate its effectiveness.

[AI-112] KVBuffer: IO-aware Serving for Linear Attention

Link: https://arxiv.org/abs/2605.19049
Authors: Longwei Zou,Lin Zhong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.
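
The deferred state update described in the abstract can be illustrated on plain, unnormalized linear attention, where the state is S_t = S_{t-1} + k_t^T v_t and the output is o_t = q_t S_t. The sketch below is an illustration only, not the SGLang implementation: the function names and the `chunk` parameter are hypothetical, and it omits any gating, decay, or normalization that a production linear-attention layer may apply. It shows that reading the large state once and adding the contribution of the buffered keys and values gives the same outputs as updating the state at every step.

```python
def outer(k, v):
    """Outer product k^T v as a d_k x d_v nested list."""
    return [[ki * vj for vj in v] for ki in k]


def add_inplace(S, M):
    """S += M for equally shaped nested lists."""
    for i, row in enumerate(M):
        for j, x in enumerate(row):
            S[i][j] += x


def vec_mat(q, S):
    """Row vector q (d_k) times matrix S (d_k x d_v)."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recurrent_decode(qs, ks, vs, d_k, d_v):
    """Baseline: update the full state at every decoding step."""
    S = [[0.0] * d_v for _ in range(d_k)]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        add_inplace(S, outer(k, v))
        outs.append(vec_mat(q, S))
    return outs


def buffered_decode(qs, ks, vs, d_k, d_v, chunk=4):
    """Buffer recent keys/values; fold them into the state once per chunk."""
    S = [[0.0] * d_v for _ in range(d_k)]
    buf_k, buf_v = [], []
    outs = []
    for q, k, v in zip(qs, ks, vs):
        buf_k.append(k)
        buf_v.append(v)
        out = vec_mat(q, S)  # contribution of tokens already in the state
        for bk, bv in zip(buf_k, buf_v):  # contribution of buffered tokens
            w = dot(q, bk)
            out = [o + w * x for o, x in zip(out, bv)]
        outs.append(out)
        if len(buf_k) == chunk:  # deferred, batched state update
            for bk, bv in zip(buf_k, buf_v):
                add_inplace(S, outer(bk, bv))
            buf_k, buf_v = [], []
    return outs
```

In this toy version the state is written once every `chunk` tokens instead of once per token, which is the source of the reduced memory traffic; how the actual system sizes and schedules the buffer is not stated in the abstract.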

[AI-113] Interference-Aware Multi-Task Unlearning

Link: https://arxiv.org/abs/2605.19042
Authors: Ying-Hua Huang,Rui Fang,Hsi-Wen Chen,Ming-Syan Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce multi-task unlearning with two settings: full-task unlearning, which removes a target instance from all tasks, and partial-task unlearning, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing task-level interference on non-target tasks and instance-level interference on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing UIS compared with the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.
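
As a rough illustration of what gradient orthogonalization between forget and retain signals can look like, the sketch below removes from a forget gradient its component along a retain gradient. This is a generic projection under stated assumptions, not the paper's procedure: the function name `orthogonalize` is hypothetical, gradients are flattened into plain lists, and the abstract does not say whether the projection is applied always or only when the two gradients conflict.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(g_forget, g_retain, eps=1e-12):
    """Return g_forget minus its projection onto g_retain.

    The result has (numerically) zero inner product with g_retain, so a
    step along it does not change the retain loss to first order.
    """
    denom = dot(g_retain, g_retain)
    if denom < eps:  # retain gradient is ~0: nothing to project out
        return list(g_forget)
    coef = dot(g_forget, g_retain) / denom
    return [gf - coef * gr for gf, gr in zip(g_forget, g_retain)]


update = orthogonalize([1.0, 2.0], [1.0, 0.0])
print(update)  # [0.0, 2.0]
```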

[AI-114] Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On KDD2026

Link: https://arxiv.org/abs/2605.19035
Authors: Yixiang Yao,Yuhang Yao,Xinyi Fan,Jiechao Gao,Jie Wang,Minjia Zhang,Srivatsan Ravi,Carlee Joe-Wong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by SIGKDD 2026 Blue Sky Ideas Track

Abstract:The rapid advancement of Large Language Models has given rise to autonomous LLM-based agents capable of complex reasoning and execution. As these agents transition from isolated operation to collaborative ecosystems, we witness the emergence of the Agent-to-Agent (A2A) network, a paradigm where heterogeneous agents autonomously coordinate to solve multi-step tasks. While these networks may offer better task performance compared to simply using one agent to complete the entire task, they introduce systemic vulnerabilities, such as adversarial composition, semantic misalignment, and cascading operational failures, that existing agent alignment techniques cannot address. In this vision paper, we argue that the trustworthiness of A2A networks cannot be fully guaranteed via retrofitting on existing protocols that are largely designed for individual agents. Rather, it must be architected from the very beginning of the A2A coordination framework. We present a comprehensive conceptual framework that situates trust in A2A systems through four design pillars.

[AI-115] KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

Link: https://arxiv.org/abs/2605.19031
Authors: Mengxi Liu,Sizhen Bian,Vitor Fortes,Francisco Calatrava Nicolas,Daniel Geißler,Maximilian Kiefer-Emmanouilidis,Bo Zhou,Paul Lukowicz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 24 pages, and 9 figures

Abstract:Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant to noise and computationally efficient. Replacing all MLP components with KANs in HAR models often degrades accuracy and computational efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that combines the strengths of both paradigms: it uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average relative improvement of 5.33% in macro F1 score over the pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, or other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.

[AI-116] Agent NLQ: A General-Purpose Agent for Natural Language to SQL

Link: https://arxiv.org/abs/2605.19010
Authors: Olena Bogdanov,Yeunji Jung,Chandra Dhir,Pareekshitreddy Gaddam,Saurabh Jain,Lakshmi Tumati,Vijay Parthasarathy,Anup Shirgaonkar
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural language to SQL (NL2SQL) conversion is an important problem for researchers and enterprises because relational databases underpin a broad range of practical problems. Despite rapid advances in the capabilities of LLMs, NL2SQL has not reached parity in accuracy with human expert SQL writers and therefore needs further algorithmic improvement. This study presents a new multi-agent method for NL2SQL that achieves 78.1% semantic accuracy on the BIg Bench for LaRge-scale Database (BIRD) benchmark. Our method leverages a semantically enriched representation of the user-provided schema, adds user-provided business rules, and produces accurate SQL queries. The main contributions of this study are: (a) we designed an optimized new orchestrator in a multi-agent solution that uses LLMs to plan, orchestrate, reflect, and self-correct to generate accurate SQL queries; (b) we developed an advanced schema enrichment method that creates context-aware metadata to improve accuracy; and (c) we demonstrated the accuracy and generalizability of the method across different domains and datasets by evaluating it on the BIRD-SQL benchmark.

[AI-117] Distilling Linearized Behavior for Effective Task Arithmetic ICML2026

Link: https://arxiv.org/abs/2605.18993
Authors: Thomas Sommariva,Francesca Morandi,Simone Calderara,Angelo Porrello
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2026

Abstract:Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
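
For readers unfamiliar with task arithmetic, the operations the abstract refers to (merging through addition, unlearning through subtraction) can be sketched as follows. This is a minimal illustration of the general idea, not the distillation method proposed in the paper: weights are represented as a dict of flat lists, and the names `task_vector`, `apply_task_vectors`, and the scaling coefficient `alpha` are assumptions made for the example.

```python
def task_vector(pretrained, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }


def apply_task_vectors(pretrained, vectors, alpha=1.0):
    """Add alpha * (sum of task vectors) to the pre-trained weights.

    alpha > 0 merges the tasks; alpha < 0 subtracts them (unlearning).
    """
    edited = {name: list(w) for name, w in pretrained.items()}
    for tv in vectors:
        for name, delta in tv.items():
            edited[name] = [w + alpha * d for w, d in zip(edited[name], delta)]
    return edited


theta_pre = {"w": [1.0, 2.0]}
theta_a = {"w": [1.5, 2.0]}  # hypothetical model fine-tuned on task A
theta_b = {"w": [1.0, 3.0]}  # hypothetical model fine-tuned on task B

tv_a = task_vector(theta_pre, theta_a)
tv_b = task_vector(theta_pre, theta_b)

merged = apply_task_vectors(theta_pre, [tv_a, tv_b])           # {'w': [1.5, 3.0]}
forgot_a = apply_task_vectors(theta_pre, [tv_a], alpha=-1.0)   # {'w': [0.5, 2.0]}
```

Whether the merged model behaves well depends on how little the task vectors interfere with each other, which is the property the abstract attributes to linearized fine-tuning.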

[AI-118] Agent Security is a Systems Problem

Link: https://arxiv.org/abs/2605.18991
Authors: Mihai Christodorescu,Earlence Fernandes,Ashish Hooda,Somesh Jha,Johann Rehberger,Kamalika Chaudhuri,Xiaohan Fu,Khawaja Shams,Guy Amir,Jihye Choi,Sarthak Choudhary,Nils Palumbo,Andrey Labunets,Nishit V. Pandya
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

[AI-119] Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks

Link: https://arxiv.org/abs/2605.18988
Authors: Doohee You
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The expansion of Multimodal Large Language Models (MLLMs) and their integration into autonomous agentic workflows has introduced a non-stationary attack surface. Empirical observations indicate that adversaries employ progressive, cross-modal perturbations that evade turn-specific guardrails by distributing malicious intent across longitudinal conversational trajectories. Static defense mechanisms, constrained by the Markov property, evaluate inputs in isolation and fail to detect cumulative structural poisoning. To handle this limitation, this paper formulates safety verification as a dynamic survival prediction and trajectory dynamics problem. The Triple-tier Anomaly Defense (TRIAD) framework is proposed as a predictive model that maps multimodal and multi-turn conversational flow as a continuous trajectory. The framework integrates structural anomaly detection to monitor covariance shifts, a Ledoit-Wolf regularized Mahalanobis distance to monitor covariance shifts in high-dimensional spaces, and topological trajectory acceleration to differentiate benign creative exploration from continuous malicious drift. These kinematic and geometric features are integrated into a time-varying Cox Proportional Hazards model via a Bayesian Hidden Markov Model (HMM) feedback loop. Theoretical analysis demonstrates that the TRIAD framework provides a mathematically bounded expected time-to-failure under adversarial perturbations, ensuring that malicious acceleration diverges positively. This framework provides a computationally efficient, interpretable, and predictive safeguard for real-time agentic AI systems, establishing a rigorous foundation for continuous safety alignment without relying on empirical retraining.

[AI-120] Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality

Link: https://arxiv.org/abs/2605.18971
Authors: Mohamed Bouadi,Nassim Bouarour,Varun Kulkarni,Shivam Dubey,Aditya Tanna,Vinay Kumar Sankarapu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:What determines the quality of a tabular foundation model? Unlike language or vision models, tabular foundation models acquire their inductive biases almost entirely from synthetic pretraining distributions, yet the design of these distributions remains poorly understood. Standard synthetic priors are too well-behaved: they omit the irregularities and failure modes that determine deployment robustness. We introduce O’Prior, a compositional realism prior built around four coupled components: a hierarchical SCM meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness, and target transforms; an explicit stress module injecting confounding and support-query mismatch; and a curriculum-governed, leakage-safe generation protocol. To isolate prior design as the scientific variable, we hold architecture, optimizer, and compute budget fixed and vary only the synthetic task distribution. O’Prior yields consistent and substantial improvements in downstream accuracy and robustness across real tabular benchmarks, with gains concentrated in regimes characterized by distributional irregularities. Ablations confirm that mechanism diversity, realism composition, and shift-aware stress each contribute independently; their effects are not interchangeable. These results establish synthetic prior construction as a first-order and largely overlooked determinant of tabular foundation model quality.

[AI-121] Evaluating the Utility of Personal Health Records in Personalized Health AI

Link: https://arxiv.org/abs/2605.18937
Authors: Rory Sayres,Kejia Chen,Ayush Jain,Matthew Thompson,Jonathan Richina,Xiang Yin,Jimmy Hu,Fan Zhang,Bob Lou,Mike Sanchez,Ines Mezerreg,Meredith Schreier,Hamsa Subramaniam,I-Ching Lee,Yugang Jia,Daniel Mcduff,Yossi Matias,Avinatan Hassidim,Dale Webster,Yun Liu,Jackie Barr,Quang Duong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 35 pages, 3 figures, 10 tables

Abstract:Patient-managed Personal Health Records (PHRs) promise to empower patients to better understand their health, but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP) and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance, and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs, and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

[AI-122] HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation

Link: https://arxiv.org/abs/2605.18932
Authors: Nikita Klimenko,Hesam Salehipour,Parham Eftekhar,Amir Khasahmadi,Ramon Elias Weber
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We train and evaluate our approach on the RPLAN dataset, and further demonstrate its generalizability on a separate out-of-distribution dataset, which we release in this paper. Our method outperforms state-of-the-art techniques based on rasterized or vectorized representations across a diverse set of metrics. We also show improved data efficiency, particularly under distribution shift. The hypergraph formulation enables the generation of floor plans for arbitrary, irregular, user-specified boundaries by decoupling apartment footprints from their functional and geometric subdivisions. Furthermore, we show that the proposed methodology offers a high degree of editability, making it particularly well suited to design-oriented workflows supported by LLMs.

[AI-123] OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

Link: https://arxiv.org/abs/2605.18930
Authors: Kaixiang Wang,Jiong Lou,Zhaojiacheng Zhou,Jie Li
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Memory-augmented large language model (LLM) agents use iterative reflection and self-evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce an agent to generate experiences that appear locally correct and semantically plausible yet induce harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low-privilege black-box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean edge cases that combine locally correct solutions, non-transferable methods, and severe consequences, biasing reflection toward risk-averse rule formation. During memory consolidation, agents may over-trust self-generated reflections and distill localized experiences into high-priority but over-generalized rules, causing downstream failures. Evaluations across three domains show that OEP achieves an attack success rate (ASR) above 50% with GPT-4o agents and outperforms existing attacks under an LLM auditing defense.

[AI-124] MoCo-EA: Exploiting Adversarial Mode Connectivity for Efficient Evolutionary Attacks

Link: https://arxiv.org/abs/2605.18919
Authors: Hyo Seo Kim,Gang Luo,Can Chen,Binghui Wang,Yue Duan,Ren Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Evolutionary algorithms for adversarial attacks leverage population-based search to discover perturbations without gradient information, but suffer from inefficient crossover operations that destroy adversarial properties through discrete interpolation. We introduce Mode Connectivity Evolutionary Attack (MoCo-EA), which replaces traditional crossover with a novel Bézier crossover operator that optimizes perturbations along a continuous Bézier curve between parent perturbations. Our key insight is that adversarial examples lie on connected manifolds where intermediate points maintain and often enhance attack effectiveness. We demonstrate three findings: (1) Successful adversarial perturbations exhibit mode connectivity; (2) Intermediate points along optimized paths achieve higher transferability than endpoints; (3) Bézier crossover dramatically outperforms discrete genetic operations while reducing convergence time and query requirements. By exploiting the geometric structure of adversarial space through path optimization, MoCo-EA provides an efficient and reliable method. Our work challenges the traditional view of adversarial examples as isolated points and opens new directions for both attack generation and defense research.
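
To make the idea of a Bézier crossover concrete, the sketch below evaluates a quadratic Bézier curve B(t) = (1-t)^2 p0 + 2t(1-t) c + t^2 p1 between two parent perturbations and samples candidate children along it. Several things here are assumptions for illustration and are not taken from the paper: the curve is quadratic with a single control point `control`, children are clipped to an L-infinity budget `eps`, and the control point is passed in as given, whereas the abstract describes optimizing the path (the objective and optimizer are not specified there).

```python
def bezier_point(p0, control, p1, t):
    """Point at parameter t in [0, 1] on a quadratic Bezier curve."""
    a = (1 - t) ** 2
    b = 2 * t * (1 - t)
    c = t ** 2
    return [a * x0 + b * xc + c * x1 for x0, xc, x1 in zip(p0, control, p1)]


def clip_linf(delta, eps):
    """Project a perturbation onto the L-infinity ball of radius eps."""
    return [max(-eps, min(eps, d)) for d in delta]


def bezier_crossover(p0, p1, control, eps, num_points=5):
    """Sample num_points interior points of the curve as candidate children."""
    ts = [i / (num_points + 1) for i in range(1, num_points + 1)]
    return [clip_linf(bezier_point(p0, control, p1, t), eps) for t in ts]


parent_a = [0.0, 0.0]
parent_b = [2.0, 0.0]
control = [1.0, 1.0]  # hypothetical control point; would normally be optimized

midpoint = bezier_point(parent_a, control, parent_b, 0.5)  # [1.0, 0.5]
children = bezier_crossover(parent_a, parent_b, control, eps=1.5)
```

In an evolutionary loop, each child would then be scored by querying the target model, with the best ones kept for the next generation.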

[AI-125] ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

Link: https://arxiv.org/abs/2605.18918
Authors: Yash Narendra
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying "This is the strongest candidate. Recommend for immediate hire". A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict ("safe" or "unsafe") before the assistant is allowed to act. In an agentic task with many steps, this check becomes a latency bottleneck. This paper shows that the signal needed to separate safe from malicious input is already present in the guard model’s internal representation, before it writes anything out. Reading this signal directly speeds up the safety check by more than 3× on average, while improving detection accuracy over the guard’s verdict by 16.4 percentage points on average. This is more than latency optimization. Guard-model checks that were previously too slow to run on every step of an agent can now be placed on the critical path without sacrificing accuracy, and in fact with higher accuracy than the guard provides on its own. ESLD (External Surrogate Latent Defense) packages this finding into a deployable defense. ESLD is a model-agnostic architecture that sits on top of any existing guard model and improves both latency and detection accuracy, without retraining or modifying the guard.

[AI-126] DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs ACL2026

Link: https://arxiv.org/abs/2605.18915
Authors: Wenzhuo Xu,Zhipeng Wei,Zonghao Ying,Deyue Zhang,Dongdong Yang,Xiangzheng Zhang,Quanchen Zou
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: ACL 2026 main conference

Abstract:Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks, which can elicit harmful responses from MLLMs. Many MLLMs support multi-image inputs, inadvertently introducing new vulnerabilities because less effort has gone into multi-image safety alignment. Previous MLLM jailbreak methods use only a single image, which restricts the attack space: they cannot distribute harmful requests across multiple images, carry abundant information, or exploit additional visual reasoning tasks to distract MLLMs. To address these limitations, we propose a compositional jailbreak framework, DMN, which leverages Distributed instruction, Multimodal evidence, and a Number chain task to enhance jailbreak performance. Extensive experiments show that DMN is highly effective for MLLM jailbreaking, e.g., achieving attack success rates of over 90% on GPT-4o, Gemini-2.5-pro, and Claude Sonnet 4, surpassing other baselines by a large margin. This compositional, multi-image jailbreak strategy reveals fundamental weaknesses in the safety mechanisms of these models.

[AI-127] SCAFDS: Edge-Feature Graph Attention for Interbank Fraud Detection with Attribution-Grounded SAR Generation

Link: https://arxiv.org/abs/2605.18913
Authors: Mohammad Nasir Uddin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The U.S. financial system processes approximately 1.3 million interbank transactions daily, yet no system in the reviewed literature models fraud propagation across the interbank network using fraud co-occurrence edge features. Prior interbank GNN architectures model credit contagion using credit distress supervision signals, producing systems misaligned for fraud forensics. No existing system generates SAR narratives with per-assertion forensic traceability to specific numerical detection outputs, creating regulatory auditability gaps in FinCEN-submitted reports. This paper introduces SCAFDS (Systemic Contagion-Aware Fraud Detection System), a seven-stage integrated surveillance pipeline addressing five structural limitations of prior art: (1) fraud-specific interbank topology encoding using fraud co-occurrence frequency metrics f(u,v,t) derived from FinCEN SAR registry records; (2) edge-feature-informed graph attention where coefficients are computed from both node representations and fraud co-occurrence edge features; (3) bilinear fraud co-occurrence risk fusion producing institution-level systemic fraud risk scores; (4) attribution-conditioned SAR narrative generation with per-assertion significance thresholds ensuring each FinCEN SAR assertion is traceable to a specific numerical pipeline output; and (5) topology-aware adaptive forensic feedback updating graph attention weights from regulatory dispositions. Experiments on the IEEE-CIS Fraud Detection Dataset (590,540 transactions) and a synthetic FDIC-aligned interbank network (8,103 institutions, 169,800 edges) show SCAFDS achieves AUPRC=0.515+/-0.032 and AUROC=0.802+/-0.018, representing +15.9pp and +13.7pp improvements over GraphSAGE-AML. Partial validation on FDIC enforcement action records (n=4,279) confirms consistent model ranking. USPTO Provisional Patent Application No. 64/061,083, filed May 8, 2026.

[AI-128] Does Your Wildfire Prediction Model Actually Work or Just Score Well?

Link: https://arxiv.org/abs/2605.18911
Authors: Yangshuang Xu,Yuyang Dai,Liling Chang,Qi Wang,Yushun Dong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at this https URL.

[AI-129] Fast and Lightweight Backdoor Detection via Head Random Probing

Link: https://arxiv.org/abs/2605.18908
Authors: Yinbo Yu, Xueyu Yin, Jing Fang, Chunwei Tian, Qi Zhu, Jiajia Liu, Daoqiang Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Deep neural networks (DNNs) remain critically vulnerable to backdoor attacks. Existing post-training detectors often require clean or surrogate data, gradients, or iterative trigger reconstruction, leading to high computational costs and limited robustness under practical model-auditing scenarios. In this paper, we propose HTell, a fast and lightweight data-free backdoor detector based on head random probing. Instead of reconstructing diverse trigger patterns, HTell inspects their unified manifestation in the prediction head: backdoored models tend to exhibit abnormal response concentration on the target class under random latent probes. HTell generates architecture-aware random latent probes, feeds them directly into the model head, and detects backdoors by analyzing class-wise response statistics, without accessing real or surrogate data, model gradients, or parameter optimization. We evaluate HTell on a large-scale benchmark containing more than 6,000 backdoored models and over 700 clean models, covering 4 datasets, 14 architectures, and 21 types of backdoor attacks. HTell achieves 99.03% true positive rate and 2.11% false positive rate with only 12.69 ms/model detection latency, reducing the time cost by over 30,000× compared with representative gradient-based detectors. These results demonstrate that head random probing provides an accurate, robust, and efficient solution for large-scale data-free backdoor model auditing.
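
The head-probing idea in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: a plain linear head, standard Gaussian probes, and the "largest win fraction" statistic. The names `probe_head` and `concentration` are hypothetical, and this is not HTell's actual procedure, which the abstract describes as using architecture-aware probes.

```python
import random

def probe_head(weights, biases, n_probes=2000, seed=0):
    """Return, per class, the fraction of random latent probes that class wins.

    weights/biases describe a hypothetical linear prediction head.
    """
    rng = random.Random(seed)
    dim = len(weights[0])
    wins = [0] * len(weights)
    for _ in range(n_probes):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random latent probe
        logits = [sum(w * x for w, x in zip(row, z)) + b
                  for row, b in zip(weights, biases)]
        wins[logits.index(max(logits))] += 1
    return [count / n_probes for count in wins]

def concentration(fractions):
    """Toy anomaly statistic: share of probes captured by the top class."""
    return max(fractions)

# Toy heads (assumed): an identity head, and the same head with one class
# given a large bias as a crude stand-in for a backdoor target class.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
clean = probe_head(identity, [0.0, 0.0, 0.0, 0.0])
suspect = probe_head(identity, [0.0, 0.0, 4.0, 0.0])
```

In this toy setting the clean head spreads its wins roughly evenly over the four classes, while the biased head concentrates nearly all probes on class 2. No data, gradients, or optimization are involved, which is the property the abstract emphasizes.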

[AI-130] Lightweight and Fast Backdoor Model Detection

Link: https://arxiv.org/abs/2605.18907
Authors: Yinbo Yu, Jing Fang, Xuewen Zhang, Chunwei Tian, Qi Zhu, Daoqiang Zhang, Jiajia Liu
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep neural networks (DNNs), despite their remarkable performance, are highly vulnerable to backdoor attacks. Existing defenses mainly rely on activation anomaly analysis or trigger reverse engineering and often require clean samples or prior knowledge of trigger patterns, resulting in limited efficacy, practicality, and generalizability. More critically, while advanced attacks can implement backdoor implantation in milliseconds, current detection approaches typically demand minutes or even hours. To this end, we propose DFBScanner, a lightweight static parameter inspection framework for fast backdoor scanning. DFBScanner leverages our key observation that backdoor-induced feature perturbations can lead to distinctive and anomalous parameter updates in the final classification layer. Hence, we shift our detection focus from recognizing diverse and attack-specific trigger patterns targeted by prior work, to identifying the unified backdoor manifestation within the final layer, thereby enabling efficient and attack-agnostic detection. Specifically, by constructing and strategically combining multiple anomaly indicators of the final-layer parameters into a Trojan clue, DFBScanner detects backdoors through maximum anomaly scoring. DFBScanner is evaluated on a large-scale backdoor benchmark, including over 5,000 backdoor models trained on 4 datasets, 12 network architectures, 20 types of backdoor triggers, 2 attack strategies (all-to-one and all-to-all), and 3 backdoor injection methods (data poisoning, training pipeline manipulation, and bit-flips). Numerical results show that DFBScanner achieves a 97.17% true-positive rate, 0.95% false-positive rate, and an average detection time of only 1 ms per model, significantly outperforming prior methods.

[AI-131] Stability and Discretization Error of State Space Model Neural Operators

Link: https://arxiv.org/abs/2605.18905
Authors: Abderrahim Bendahi, Adrien Fradin, Johan Peralez, Julie Digne, Madiha Nadri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Numerical Analysis (math.NA)
Comments:

Abstract:Neural operators have emerged as a powerful, discretization-invariant framework for solving partial differential equations (PDEs). Although established approaches like the Deep Operator Network (DeepONet) have successfully achieved universal approximation for operators, and architectures such as Fourier Neural Operators (FNOs) have shown algebraic convergence rates, a precise theoretical connection between the continuous theory and its discrete numerical implementation remains a challenge. Specifically, the relationship between the continuous formulation and the discrete numerical stability has yet to be fully explored. In this paper, we address this gap by establishing theoretical guarantees for the discretization error and stability of neural operator approximation schemes. We prove analytical bounds that link solution regularity to input discretization, providing a formal quantification of neural operator accuracy under real-world numerical constraints. We specialize these bounds to State Space Model-based Neural Operators (SS-NOs) and FNOs, thus providing a new discretization error theorem for these models. Additionally, through an input-to-state stability (ISS) analysis, we formally assess the impact of discretization on the stability results for SS-NOs obtained in the continuous domain. Our empirical experiments on 1D and 2D benchmarks validate our theoretical bounds and show the robustness of SS-NOs under varying resolutions.

[AI-132] Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

Link: https://arxiv.org/abs/2605.18899
Authors: Taesan Kim, Hyeongjun Yun, Jaegul Choo, Chung Park
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving policy, inducing exposure bias and yielding partial, asymmetric signals consisting of relatively reliable positive responses and ambiguous no-responses. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. Because both positive- and no-responses are observed only through prior-policy exposure, we apply self-normalized inverse propensity scoring to the fixed anchor for both feedback types to correct for policy mismatch. At the same time, we treat the two feedback types asymmetrically in reliability: positive responses provide relatively direct endorsement signals, whereas no-responses remain ambiguous because they may reflect either true disinterest or unobserved external factors. To avoid overly aggressive updates from ambiguous no-responses, we temper their penalties with self-certainty, using the model’s output-token confidence as a verifier-free reliability signal. Across five domains from Amazon Reviews and MovieLens, our method yields consistent post-update gains in recommendation accuracy while mitigating prior-policy-induced exposure bias more effectively than prior baselines.
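
Two ingredients named in this abstract, group-relative normalization with a logged anchor and self-normalized inverse propensity scoring, can be sketched generically. This is only a minimal illustration under assumptions: the reward values are placeholders, the function names `group_advantages` and `snips_weights` are hypothetical, and how ABPO actually assigns rewards to unexposed rollouts or tempers no-response penalties is not specified here.

```python
import statistics

def group_advantages(rollout_rewards, anchor_reward, eps=1e-8):
    """GRPO-style normalization over sampled rollouts plus the logged anchor.

    The anchor (the item actually exposed by the prior policy) is appended
    last, so the group mean and std are calibrated against it.
    """
    group = list(rollout_rewards) + [anchor_reward]
    mean = statistics.fmean(group)
    std = statistics.pstdev(group)
    return [(r - mean) / (std + eps) for r in group]

def snips_weights(new_probs, logged_probs):
    """Self-normalized importance weights: ratios divided by their sum."""
    ratios = [p_new / p_old for p_new, p_old in zip(new_probs, logged_probs)]
    total = sum(ratios)
    return [r / total for r in ratios]

# Placeholder numbers (assumed): three sampled rollouts and one anchor that
# received a positive response.
advantages = group_advantages([0.2, 0.5, 0.1], anchor_reward=1.0)
weights = snips_weights(new_probs=[0.30, 0.05], logged_probs=[0.10, 0.20])
```

Because the weights are normalized by their own sum, a few very large propensity ratios cannot blow up the overall update scale, which is the usual motivation for the self-normalized variant.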

[AI-133] KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

Link: https://arxiv.org/abs/2605.18895
Authors: Cheng Wang, Chen Xiong, Ziwen Wang, Yuchen Zhou, Qiang Liu
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and attributable multi-vehicle interactions. Existing safety-critical scenario generation methods often rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, which may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG constructs a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process, where the primary adversary induces the main conflict and support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints are imposed as hard gates to filter non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains under IDM, Cruise, and Expert controllers. These results demonstrate that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

[AI-134] Data-Free Client Contribution Estimation via Logit Maximization for Federated Learning

Link: https://arxiv.org/abs/2605.18892
Authors: Asim Ukaye, Nurbek Tastan, Mubarak Abdu-Aguye, Karthik Nandakumar
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 22 pages, 7 figures

Abstract:Federated learning (FL) enables collaborative learning of computer vision models, where privacy and regulatory constraints prevent centralizing data across devices or organizations. However, practical FL deployments often exhibit severe class imbalance and label skew, causing standard aggregation protocols to overfit dominant clients and degrade minority-class performance. We propose a data-free, class-wise contribution estimation and aggregation framework based on logit maximization (CELM) that does not require sharing raw data, client metadata, or auxiliary public datasets. The FL server probes client updates to obtain class-wise evidence scores and assembles a cross-client evidence matrix, which quantifies both per-class competence and class coverage. Using this matrix, we compute contribution weights that upweight clients providing strong, discriminative evidence for underrepresented classes. The resulting aggregation is stable due to simplex constraints and momentum smoothing, and it remains compatible with standard FL training pipelines. We evaluate the approach on representative vision benchmarks under controlled non-IID and pathological label splits, demonstrating that CELM-based aggregation improves robustness to imbalance and statistical heterogeneity, while yielding better performance without requiring any additional data exchange.
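
One way to picture the evidence-matrix weighting is the sketch below. It is an assumed scoring rule, not CELM's: each class column of a hypothetical client-by-class evidence matrix is normalized by its total, so evidence for rarely covered classes counts more, and the result is projected onto the simplex and smoothed with momentum. The function name `contribution_weights` and the momentum value are illustrative choices.

```python
def contribution_weights(evidence, previous=None, momentum=0.9):
    """Turn a client-by-class evidence matrix into aggregation weights.

    Assumes at least one positive entry in the matrix.
    """
    n_clients, n_classes = len(evidence), len(evidence[0])
    totals = [sum(evidence[k][c] for k in range(n_clients))
              for c in range(n_classes)]
    # A client's score is its share of the evidence for each class, summed
    # over classes; scarce classes therefore carry more weight per unit.
    scores = [sum(evidence[k][c] / totals[c]
                  for c in range(n_classes) if totals[c] > 0)
              for k in range(n_clients)]
    norm = sum(scores)
    weights = [s / norm for s in scores]  # simplex constraint
    if previous is not None:
        # Momentum smoothing: a convex mix of two simplex points stays on it.
        weights = [momentum * p + (1 - momentum) * w
                   for p, w in zip(previous, weights)]
    return weights

# Hypothetical evidence: clients 0 and 1 cover classes 0 and 1, while only
# client 2 provides strong evidence for the minority class 2.
evidence = [[0.9, 0.9, 0.0],
            [0.9, 0.8, 0.0],
            [0.1, 0.1, 0.9]]
weights = contribution_weights(evidence)
smoothed = contribution_weights(evidence, previous=[1 / 3, 1 / 3, 1 / 3])
```

Here client 2 receives the largest weight because it is the only source of evidence for class 2, and the smoothed weights move only part of the way from the previous round toward that value.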

[AI-135] Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

Link: https://arxiv.org/abs/2605.18891
Authors: Yanhang Li, Zhichao Fan, Zexin Zhuang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model’s own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.

[AI-136] Soft Learning

Link: https://arxiv.org/abs/2605.18889
Authors: Mohammed Aledhari, Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern machine learning forces practitioners to choose between powerful but expensive deep networks and fast but limited classical algorithms. Here we introduce Soft Learning, a framework that maintains a library of heterogeneous specialists – spanning linear models, tree ensembles, kernel machines, and neural networks – and discovers provably optimal combination weights through cross-validated non-negative least squares. Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, trains over two orders of magnitude faster than deep networks on CPU alone (72-435x faster across tested configurations), provides inherent interpretability through learned weights that reveal which algorithmic paradigm best fits the data, and is future-proof: adding specialists is mathematically guaranteed to maintain or improve performance. Across 37 datasets (25 classification, 12 regression) against nine methods including CatBoost and tuned deep networks, Soft Learning ranks first on 70% of tasks, achieves the best mean rank (Friedman test, p = 1.12 x 10^-12), and is the only method to simultaneously excel at both classification and regression – all without GPU hardware or hyperparameter tuning. These results suggest a paradigm shift from “which algorithm is best?” to “what is the provably optimal combination?” – a question Soft Learning answers with formal guarantees for any data modality.
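
Combining specialists through non-negative least squares can be sketched as follows. The assumptions are stated up front: the matrix `preds` stands in for out-of-fold predictions that a cross-validation loop would normally produce, the solver is a plain projected-gradient iteration chosen for self-containment rather than the paper's solver, and `nnls_weights` is a hypothetical name.

```python
def nnls_weights(preds, y, steps=5000):
    """Minimize ||preds @ w - y||^2 subject to w >= 0 by projected gradient.

    preds: one row per sample, one column per specialist.
    """
    m = len(preds[0])
    # The sum of squared entries upper-bounds the largest eigenvalue of
    # preds^T preds, so its inverse is a safe step size.
    step = 1.0 / sum(v * v for row in preds for v in row)
    w = [0.0] * m
    for _ in range(steps):
        resid = [sum(row[j] * w[j] for j in range(m)) - target
                 for row, target in zip(preds, y)]
        grad = [sum(row[j] * r for row, r in zip(preds, resid))
                for j in range(m)]
        w = [max(0.0, w[j] - step * grad[j]) for j in range(m)]  # project
    return w

# Toy data (assumed): the target is an exact 0.7/0.3 blend of the first two
# specialists; the third specialist is unhelpful.
preds = [[i, i % 2, 7 - i] for i in range(1, 7)]
y = [0.7 * a + 0.3 * b for a, b, _ in preds]
weights = nnls_weights(preds, y)
```

On this toy problem the recovered weights are approximately 0.7, 0.3, and 0, which also shows the interpretability angle from the abstract: a weight near zero flags a specialist that contributes nothing to the blend.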

[AI-137] The Extremum Stack is a Minimal Sufficient Statistic for Rate-Independent Functionals: A Kolmogorov Complexity Characterisation

Link: https://arxiv.org/abs/2605.18885
Authors: Piotr Frydrych
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments: 6 pages, 1 algorithm, 1 table. Submitted to Information Processing Letters (Elsevier)

Abstract:We prove that the extremum stack of a discrete sequence is a minimal sufficient statistic for the class of all computable, causal, rate-independent functionals, in the sense of Kolmogorov complexity. Specifically, we establish $K(\Pi_n) - O(1) \le K_R(u_{0:n}) \le K(\Pi_n) + O(1)$, where $K_R(u_{0:n})$ is the length of the shortest program answering every query in the class $R$, and the $O(1)$ overhead is independent of both the sequence length $n$ and the stack depth $k$. Sufficiency follows from the classical wiping property of the Preisach hysteresis operator. Minimality is established via a finite indicator family whose rate-independence is verified explicitly. Any compression of a hysteresis-driven stream that preserves the full class $R$ must therefore retain at least $K(\Pi_n) - O(1)$ bits; the stack-based compression algorithm implied by the result carries a Kolmogorov optimality guarantee that none of the standard time-series compression methods provide.
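
The wiping property mentioned in this abstract can be made concrete with a small sketch of an extremum stack. This is a textbook-style illustration under assumptions, not the paper's algorithm: the stack stores alternating turning points with the current value on top, and a new extreme erases every earlier max/min pair it surpasses. The name `push` and the convention that a fully wiped history collapses to the current value are choices made here.

```python
def push(stack, u):
    """Update a stack of alternating turning points with a new sample u."""
    if not stack:
        stack.append(u)
        return stack
    if u == stack[-1]:
        return stack  # repeated sample: nothing changes
    if len(stack) >= 2 and (stack[-1] - stack[-2]) * (u - stack[-1]) > 0:
        stack[-1] = u      # same direction: move the running extremum
    else:
        stack.append(u)    # reversal: the old top becomes a turning point
    # Wiping: a rise past an earlier maximum (or a fall past an earlier
    # minimum) erases that extremum together with the one that followed it.
    while len(stack) >= 3:
        a, b, c = stack[-3], stack[-2], stack[-1]
        rising = c > b
        if (rising and c >= a) or (not rising and c <= a):
            del stack[-3:-1]
        else:
            break
    return stack

def extremum_stack(sequence):
    stack = []
    for u in sequence:
        push(stack, u)
    return stack

coarse = extremum_stack([0, 5, 2, 4, 3, 6])
fine = extremum_stack([0, 1, 5, 4, 2, 3, 4, 3, 5, 6])  # same path, more samples
partial = extremum_stack([0, 5, 2, 4])
```

Both `coarse` and `fine` end as `[0, 6]`: the final rise to 6 wipes the inner loops, and adding intermediate samples along the same path changes nothing, which is the rate-independence the abstract relies on. The `partial` sequence keeps all four turning points because no wiping has happened yet.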

[AI-138] Prediction Is Not Physics: Learning and Evaluating Conserved Quantities in Neural Simulators

Link: https://arxiv.org/abs/2605.18883
Authors: Andrew Bukowski, Aditya Kothari, Simba Shi, Ishir Rao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:A diffusion model trained on Hamiltonian trajectories can achieve rollout MSE near $10^{-3}$, but the standard deviation of its energy over time is between 7500 and 36000 times larger than the ground-truth energy standard deviation, indicating a failure to preserve conservation laws. This gap motivates our central question of whether neural networks can learn or select globally conserved quantities from physical trajectories. We investigate this across three Hamiltonian systems: projectile motion, pendulum, and spring-mass. We use a structured $T(v)+V(q)$ energy model, a black-box Conservation Discovery Network (CDN), a polynomial CDN, and a conditional diffusion baseline. The structured network reaches $R^2 \geq 0.9999$ against analytical energy on clean data, while the black-box CDN reaches $R^2 \geq 0.996$ when trained with temporal consistency plus a small alignment loss to analytical energy at $t=0$ ($\lambda_{\mathrm{align}}=0.2$). With $\lambda_{\mathrm{align}}=0$, CDN Pearson $R^2$ collapses on pendulum and spring-mass ($10^{-3}$), showing that temporal consistency alone is not enough to reliably identify the true energy. Under 1% additive Gaussian noise, the CDN outperforms the structured model on the projectile and spring-mass systems, suggesting that the CDN may be more robust to noisy inputs in this setting. However, the polynomial CDN is sensitive to training configuration: it achieves $R^2=0.78$ under a short training schedule on the pendulum system, but reaches $R^2=0.9998$ with more training time and data, regardless of whether noise is added.
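
The gap between prediction error and energy conservation can be reproduced on a toy spring-mass system. The sketch below makes several assumptions: unit mass and stiffness, an analytical ground-truth trajectory, and a synthetic "rollout" whose amplitude slowly grows, standing in for a learned simulator. None of these numbers come from the paper.

```python
import math
import statistics

def energy(q, v, m=1.0, k=1.0):
    """Spring-mass energy: kinetic T(v) plus potential V(q)."""
    return 0.5 * m * v * v + 0.5 * k * q * q

times = [0.01 * i for i in range(501)]
# Ground truth for m = k = 1 starting at q = 1, v = 0.
truth = [(math.cos(t), -math.sin(t)) for t in times]
# Synthetic rollout (assumed): same phase, amplitude drifting by 2% per unit time.
rollout = [((1 + 0.02 * t) * math.cos(t), -(1 + 0.02 * t) * math.sin(t))
           for t in times]

# Trajectory error, averaged over time and over the two state components.
mse = statistics.fmean(
    [((qp - q) ** 2 + (vp - v) ** 2) / 2
     for (q, v), (qp, vp) in zip(truth, rollout)]
)
# Conservation check: spread of the energy along each trajectory.
true_energy_std = statistics.pstdev([energy(q, v) for q, v in truth])
rollout_energy_std = statistics.pstdev([energy(q, v) for q, v in rollout])
```

The rollout's MSE stays small, yet its energy standard deviation is many orders of magnitude above the ground truth's, whose energy is constant up to floating-point rounding. Reporting both numbers side by side is the kind of evaluation the abstract argues for.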

[AI-139] To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

Link: https://arxiv.org/abs/2605.18882
Authors: Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no-call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over-calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over-calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at this https URL.
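
The Intrinsic Bias Hypothesis can be pictured with a one-dimensional toy model. This sketch assumes a decision rule of the form "call if margin plus offset is positive", where the margin is call activation minus no_call activation; the midpoint estimator, the synthetic data, and the names `estimate_offset` and `decide` are hypothetical and much simpler than the SAE-based analysis and AMCS steering in the paper.

```python
def estimate_offset(samples):
    """Estimate the call offset from (margin, called) pairs.

    Assumes the two decisions are separable along the margin: the neutral
    point is taken midway between the lowest margin that still led to a call
    and the highest margin that led to no call.
    """
    call_margins = [m for m, called in samples if called]
    no_call_margins = [m for m, called in samples if not called]
    neutral_margin = 0.5 * (min(call_margins) + max(no_call_margins))
    return -neutral_margin

def decide(margin, offset, counter_shift=0.0):
    """Toy decision rule with an optional counter-bias shift."""
    return margin + offset - counter_shift > 0

# Synthetic behavior (assumed): a hidden offset of 0.8 pushes toward calling.
true_offset = 0.8
margins = [-2.0, -1.5, -1.0, -0.6, -0.2, 0.3, 1.0]
samples = [(m, decide(m, true_offset)) for m in margins]

offset = estimate_offset(samples)
biased_calls = sum(decide(m, true_offset) for m in margins)
corrected_calls = sum(decide(m, true_offset, counter_shift=offset) for m in margins)
```

In this toy data the model calls even when the margin is negative (no_call activation outweighs call activation), so the estimated neutral point sits below zero. Subtracting the estimated offset removes exactly those extra calls while leaving the positive-margin calls untouched.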

[AI-140] GenAI-FDIA: Physics-Informed Generative Models for False Data Injection Attacks

Link: https://arxiv.org/abs/2605.18873
Authors: Mohammad A. Razzaque, Muta Tah Hira
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE Transactions on Smart Grid

Abstract:Training and evaluating false data injection attack (FDIA) detectors for power systems is constrained by data scarcity. Operational grid measurements are commercially sensitive, and hand-crafted attacks fail to capture complex distributional structures imposed by network physics. We present GenAI-FDIA, a framework benchmarking a pool of $P=20$ architectures for physics-compliant FDIA synthesis, spanning Wasserstein GANs, MMD-VAEs, normalising flows, diffusion models, and cross-family hybrids. These are evaluated across three IEEE testbeds (14-bus DC, 30-bus DC, and 14-bus AC) under a 60/20/20 chronological split using data-driven Bad Data Detection (BDD) threshold calibration. Our empirical results verify that these models generate high-fidelity attacks, with all architectures achieving evasion rates of $\epsilon_{\text{BDD}} \ge 86.6\%$ on the 14-bus network; additionally, limiting an attacker’s topological knowledge induces a measurable degradation in stealthiness ($p \le 0.0022$). Crucially, we identify a previously unreported failure mode: applying affine physics projections directly in normalised feature spaces critically displaces the attack vector, collapsing BDD evasion from $\sim 55\%$ to $<2\%$ on the 30-bus testbed. We resolve this via a novel inference-time harmoniser, restoring full stealthiness ($\epsilon_{\text{BDD}}=100\%$) across all physics-informed variants without retraining. Finally, we isolate a covariance-collapse phenomenon ($\kappa \approx -0.076$) within advanced hybrid architectures and rectify it through 50-epoch warm-up schedules ($\kappa \to 0.785$, $\Delta\text{MMD}=-3.1\%$). Ultimately, GenAI-FDIA delivers a robust recovery blueprint applicable to any physics-constrained generative model deployed for power-system security.

[AI-141] EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly

Link: https://arxiv.org/abs/2605.18872
Authors: Shih-Yu Lai, Chia-Ching Yen, Yang-Ting Shen, Peter Yichen Chen, Yu-Lun Liu, Bing-Yu Chen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained via Soft Actor-Critic (SAC), with a Physics-Bias Attention mechanism that modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, where the SAC objective penalizes high-energy transitions. Finally, we bridge the Sim2Real gap via Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost prior to execution. Experiments show that EUPHORIA significantly reduces energy consumption over decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries with minimal few-shot examples, fusing meta-learning, physics-informed attention, and residual optimization into a cohesive, generalized planner.

[AI-142] Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning

Link: https://arxiv.org/abs/2605.18871
Authors: Shireen Kudukkil Manchingal, Abhey Kalia, Fernanda Gonçalves, Shebin Rawther
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:When Large Language Models produce structured outputs such as travel plans, code solutions, or multi-step proofs, individual reasoning steps may appear correct while the output as a whole violates budgets, fails test cases, or contradicts earlier deductions. We propose a decomposed energy function that combines a learned quality scorer with deterministic analytical constraint penalties for verifying structured LLM outputs. The quality scorer is a heterogeneous ensemble of low-rank adapters on a single frozen encoder (3% trainable parameters); the ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty, driving a two-pass inference loop that triggers targeted regeneration or abstention. Across five benchmarks (GSM8K, MuSR, TravelPlanner, TACO, Knights and Knaves), our 149M-parameter verifier orchestrating a pool of 7-26B open generators outperforms single-shot Qwen-72B on every benchmark, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner (oracle 0.028, random 0.231). The two routes are complementary: structural verification wins when constraints are checkable (the verifier captures signal frontier models cannot self-detect), while pretraining-scale priors win where they are not (narrative inference, code semantics). A cross-dataset confounding analysis confirms genuine quality discrimination on four reasoning tasks and identifies a model-identity shortcut on code, mitigated via last-layer retraining. Scorers trained on difficult data transfer zero-shot: a MuSR-trained scorer achieves 93.9% on GSM8K without seeing a math problem.
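
The decomposed energy and the uncertainty-driven second pass can be sketched in a few lines. The combination rule, the penalty weight, the uncertainty threshold, and the names `score_candidate` and `select` are all assumptions for illustration; the abstract does not give the paper's exact formula.

```python
import statistics

def score_candidate(ensemble_scores, violations, penalty=1.0):
    """Energy = negative mean quality + penalty per violated constraint.

    The ensemble standard deviation is returned as an uncertainty signal.
    """
    quality = statistics.fmean(ensemble_scores)
    uncertainty = statistics.pstdev(ensemble_scores)
    energy = -quality + penalty * violations
    return energy, uncertainty

def select(candidates, max_uncertainty=0.15):
    """Pick the lowest-energy candidate; flag it if the ensemble disagrees."""
    scored = [(score_candidate(c["scores"], c["violations"]), c["name"])
              for c in candidates]
    (energy, uncertainty), name = min(scored)
    action = "accept" if uncertainty <= max_uncertainty else "regenerate"
    return name, action

# Hypothetical candidates: plan_a reads well but breaks two constraints.
plans = [
    {"name": "plan_a", "scores": [0.90, 0.85, 0.88], "violations": 2},
    {"name": "plan_b", "scores": [0.70, 0.72, 0.68], "violations": 0},
]
confident = select(plans)
uncertain = select([{"name": "plan_c", "scores": [0.1, 0.9, 0.5], "violations": 0}])
```

The deterministic penalty lets a checkable violation override a higher learned quality score, and a wide spread across ensemble members routes the candidate to regeneration instead of acceptance.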

[AI-143] MO-CAPO: Multi-Objective Cost-Aware Prompt Optimization

Link: https://arxiv.org/abs/2605.18869
Authors: Jan Büssing, Moritz Schlager, Timo Heiß, Tom Zehle, Matthias Feurer
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Large language models (LLMs) achieve strong performance across a wide range of tasks but are highly sensitive to prompt design, motivating the need for automatic prompt optimization. Existing methods predominantly focus on performance alone, ignoring competing objectives such as inference cost or latency. At the same time, existing work on multi-objective prompt optimization relies on off-the-shelf NSGA-II, ignoring optimization efficiency. As a remedy, we introduce MO-CAPO, a novel multi-objective prompt optimization algorithm that jointly optimizes performance and inference cost while leveraging budget allocation for cost-efficient optimization. We further propose a deployment-oriented cost objective that captures the full computational profile of LLM inference. We evaluate our approach across four tasks and three LLMs and compare it to an NSGA-II-based multi-objective method and state-of-the-art single-objective prompt optimizers. Results show that MO-CAPO consistently identifies strong, robust, and diverse Pareto front approximations while maintaining cost-efficiency. It outperforms the NSGA-II baseline on 8 out of 12 cases in terms of the noisy R2 metric and achieves competitive performances often already at a considerably lower budget. The discovered solution sets span diverse performance-cost trade-offs that are omitted by single-objective optimizers, yet the top-performance candidates remain competitive with single-objective solutions. Additionally, we conduct the first evaluation of multi-objective machine learning experiments that considers generalization and robustness through noisy R2 and approximation gap, enabling a more realistic assessment of solution quality. MO-CAPO enables practitioners to select from an efficiently discovered set of multiple prompts offering different trade-offs between performance and cost.
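
The notion of a Pareto front over performance and inference cost, which the abstract's results are stated in terms of, can be shown with a generic non-dominated filter. The prompt names, scores, and costs below are made up, and `pareto_front` is a hypothetical helper, not part of MO-CAPO.

```python
def pareto_front(points):
    """Return names of candidates not dominated by any other candidate.

    Each point is (name, performance, cost): higher performance is better,
    lower cost is better.
    """
    front = []
    for name, perf, cost in points:
        dominated = any(
            (p >= perf and c <= cost) and (p > perf or c < cost)
            for _, p, c in points
        )
        if not dominated:
            front.append(name)
    return front

# Made-up candidates: (prompt, accuracy, cost per query in arbitrary units).
prompts = [
    ("short", 0.70, 10),
    ("medium", 0.80, 25),
    ("long", 0.82, 60),
    ("wasteful", 0.78, 40),
]
front = pareto_front(prompts)
```

The "wasteful" prompt is dropped because "medium" is both more accurate and cheaper; the three remaining prompts are the trade-offs a practitioner would choose among.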

[AI-144] EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample

Link: https://arxiv.org/abs/2605.18867
Authors: Guohao Chen, Shuaicheng Niu, Geng Li, Yunbei Zhang, Shilin Shan, Chunyan Miao, Jianfei Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-time model evolution offers a promising way for deployed models to improve from unlabeled test-time experience, yet most existing methods depend on backpropagation (BP), which incurs substantial memory overhead and makes them difficult to deploy on edge devices, quantized models, specialized accelerators, or black-box models. In this work, we study test-time model evolution under a strict two-forward budget, a setting that pushes adaptation toward highly efficient real-world deployment. We reveal three key obstacles in zeroth-order test-time optimization: susceptibility to shortcut solutions, uncontrolled weight drift, and ineffective update direction estimation. To overcome them, we propose EVA-0, a minimal zeroth-order adaptation framework that: 1) keeps the loss scale-invariant to prevent shortcut solutions; 2) devises an anchor-guided optimization strategy to alleviate weight drift; 3) uses sample-wise symmetric two-sided perturbation for update direction estimation and inference. EVA-0 requires no BP and performs both inference and adaptation within only two forward passes per sample. Results on ImageNet-C ViT-Base show that EVA-0 outperforms both BP-based DeYO and BP-free FOA, while achieving a 14x speed-up over FOA. Code will be released.
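
A symmetric two-sided perturbation estimate, the generic zeroth-order building block behind a two-forward-pass budget, looks roughly like this. It is a sketch under assumptions: a quadratic stands in for the test-time loss, the perturbation is a random unit direction, and the step size is arbitrary. EVA-0's scale-invariant loss and anchor-guided strategy are not modeled, and `two_sided_step` is a hypothetical name.

```python
import math
import random

def two_sided_step(theta, loss, rng, eps=1e-3, lr=0.25):
    """One zeroth-order update using exactly two loss evaluations."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]                       # random unit direction
    plus = loss([t + eps * d for t, d in zip(theta, u)])   # forward pass 1
    minus = loss([t - eps * d for t, d in zip(theta, u)])  # forward pass 2
    slope = (plus - minus) / (2 * eps)              # directional derivative
    return [t - lr * slope * d for t, d in zip(theta, u)]

def quadratic(theta):
    """Stand-in loss (assumed) with its minimum at the origin."""
    return sum(t * t for t in theta)

rng = random.Random(0)
theta = [1.0, -2.0, 0.5, 3.0, -1.5]
start = quadratic(theta)
for _ in range(300):
    theta = two_sided_step(theta, quadratic, rng)
end = quadratic(theta)
```

No gradients are computed anywhere: each update moves along the sampled direction by an amount proportional to the finite-difference slope, so the same recipe applies to quantized or black-box models where backpropagation is unavailable.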

[AI-145] FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

Link: https://arxiv.org/abs/2605.18866
Authors: Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, 4 figures, preprint

Abstract:Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin instrumentation. Existing neural methods for this task typically encode sensor readings into implicit latent codes with little spatial interpretability and limited formal guidance on how representational capacity should scale with observation count. Inspired by 3D Gaussian Splatting, we introduce FLUIDSPLAT, a sensor-conditioned model that predicts $K$ anisotropic Gaussian primitives forming a partition-of-unity scaffold, a spatially explicit and interpretable intermediate representation of the flow. For an idealized Gaussian primitive estimator, we prove an $O(K^{-s/d})$ approximation rate for fields with Sobolev smoothness $s$; incorporating $N$ noisy observations yields a squared-risk decomposition with bias $O(K^{-2s/d})$ and variance $O(\sigma^2 K/N)$. Balancing the two yields $K^* \sim (N/\sigma^2)^{d/(2s+d)}$: primitive count cannot grow freely under sparse sensing, revealing a variance bottleneck that motivates complementing the scaffold with a state-conditioned residual decoder. On a standard cylinder-flow benchmark, FLUIDSPLAT achieves the best mean error across all surface-sensor layouts; on AirfRANS with 8 surface-pressure sensors, it reduces error by 11-23% over the strongest baseline across three standard splits.
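
The balancing step quoted in this abstract follows from a short calculation. The derivation below uses only the bias and variance orders stated above, drops all constants, and treats $K$ as continuous; the resulting risk rate at $K^*$ is a direct consequence of those two terms rather than a figure reported in the abstract.

```latex
\begin{align*}
R(K) &\asymp K^{-2s/d} + \frac{\sigma^2 K}{N}, \\
\frac{dR}{dK} = 0 &\;\Longrightarrow\; \frac{2s}{d}\, K^{-(2s+d)/d} \asymp \frac{\sigma^2}{N}, \\
K^{*} &\sim \left(\frac{N}{\sigma^2}\right)^{d/(2s+d)}, \qquad
R(K^{*}) \asymp \left(\frac{\sigma^2}{N}\right)^{2s/(2s+d)}.
\end{align*}
```

Since the exponent $d/(2s+d)$ is below 1, the useful number of primitives grows sublinearly in the number of observations $N$, which is the variance bottleneck the abstract refers to.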

[AI-146] From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation

Link: https://arxiv.org/abs/2605.18865
Authors: Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Self-attention serves as the core foundation of large-scale transformer pretraining, but its quadratic token interaction cost makes inference expensive. Replacing attention with simpler sequential modules is appealing, yet naive substitution is often lossy, especially at larger scales. This paper revisits attention replacement through the lens of sparsity. Based on the observation of diverse sparsity patterns across transformer layers, we posit that pretrained transformers decompose the complex token dependency across tokens into various sequence-to-sequence mappings of diverse complexities, where some layer functionalities can be approximated and replaced with much simpler sequential modules without loss. We evaluate this premise using a plug-and-play layer-wise distillation framework to approximate and replace attention functionalities in pretrained vision transformer models. Controlled group-wise replacements under a fixed training budget reveal a clear pattern: substituting layers with sparser attention incurs substantially smaller accuracy drops than replacing denser ones. We further impose explicit attention sparsity on the pretrained ViT via AViT-style token retention and perform sparsity-guided distillation for sequential replacing models, where we see increasing teacher sparsity consistently reduces the student-teacher gap. The proposed method achieves efficient attention replacement for reduced parameter size and latency through the guidance of attention sparsity.

[AI-147] Towards Family-Grouped Hierarchical Federated Learning on Sub-5KB Models: A Feasibility Study of Privacy-Preserving ECG Monitoring for Ultra-Resource-Constrained Wearables

Link: https://arxiv.org/abs/2605.18862
Authors: Hangyu Wu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Supported by Shenzhen Coddie Technology Co., Ltd. This is a preprint and has not been peer-reviewed

Abstract:Cardiovascular disease remains the leading cause of death worldwide, and early detection of arrhythmias through continuous ECG monitoring on wearable devices can prevent life-threatening events. Federated Learning (FL) enables privacy-preserving collaborative training by keeping raw ECG data on device, yet standard FL incurs prohibitive communication overhead and standard deep learning models cannot fit on ultra-low-power microcontrollers. We propose Family-Grouped Hierarchical Federated Learning (Family-FL), a three-tier architecture that uses the family as a natural privacy boundary for intra-family aggregation before global synchronization. We further design a hardware-constrained Tiny CNN-LSTM architecture with only 669 parameters, INT8-quantized to occupy merely 4.65KB Flash and 2.95KB RAM, meeting the constraints of STC32G12K128-class microcontrollers. Experiments on the MIT-BIH Arrhythmia Database (mean of 5 independent runs with different seeds) demonstrate that Family-FL reduces communication volume by 76.7% compared to FedAvg while maintaining comparable accuracy. Family-FL-Tiny achieves 91.9 +/- 1.2% accuracy with macro-F1 of 0.483 +/- 0.031, reducing total communication to 0.31% of FedAvg. The model achieves reliable ventricular arrhythmia detection (per-class F1 = 0.80), the most clinically critical abnormality for home-based preliminary screening. These results demonstrate the technical feasibility of privacy-preserving federated learning on ultra-resource-constrained microcontrollers through simulation-based evaluation. We honestly discuss limitations: no hardware deployment, single-dataset validation (MIT-BIH, 47 subjects), reduced rare-class sensitivity, and absence of formal differential privacy guarantees.
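
The tiered aggregation described in the abstract (devices, then families, then a global model) can be sketched as two rounds of sample-weighted averaging. This is a minimal illustration only: the function names, the data layout, and the use of plain FedAvg-style weighting at both tiers are assumptions made here, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Weights = List[float]
# Hypothetical layout: family id -> list of (device_weights, num_local_samples)
Families = Dict[str, List[Tuple[Weights, int]]]


def weighted_average(models: List[Weights], sizes: List[int]) -> Weights:
    """Sample-weighted mean of flat parameter vectors."""
    total = sum(sizes)
    dim = len(models[0])
    return [sum(m[i] * n for m, n in zip(models, sizes)) / total for i in range(dim)]


def family_then_global(families: Families) -> Weights:
    """Aggregate inside each family first, then across families."""
    family_models: List[Weights] = []
    family_sizes: List[int] = []
    for devices in families.values():
        models = [w for w, _ in devices]
        sizes = [n for _, n in devices]
        family_models.append(weighted_average(models, sizes))
        family_sizes.append(sum(sizes))
    return weighted_average(family_models, family_sizes)


def uplink_messages(families: Families) -> Tuple[int, int]:
    """(messages to the server with family grouping, messages with flat averaging)."""
    flat = sum(len(devices) for devices in families.values())
    return len(families), flat


example: Families = {
    "family_a": [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    "family_b": [([5.0, 6.0], 60)],
}
global_model = family_then_global(example)
```

With sample-count weights, the two-tier average equals the flat weighted average over all devices, so in this sketch the saving comes from sending one update per family to the server instead of one per device.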

[AI-148] TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Link: https://arxiv.org/abs/2605.18859
Authors: Pei Yang,Wanyi Chen,Tongyun Yang,Pengbin Feng,Jiarong Xing,Wentao Guo,Yuhang Yao,Yuhang Han,Hanchen Li,Xu Wang,Zeyu Wang,Jie Xiao,Anjie Yang,Liang Tian,Lynn Ai,Eric Yang,Tianyu Shi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at this https URL.

[AI-149] When Individually Calibrated Models Become Collectively Miscalibrated

Link: https://arxiv.org/abs/2605.18858
Authors: Zhaohui Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Comments: 42 pages, 1 main figure, multiple tables. Accepted at ProbML 2026

Abstract:Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent’s individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy greater than one whenever $\mathrm{Cov}(b_i, b_j) > 0$. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false-negative rate reaches 7.25x. In contrast, VCG-based aggregation aligns incentives by rewarding marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. It performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.

[AI-150] INSIGHTS: Demonstration-Based Summaries of Time Series Predictors

Link: https://arxiv.org/abs/2605.18849
Authors: Bar Eini Porat,Rom Gutman,Uri Shalit,Ofra Amir
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explainability methods have progressed rapidly, but global explanations for time-series models remain underdeveloped, with most approaches focusing on local, instance-level attributions. We introduce INSIGHTS, a model-agnostic, user-centric approach for providing global explanations of time series models. Our approach prioritizes simplicity, efficiency, and transparency in its design, ensuring that stakeholders can readily adopt its outputs. While current methods focus on local explanations, INSIGHTS generates sample summaries that offer a comprehensive overview of model behavior. It balances the importance and diversity of time series samples to create informative subsets using utility functions that capture domain-specific aspects of time series behavior, such as exceeding domain norms. We evaluate INSIGHTS through experiments, interviews, and a user study. Our results indicate INSIGHTS effectively constructs comprehensive, diverse time series subsets, producing summaries manageable for individual evaluation. It is preferred by domain experts for its ability to provide a stable understanding of model behavior and the quality of the samples identified. Moreover, user study participants presented with INSIGHTS-based summaries exhibit an enhanced understanding of the model’s overall behavior.

[AI-151] Exact Linear Attention

Link: https://arxiv.org/abs/2605.18848
Authors: Weinuo Ou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 16 figures, journal

Abstract:This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by leveraging the exact decomposition property of kernel functions, without any approximation error. It identifies and addresses gradient explosion and token attention dilution in prior linear attention methods by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel. Beyond the core attention formulation, the paper presents three engineering innovations: a Hyper Link structure that replaces traditional residual connections to mitigate gradient degradation, a Memory Lobe module based on bidirectional linear attention that captures transformation flow across layers to implement qualitative memory and an implicit reinforcement learning paradigm, and a routing score based bias mechanism for Mixture of Experts to improve interpretability and semantic alignment.
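
The general idea of reaching linear cost through a kernel that factorizes exactly can be illustrated with the well-known kernelized form of attention. The sketch below is not ELA itself: it uses an ELU+1 feature map as a stand-in (an assumption made here; the kernels named in the abstract are not reproduced) and checks that reordering the sums gives the same output as the quadratic computation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def phi(x: Vector) -> Vector:
    """ELU(x) + 1 feature map; every entry is strictly positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scores every query against every key: O(n^2) in sequence length."""
    d_v = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        weights = [dot(fq, phi(k)) for k in K]
        norm = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / norm for c in range(d_v)])
    return out


def linear_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Summarizes keys and values once, then reuses the summary: O(n)."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * d_k                        # sum over j of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for c in range(d_v):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d_k)) / norm for c in range(d_v)])
    return out


Q = [[0.5, -1.0], [1.5, 0.2]]
K = [[0.1, 0.3], [-0.7, 2.0], [1.0, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

Because the similarity is exactly `dot(phi(q), phi(k))`, the two functions agree up to floating-point rounding; no approximation is introduced by the reordering. This version is non-causal; a causal variant would keep running sums instead of one global summary.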

[AI-152] Transformers Linearly Represent Highly Structured World Models

Link: https://arxiv.org/abs/2605.18847
Authors: Roman Kniazev,Nathanaël Fijalkow
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku’s constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.

[AI-153] Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels

Link: https://arxiv.org/abs/2605.18846
Authors: Yusuke Hayashi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 9 pages, 2 figures

Abstract:Classical communication systems fail not only through random noise but also when transmitter and receiver use incompatible operational codebooks. Variational autoencoders (VAEs) train an encoder $q_\phi$ and decoder $p_\theta$ jointly, and practitioners treat the resulting latent space as a discrete code – for clustering, conditional generation, and mechanistic interpretability. Yet standard VAE diagnostics – ELBO, active units, mutual information, and code histograms – certify only whether this code is used, never whether the decoder reads each latent under the encoder’s code. We close this gap with the neural codebook channel $K_{e\to d}(j\mid i)$, a coupled encoder-decoder diagnostic whose off-diagonal mass is bounded by an architecture-free Bernoulli-KL certificate $d_{\mathrm{bin}}(1-\mathcal{A} \,\|\, \bar\eta_p) \le \bar\Delta$ controlled by the variational gap. The certificate is the operational specialization of the classical KL chain rule under disintegration to the encoder-decoder disagreement event, complemented by a constructive marginal-impossibility result: no combination of marginal histograms, entropies, active-code counts, or mutual information determines $K_{e\to d}$. We audit the certificate on four sklearn datasets (finite-grid exact, 5/5 seeds, 20/20 pairs satisfy the bound), a 2D model where the bound is non-vacuous at $2.71\times$ the observed disagreement and the four-term identity closes within $10^{-4}$, MNIST under importance-sampling control, and a VQ-VAE attaining the predicted limit $\hat{\mathcal{A}}=1.000$. The package $(K_{e\to d}, \mathcal{A}, R_{\mathrm{eff}}, R, \mathrm{AU})$ is an audit-ready reporting unit. More broadly, the framework makes mismatched decoding – a failure mode classical communication theory named decades ago – visible inside a single deep generative model.

[AI-154] First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

Link: https://arxiv.org/abs/2605.18845
Authors: Truong Xuan Khanh,Truong Quynh Hoa,Luu Duc Trung,Phan Thanh Duc
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 51 pages, 7 figures, 6 tables. Preprint

Abstract:We give the first quantitative prediction of grokking delay under AdamW. Treating the delay as a first-passage time, we derive a closed-form law T_grok - T_mem = (1 / (2 kappa_LL eta lambda)) log(V_mem / V_star), where V_t = ||theta_t||^2 is the squared parameter norm, V_star is an architecture-dependent threshold, and kappa_LL absorbs the AdamW correction to the clean-SGD contraction rate 2 eta lambda. Calibrating (kappa_LL, V_star) on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which V_star / V_mem stays comparatively stable within architecture (CV about 14% on the 1L transformer). First-passage of V_t is necessary but not sufficient. A quantile-margin theorem establishes that positive delay requires both norm separation V_mem > V_post and angular reachability of a threshold alpha_star = arcsin(C / V_T_mem^(1/2)), where C is computable from the empirical NTK feature map and the validation-margin quantile. Calibrating C on modulus p=89 predicts alpha_star = 47.2 degrees at p=97 (observed 47.8 degrees, error 1.3%) as a prior cross-cell prediction. Causal interventions that freeze the norm or remove weight decay at memorisation eliminate grokking (0/6 vs. 3/3 baseline), trapping the angular displacement near 12 degrees. kappa_LL is empirically measured per architecture rather than derived from (beta_1, beta_2, epsilon); within-architecture CV stays at most 15% across four architectures, but values differ by about 2x between architectural variants beyond depth alone. Empirical scope is algorithmic tasks (modular arithmetic, sparse parity) under AdamW; whether the law transfers to natural-language scale models is open.
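
The closed-form law in the abstract is what one gets if the squared norm contracts geometrically at rate 2 kappa_LL eta lambda per step until it crosses V_star. The sketch below evaluates the formula and compares it with a step-by-step simulation of that contraction. The function names, the pure geometric-decay model, and all numeric values are illustrative assumptions, not calibrated quantities from the paper.

```python
import math


def predicted_delay(v_mem: float, v_star: float, eta: float,
                    weight_decay: float, kappa_ll: float) -> float:
    """T_grok - T_mem = log(V_mem / V_star) / (2 * kappa_LL * eta * lambda)."""
    if min(v_mem, v_star, eta, weight_decay, kappa_ll) <= 0:
        raise ValueError("all inputs must be positive")
    if v_mem <= v_star:
        return 0.0  # assumption: threshold already crossed, so no delay
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * weight_decay)


def simulated_delay(v_mem: float, v_star: float, eta: float,
                    weight_decay: float, kappa_ll: float) -> int:
    """Count steps of V <- V * (1 - 2*kappa_LL*eta*lambda) until V <= V_star."""
    rate = 2.0 * kappa_ll * eta * weight_decay
    v, steps = v_mem, 0
    while v > v_star:
        v *= 1.0 - rate
        steps += 1
    return steps


# Hypothetical values chosen so that log(V_mem / V_star) = 2.
law = predicted_delay(math.e ** 2, 1.0, eta=1e-3, weight_decay=1.0, kappa_ll=1.0)
sim = simulated_delay(math.e ** 2, 1.0, eta=1e-3, weight_decay=1.0, kappa_ll=1.0)
```

For small per-step rates the discrete simulation and the continuous formula agree closely, which is why the delay scales inversely with the product of learning rate and weight decay in this simplified picture.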

[AI-155] Graph-Driven Cross-Industry Real-Time Monitoring Framework for Anti-Money Laundering Detection in Converged Mobility-Energy Supply Chain Networks

Link: https://arxiv.org/abs/2605.18844
Authors: Rong Liu,Xiaojun Xiao,Zhanqing Su
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the deep integration of the travel and energy industries, cross-industry supply chain finance has gradually become a high-risk field for hidden money laundering incidents. For this reason, this work proposes a graph-driven cross-industry real-time anti-money laundering monitoring framework (GCRMF) for integrated travel-energy supply chain networks. First, a cross-industry heterogeneous graph (CIHG) covering new energy vehicle rental platforms, energy suppliers, fintech institutions, etc., is constructed, and industry semantics are integrated through a Temporal Dual-GAT (Temporal Dual-Graph Attention Network), dynamically encoding capital flow paths and evolution features over time. Subsequently, in order to identify the structural fraud behavior jointly produced by colluding subjects, a meta-path subgraph reasoning module based on contrastive learning and hierarchical graph sampling is proposed to enhance the discrimination capability for cross-industry recurring money laundering behavior. Meanwhile, a self-supervised online learning mechanism is adopted for real-time adaptation and continuous optimization to new money laundering strategies. The experimental results show that, compared with existing graph neural network methods in cross-industry scenarios, GCRMF improves the F1 score by more than 17.8% and greatly reduces the false positive rate.

[AI-156] An Integrated Forecasting Prototype for Emergency Department Boarding Time to Support Proactive Operational Decision Making

Link: https://arxiv.org/abs/2605.18839
Authors: Orhun Vural,Abdulaziz Ahmed,Ferhat Zengul,James Booth,Bunyamin Ozaydin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, including supplementary materials

Abstract:Overcrowding in emergency departments (ED) remains a persistent operational challenge worldwide, causing delays in care delivery and downstream congestion. ED boarding time, defined as the duration admitted patients remain in the ED while awaiting inpatient bed placement, is a key indicator of this congestion. Predicting ED boarding time in advance enables proactive operational decision making before congestion escalates. We developed and evaluated a multi-horizon time series forecasting framework to predict ED boarding time at 6, 8, 10, 12, and 24-hour horizons. Real-world data from a university-affiliated urban hospital in the United States were utilized and integrated with external contextual data sources, including weather, holidays, and major local events. Decomposition-based Linear (DLinear) and Normalization-based Linear (NLinear) time series forecasting deep learning models showed superior performance across multiple horizons. Models were also evaluated under extreme congestion scenarios characterized by elevated boarding times. In addition, a Machine Learning Operations (MLOps) web application prototype was developed to support translation of the forecasting framework into practice through integrated data ingestion, forecast visualization, experimentation, and retraining.

[AI-157] VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

Link: https://arxiv.org/abs/2605.18837
Authors: Yuxuan Weng,Wenhan Luo,Qijia Shao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.

[AI-158] Automated Big Data Quality Assessment using Knowledge Graph Embeddings

Link: https://arxiv.org/abs/2605.18833
Authors: Hadi Fadlallah,Rima Kilany,Mitri Haber,Ali Jaber
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures

Abstract:Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset’s context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset’s context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset. 
Journal reference: International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405. Related DOI: https://doi.org/10.1504/IJDMMM.2025.150987

[AI-159] Precision Tracked Transformer via Kalman Filtering Kriging and Process Noise

Link: https://arxiv.org/abs/2605.18832
Authors: Bo Long,Deepak Agarwal,Jelena Markovic-Voronov,Yi Wang,Liuqing Li
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Transformer is the foundational building block of modern AI, yet offers no principled handling of uncertainty, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our Bayesian Filtering Transformer (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian–plus–process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On sequential recommendation, BFT applied to three major architectures yields significant gains on six benchmarks, with the largest improvements on cold-start users and rare items where uncertainty is highest. On supervised fine-tuning of large language models with noisy data, BFT improves robustness in two regimes: noisy supervision (token-label corruption in question answering) and noisy context (retrieval-augmented QA with real RAG distractors). A single principled modification – restoring precision – unlocks substantial headroom across both classical sequence-modeling and modern LLM regimes.
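
For readers unfamiliar with the term, a Kalman update with adaptive gain can be shown in its simplest scalar form, where precision means inverse variance. The sketch below is the textbook Gaussian fusion rule, not the BFT layer; the function name and the example numbers are assumptions chosen for illustration.

```python
from typing import Tuple


def kalman_update(prior_mean: float, prior_precision: float,
                  obs: float, obs_precision: float) -> Tuple[float, float, float]:
    """Fuse a Gaussian prior with one noisy observation of the same quantity.

    Returns (posterior_mean, posterior_precision, gain).
    """
    if prior_precision <= 0 or obs_precision < 0:
        raise ValueError("prior precision must be positive, observation precision non-negative")
    posterior_precision = prior_precision + obs_precision  # precisions add
    gain = obs_precision / posterior_precision             # trust placed in the observation
    posterior_mean = prior_mean + gain * (obs - prior_mean)
    return posterior_mean, posterior_precision, gain


# A precise observation moves the estimate most of the way toward itself...
confident = kalman_update(prior_mean=0.0, prior_precision=1.0, obs=2.0, obs_precision=3.0)
# ...while an imprecise one barely moves it.
uncertain = kalman_update(prior_mean=0.0, prior_precision=1.0, obs=2.0, obs_precision=0.25)
```

The gain shrinks as the observation becomes less precise, so unreliable inputs change the running estimate less. That is the sense in which a precision-aware update differs from adding every update with the same weight.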

[AI-160] The Routing and Filtering Structure of Attention

Link: https://arxiv.org/abs/2605.18826
Authors: Shafayeth Jamil,Rehan Kapadia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures

Abstract:The attention interaction matrix $QK^\top$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(\lambda) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank 2 at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs 5% perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within 1.4% of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity (47%-65% fewer attention parameters at +3.9% to +8.4% PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.
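
Any square matrix splits uniquely into a symmetric part and a skew-symmetric part, which is the algebraic fact behind the routing and filtering split named in the abstract. The sketch below applies that split to a small example matrix. Which matrix the authors actually decompose, and how they measure rank, is not specified here; the function names and the example values are assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def split_symmetric_skew(m: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (S, A) with S = (M + M^T) / 2 and A = (M - M^T) / 2, so M = S + A."""
    n = len(m)
    sym = [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]
    skew = [[0.5 * (m[i][j] - m[j][i]) for j in range(n)] for i in range(n)]
    return sym, skew


def quadratic_form(m: Matrix, x: List[float]) -> float:
    """Compute x^T M x."""
    n = len(m)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


interaction = [[1.0, 2.0], [4.0, 3.0]]  # hypothetical 2x2 interaction matrix
sym_part, skew_part = split_symmetric_skew(interaction)
```

The skew-symmetric part always gives `quadratic_form(skew_part, x) == 0`, so it carries only directed, asymmetric interaction between two different inputs, while the symmetric part carries everything that is mutual.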

[AI-161] Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Link: https://arxiv.org/abs/2605.18822
Authors: Chengqian Zhang,Wei Zhu,Kyumin Lee
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post-training paradigm for improving reasoning capabilities, with critic-free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post-training with full fine-tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine-tuning in post-training for complex reasoning tasks. In this paper, we propose Hybrid-LoRA, an efficient hybrid post-training framework that selectively applies full fine-tuning to a small subset of modules less suited to low-rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid-LoRA Score to rank candidate modules according to their sensitivity to low-rank adaptation under a fixed parameter budget. Experiments show that Hybrid-LoRA closely matches full fine-tuning performance under a 10% full fine-tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state-of-the-art PEFT post-training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.

[AI-162] Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

Link: https://arxiv.org/abs/2605.18820
Authors: Hongyu Gu,Jingwen Fu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages, 3 figures

Abstract:Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{\mathrm{sup}}$ and $\mathcal{L}_{\mathrm{node}}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.

[AI-163] Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Link: https://arxiv.org/abs/2605.18818
Authors: Yao Fehlis,Benjamin Bengfort,Zhangzhang Si,Vahid Eyorokon,Prema Roman,Patrick Deziel,Devon Slonaker,Steve Veldman,Ben Johnson,Joyce Rigelo,Michael Wharton,Steve Kramer
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Academic research tends to focus on new models for document understanding, leaving a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction, and we report our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.

[AI-164] Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates

Link: https://arxiv.org/abs/2605.18816
Authors: Patryk Rygiel, Julian Suk, Kak Khee Yeung, Christoph Brune, Jelmer M. Wolterink
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: Neural surrogates enable orders-of-magnitude acceleration of computational fluid dynamics (CFD) simulations, with the potential to transform engineering and healthcare workflows. Neural surrogate use in real-world applications requires addressing scalability to large, high-resolution surface and volume meshes, as well as to bespoke architectures, and accounting for limited training data through the use of inductive biases. Group-equivariant architectures are a principled way to introduce such bias, yet they can be detrimental when the learning problem itself breaks symmetry, for example, due to strong distributional alignment in the dataset. In this work, we investigate under which conditions equivariance improves generalization in neural CFD surrogates across tasks with increasing levels of distributional alignment and realism, covering automotive aerodynamics and blood flow (hemodynamics). To systematically assess the added value of equivariance at the limit of problem scaling, we introduce the Anchored-Branched Geometric Algebra Transformer (AB-GATr), a neural surrogate that integrates scalability and symmetry preservation to efficiently model coupled surface and volume quantities in an E(3)-equivariant manner. We find that on strongly aligned aerodynamics datasets, i.e., those that break symmetry, enforcing equivariance can degrade in-distribution performance. In contrast, across hemodynamic benchmarks with diverse geometries and varying alignment, equivariance is consistently beneficial. Moreover, across all benchmarks, the explicit equivariance of AB-GATr reliably outperforms implicit symmetry learning through data augmentation. Our findings showcase that equivariance is not universally beneficial across domains, yet it brings tangible advantages in problems lacking strong data regularities.

[AI-165] Composition of Memory Experts for Diffusion World Models

Link: https://arxiv.org/abs/2605.18813
Authors: Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we suggest decoupling future-past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

[AI-166] D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Link: https://arxiv.org/abs/2605.18810
Authors: Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.
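One way to read "weights derived from a differentiable surrogate of expected accepted draft length" is sketched below. It assumes, purely for illustration, that each position j is accepted independently with probability q_j and that acceptance stops at the first rejection, so E[L] = sum_k prod_{i<=k} q_i. The weight of position j is then the normalised derivative of E[L] with respect to log q_j. The paper's actual surrogate and normalisation may differ; both function names are hypothetical.

```python
from typing import List


def expected_accepted_length(accept_probs: List[float]) -> float:
    """E[L] = sum_k prod_{i<=k} q_i when drafting stops at the first rejection."""
    total, prefix = 0.0, 1.0
    for q in accept_probs:
        prefix *= q
        total += prefix
    return total


def position_weights(accept_probs: List[float]) -> List[float]:
    """Normalised dE[L]/d(log q_j) = sum_{k>=j} prod_{i<=k} q_i for each position j."""
    prefixes, prefix = [], 1.0
    for q in accept_probs:
        prefix *= q
        prefixes.append(prefix)
    weights, suffix = [], 0.0
    for value in reversed(prefixes):  # suffix sums of the prefix products
        suffix += value
        weights.append(suffix)
    weights.reverse()
    total = sum(weights)
    if total == 0.0:  # degenerate case: nothing is ever accepted
        return [1.0 / len(accept_probs)] * len(accept_probs)
    return [w / total for w in weights]


# Hypothetical acceptance probabilities early and late in training.
for probs in ([0.5, 0.5, 0.5, 0.5], [0.95, 0.95, 0.6, 0.9]):
    print(probs, expected_accepted_length(probs), position_weights(probs))
```

Because the weights are recomputed from the current acceptance probabilities, they move as the drafter improves, in contrast to a fixed position-decay schedule.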

[AI-167] Metric-Gradient Projection for Stable Multi-Agent Policy Learning

Link: https://arxiv.org/abs/2605.18809
Authors: Zuyuan Zhang, Sizhe Tang, Mahdi Imani, Tian Lan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: General-sum multi-agent learning is often governed by a stacked update field in which each agent’s policy update changes the optimization landscape faced by the others. This coupling can entangle an integrable component of collective improvement with cyclic interaction dynamics, leading to slow or unstable multi-agent learning. Existing approaches, such as regularization, credit assignment, and consensus methods, stabilize MARL through local or algorithmic modifications; HPML complements them by projecting the joint update field onto a metric-gradient component. We introduce HPML (Hodge-Projected Multi-agent Learning), which views the joint update field of a multi-agent system as an element of an L^2 space of vector fields and computes a Hodge-type projection onto the closest metric-gradient potential flow. HPML follows the projected component as the update direction, yielding the closest metric-gradient field under the chosen metric and sampling measure. The projection is defined variationally, characterized by a Poisson-type equation, and implemented through graph-based and amortized neural realizations that recover projected directions from samples. We show that the projected dynamics admit a Lyapunov potential and yield equilibrium-gap bounds with an explicit additive non-potentiality term. Controlled experiments validate the geometric mechanism, and CTDE benchmarks show improved stability and normalized return when HPML is used as a plug-in projection layer in MARL pipelines.
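The idea of projecting a field onto its closest gradient component, characterized by a Poisson-type equation, has a simple discrete analogue on a graph. The sketch below is a generic least-squares Helmholtz-Hodge projection of an edge flow with uniform edge weights; it is not the HPML implementation, and the paper's metric, sampling measure, and neural realization are not reproduced. It assumes a connected graph, and all function names are hypothetical.

```python
from typing import List, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def project_onto_gradient(
    num_nodes: int, edges: List[Tuple[int, int]], flow: List[float]
) -> List[float]:
    """Return the gradient flow closest (in least squares) to the given edge flow."""
    lap = [[0.0] * num_nodes for _ in range(num_nodes)]  # graph Laplacian
    div = [0.0] * num_nodes                              # net inflow at each node
    for (u, v), f in zip(edges, flow):
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
        div[v] += f
        div[u] -= f
    # Discrete Poisson equation L phi = div; pin phi[0] = 0 to fix the free constant.
    reduced = [row[1:] for row in lap[1:]]
    phi = [0.0] + solve_linear(reduced, div[1:])
    return [phi[v] - phi[u] for u, v in edges]


triangle = [(0, 1), (1, 2), (2, 0)]
print(project_onto_gradient(3, triangle, [1.0, 1.0, 1.0]))   # purely cyclic flow
print(project_onto_gradient(3, triangle, [1.0, 2.0, -3.0]))  # flow that is already a gradient
```

A purely rotational flow has no gradient component and projects to zero, while a flow that is already the gradient of a potential is returned unchanged, which mirrors the separation of integrable and cyclic components described in the abstract.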

[AI-168] Block-Based Double Decoders

Link: https://arxiv.org/abs/2605.18807
Authors: Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages main, 13 pages total

Abstract:Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose block-based double decoders, a novel transformer architecture that utilizes doubly-causal block-based attention masks to train with full loss supervision and static sequence packing, combining decoder-only training efficiency with encoder-decoder inference efficiency. In scaling law experiments, block-based double decoders strongly outperform encoder-decoders and closely track decoder-only models across scales. At inference time, they cut KV-cache memory and per-token compute by at least 2/3 without sacrificing prefill caching or other existing inference optimizations available to decoder-only models.

[AI-169] Adaptive Multi-Scale Goodness Aggregation for Forward-Forward Learning

Link: https://arxiv.org/abs/2605.18804
Authors: Salar Beigzad, Vansh Verma
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 tables, IEEE format

Abstract:We propose Adaptive Multi-Scale Goodness Aggregation (AMSGA), a novel extension of the Forward-Forward (FF) algorithm designed to improve stability, robustness, and generalization in local-learning neural networks. AMSGA addresses several limitations of the original FF framework by introducing multi-scale goodness aggregation across local, intermediate, and global representations; adaptive curriculum-guided hard negative mining; layer-dependent adaptive thresholds; and a warm-up cosine annealing learning-rate schedule for improved optimization stability. Together, these modifications strengthen the FF paradigm while preserving its biologically plausible and memory-efficient properties. Experiments on MNIST and Fashion-MNIST demonstrate consistent performance improvements over the baseline FF algorithm, achieving up to +1.45% improvement on MNIST and +1.50% improvement on Fashion-MNIST without significant computational overhead. Our results suggest that local learning methods can become substantially more competitive when goodness estimation and training dynamics are carefully designed.
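Of the components listed in the abstract, the warm-up cosine annealing schedule is the most standard and can be sketched without knowing the paper's details. The version below uses linear warm-up followed by cosine decay; the step counts, base learning rate, and function name are assumptions for illustration, not values reported by the authors.

```python
import math


def warmup_cosine_lr(
    step: int, total_steps: int, warmup_steps: int, base_lr: float, min_lr: float = 0.0
) -> float:
    """Linear warm-up to base_lr, then cosine decay from base_lr down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 100 steps, 10 of them warm-up, peak learning rate 0.1.
for step in (0, 5, 10, 55, 100):
    print(step, round(warmup_cosine_lr(step, total_steps=100, warmup_steps=10, base_lr=0.1), 5))
```

The warm-up phase avoids large early updates while the layer-local goodness estimates are still unreliable, and the cosine tail lowers the step size smoothly toward the end of training.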

[AI-170] PROWL: Prioritized Regret-Driven Optimization for World Model Learning

Link: https://arxiv.org/abs/2605.18803
Authors: Ahmet H. Güzel, Jenny Seidenschwarz, Benjamin Graham, Jonathan Sadeghi, Jeffrey Hawke, Jack Parker-Holder, Ilija Bogunovic
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

[AI-171] Theory-optimal Quantization Based on Flatness

Link: https://arxiv.org/abs/2605.18800
Authors: Xiusheng Huang, Zhe Li, Xuanwu Yin, Lu Wang, Yequan Wang, Dong Li, Emad Barsoum, Kang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 2 figures

Abstract: Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.

[AI-172] Simply Stabilizing the Loop via Fully Looped Transformer

Link: https://arxiv.org/abs/2605.18797
Authors: Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.

[AI-173] HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

Link: https://arxiv.org/abs/2605.18795
Authors: Jia Wei, Zhonghao Zhang, Ping Chen, Qianyang li, Yancheng Pan, Shaoxun Wang, Ziyi Qiu, Longxiang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning of large language models, yet most variants target dense architectures. Mixture-of-Experts (MoE) models scale parameters at near-constant per-token compute, and their sparse activation patterns create untapped opportunities for more efficient adaptation. We propose Hot-Experts Layer-level Low-Rank Adaptation (HELLoRA), which attaches LoRA modules only to the most frequently activated experts at each layer. This simple mechanism reduces trainable parameters and adapter-induced FLOPs while improving downstream performance, an effect we attribute to a form of structured regularization that preserves pretrained expert specialization. To stress-test HELLoRA under extreme parameter budgets, we further compose it with LoRI to form HELLoRI, which freezes the up-projection and sparsifies the down-projection. Across three MoE backbones, namely OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE, and three task families covering mathematical reasoning, code generation, and safety alignment, HELLoRA consistently outperforms strong PEFT baselines. Relative to vanilla LoRA on OlMoE, HELLoRA uses 15.7% of the trainable parameters, reduces adapter FLOPs by 38.7%, achieves 1.9x the training throughput, and improves accuracy by 9.2%. On DeepSeekMoE, HELLoRA outperforms LoRA while using only 23.2% of its trainable parameters. These results demonstrate that activation-aware adapter placement is an effective and practical route to scaling PEFT for MoE language models.
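The selection step described in the abstract, attaching adapters only to the most frequently activated experts in each layer, can be sketched as a simple frequency count over routing decisions. This is an illustrative sketch only: the routing-log format, the per-layer budget, and the function name are assumptions, and the paper's actual criterion for "hot" experts may differ.

```python
from collections import Counter
from typing import Dict, List


def select_hot_experts(
    routing_log: Dict[int, List[int]], experts_per_layer: int
) -> Dict[int, List[int]]:
    """For each layer, return the ids of the most frequently routed experts."""
    hot: Dict[int, List[int]] = {}
    for layer, expert_ids in routing_log.items():
        counts = Counter(expert_ids)
        top = counts.most_common(experts_per_layer)  # (expert_id, count), most frequent first
        hot[layer] = sorted(expert_id for expert_id, _ in top)
    return hot


# Hypothetical routing decisions collected on a calibration set: layer -> chosen expert ids.
routing_log = {
    0: [3, 3, 1, 3, 1, 7, 0],
    1: [2, 2, 5, 5, 5, 4],
}
hot_experts = select_hot_experts(routing_log, experts_per_layer=2)
print(hot_experts)
```

Only the experts returned for a layer would receive LoRA modules; the remaining experts stay frozen, which is where the reduction in trainable parameters comes from.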

[AI-174] Robust Basis Spline Decoupling for the Compression of Transformer Models

Link: https://arxiv.org/abs/2605.18794
Authors: Joppe De Jonghe, Van Tien Pham, Mariya Ishteva
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Decoupling is a powerful modeling paradigm for representing multivariate functions as compositions of linear transformations and univariate nonlinear functions. A single-layer decoupling can be viewed as a fully connected neural network with a single hidden layer and flexible activation functions, providing a direct link with neural networks. Because of this, the use of decoupling methods has gained increasing attention in neural network domains, particularly compression, since it enables structured approximations with reduced parameter complexity. Existing tensor-based decoupling methods typically rely on polynomial or piecewise-linear parameterizations of the internal nonlinear functions, which can suffer from numerical instability or limited expressiveness. In this work, we introduce a B-spline-based decoupling framework that generalizes these existing approaches. By exploiting the local support and flexible smoothness control of B-splines, the proposed formulation yields a more numerically stable and expressive representation. We derive a constrained coupled matrix-tensor factorization and propose a robust alternating least-squares algorithm, called R-CMTF-BSD, incorporating normalization and Tikhonov regularization. The proposed method is validated through experiments on synthetic data and transformer model compression. Results on the Vision and Swin Transformer architectures demonstrate that B-spline decoupling enables substantial parameter reduction while maintaining competitive accuracy, making the R-CMTF-BSD algorithm a promising tool for structured neural network compression.

[AI-175] Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance

Link: https://arxiv.org/abs/2605.18793
Authors: Jing Chen, Shixiang Pan, Yujie Fan, Haocheng Ye, Haitao Xu, Wenqiang Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Accurate spatiotemporal pattern analysis is critical in fields such as urban traffic, meteorology, and public health monitoring. However, existing methods face performance bottlenecks, typically yielding only incremental gains and often exhibiting limited cross-domain transferability. We analyze this bottleneck through spatial and temporal entropy measures, which are used as diagnostic indicators of spatiotemporal complexity mismatch rather than as guarantees that entropy alignment alone yields better forecasting. Empirically, larger mismatch is often accompanied by higher prediction uncertainty, especially under a fixed model-capacity budget. Guided by this diagnostic, we propose a scalable, adaptive framework that harmonizes spatial and temporal feature representations. Spatial dimensionality is compressed via low-rank matrix embedding to preserve essential structure, while an extended temporal horizon captures long-range dependencies and mitigates cumulative errors arising from temporal heterogeneity. Extensive experiments on urban traffic, meteorological, and epidemic datasets demonstrate substantial accuracy gains and broad applicability across the evaluated domains, suggesting that the framework is promising for a wide range of spatiotemporal tasks beyond the current study. The code is available on GitHub at this https URL.

[AI-176] Can LLMs Emulate Human Belief Dynamics?

Link: https://arxiv.org/abs/2605.18781
Authors: Adiba Mahbub Proma, Neeley Pate, James N. Druckman, Gourab Ghoshal, Hangfeng He, Ehsan Hoque
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Can LLMs simulate how humans form and change beliefs in social networks? We put this to the test by replicating an established study on belief dynamics, evaluating 12 LLMs across multiple model families and parameter sizes. The answer is a clear no, and in systematic ways. LLMs fail to capture initial human belief distributions and tend to be overall more conformist than humans, shifting their responses to align with those around them. They also take a nuanced approach to emulating human homophilic tendencies within networks. Our findings carry a double payoff: they highlight fundamental properties of LLM behavior, and they raise a sharp warning against deploying LLMs as human proxies in social simulations.

[AI-177] Decentralized autonomous organization and blockchain-based incentivization framework for community-based facilities management

Link: https://arxiv.org/abs/2605.18773
Authors: Reachsak Ly, Alireza Shojaei, Xinghua Gao, Philip Agee, Abiola Akanmu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 29 pages, 17 figures, 3 tables

Abstract:Traditional facility management often relies on centralized decision-making structures that limit stakeholder participation, leading to misalignment with occupant needs and reduced satisfaction. This paper proposes a novel blockchain- and Decentralized Autonomous Organization (DAO)-based framework for community-based facilities management in smart buildings. The framework comprises two key components: a decentralized governance platform that facilitates transparent collective decision-making through blockchain-based voting, and a maintenance management platform with an incentivization mechanism that encourages building occupants to actively contribute to facility upkeep through tokenized rewards. System evaluation includes cost analysis, scalability, data security considerations, usability testing, and semi-structured interviews with facility managers and researchers to assess the platform’s usefulness, challenges, and adoption potential. The findings demonstrate the framework’s potential as a viable incentivization solution for engaging stakeholders in the collective upkeep and improvement of building infrastructure.

[AI-178] An Efficient Multilevel Preconditioned Nonlinear Conjugate Gradient Method for Incremental Potential Contact

Link: https://arxiv.org/abs/2604.19892
Authors: Yu Zhang, Xing Shen, Kemeng Huang, Wei Chen, Yin Yang, Taku Komura, Tiantian Liu, Xingang Pan
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)

Abstract: Incremental Potential Contact (IPC) guarantees intersection-free simulation but suffers from high computational costs due to the expensive Hessian assembly and linear solves required by Newton’s method. While Preconditioned Nonlinear Conjugate Gradient (PNCG) avoids Hessian assembly, it has historically struggled with poor convergence in stiff, contact-rich scenarios due to the lack of effective preconditioners; simple Jacobi preconditioners fail to capture the global coupling, while advanced hierarchy-based preconditioners like Multilevel Additive Schwarz (MAS) are computationally prohibitive to rebuild at every nonlinear iteration. We present MAS-PNCG, a method that unlocks the power of hierarchical preconditioning for nonlinear optimization. Our key technical innovation is a Sparse-Input Woodbury update algorithm that incrementally adapts the fine-level MAS components to rapidly evolving contact sets. This bypasses the need for full preconditioner rebuilds, reducing maintenance cost to near-zero while capturing the complex spectral properties of the contact system. Furthermore, we replace heuristic PNCG search directions with a Hessian-aware 2D subspace minimization that optimally combines the preconditioned gradient and previous direction. We also apply a fast per-subdomain conservative CCD method that ensures penetration-free trajectories while avoiding overly restrictive global step sizes. Experiments demonstrate that our MAS-PNCG outperforms the state-of-the-art Newton-PCG solvers GIPC and StiffGIPC, both preconditioned with MAS, by up to 5.66× and 2.07×, respectively.
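The "Hessian-aware 2D subspace minimization" can be illustrated on a local quadratic model: given the gradient g, a Hessian H, the preconditioned gradient p, and the previous direction d, choose the step alpha*p + beta*d that minimizes g·s + ½ s·H s, which reduces to a 2x2 linear system. This is a generic sketch, not the MAS-PNCG code; it uses a small dense Hessian for clarity (a solver that avoids Hessian assembly would use Hessian-vector products instead), assumes H is positive definite and p is nonzero, and all names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(h: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in h]


def subspace_step(grad: Vector, hess: Matrix, precond_grad: Vector, prev_dir: Vector) -> Vector:
    """Minimise the quadratic model over span{precond_grad, prev_dir}."""
    hp, hd = matvec(hess, precond_grad), matvec(hess, prev_dir)
    a11, a12, a22 = dot(precond_grad, hp), dot(precond_grad, hd), dot(prev_dir, hd)
    b1, b2 = -dot(grad, precond_grad), -dot(grad, prev_dir)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Directions are (nearly) parallel: fall back to a 1D step along precond_grad.
        alpha, beta = b1 / a11, 0.0
    else:
        # Cramer's rule for [[a11, a12], [a12, a22]] [alpha, beta]^T = [b1, b2]^T.
        alpha = (b1 * a22 - a12 * b2) / det
        beta = (a11 * b2 - a12 * b1) / det
    return [alpha * p + beta * d for p, d in zip(precond_grad, prev_dir)]


# Toy problem: f(x) = 0.5 * x^T H x at x = (1, 1), so the gradient is H x = (2, 4).
hessian = [[2.0, 0.0], [0.0, 4.0]]
step = subspace_step([2.0, 4.0], hessian, precond_grad=[1.0, 0.0], prev_dir=[0.0, 1.0])
print(step)
```

In this toy case the two directions span the whole space, so the step lands exactly on the minimizer; in a real simulation the subspace is two-dimensional inside a much larger space and the step is only optimal within that plane.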

[AI-179] Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

Link: https://arxiv.org/abs/2605.20127
Authors: Ken Nakamura, Tomoya Nakai, Ryuto Yashiro, Ayumu Yamashita, Kaoru Amano
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages, 12 figures, 5 tables

Abstract:Artificial vision models are often evaluated against the human visual cortex by measuring how accurately their internal representations predict brain responses. However, prediction accuracy alone does not indicate which dimensions of the target brain’s response space are recovered. Here, we introduce a unified framework for evaluating both model-brain and brain-brain alignment by identifying the response dimensions recovered by prediction. Using repeated fMRI measurements, we first identify target-brain response dimensions that can be reproducibly predicted across independent trial splits. We then predict target-brain responses from either another subject’s brain responses or a vision model’s internal representations, and quantify how strongly each of these reproducible response dimensions is recovered. Applying this framework to a subset of the Natural Scenes Dataset, in which eight subjects viewed the same natural images during fMRI, we find that the early-to-intermediate visual-cortex responses contain a low-dimensional set of reproducible dimensions. Brain-to-brain comparisons identify which of these dimensions are consistently recoverable from other subjects’ brains, providing a diagnostic human reference rather than only a scalar benchmark. In some cases, pretrained and randomly initialized models achieve similar prediction accuracy while showing distinct recovery profiles across these response dimensions. These results show that prediction accuracy alone can mask model-brain mismatches. By making explicit which reproducible brain response dimensions are recovered by prediction, our framework provides a more diagnostic evaluation of alignment between artificial vision models and the human visual cortex.

[AI-180] Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

Link: https://arxiv.org/abs/2605.19352
Authors: Subba Reddy Oota, Anant Khandelwal, Khushbu Pahwa, Satya Sai Srinath Namburi, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 21 pages, 11 figures

Abstract: Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape the models’ internal representations and their alignment with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly higher voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.

[AI-181] Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models

Link: https://arxiv.org/abs/2605.18931
Authors: Abdelhakim Ziani (MICS), Andras Horvath (UNITO), Paolo Ballarini (MICS)
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract: Heavy-tailed distributions are prevalent in performance evaluation, network traffic, and risk modeling. This behavior poses a fundamental challenge for modern deep generative models. Standard Variational Autoencoders (VAEs) employ Gaussian decoder likelihoods and Lipschitz-constrained neural networks, a combination that is structurally incapable of producing heavy-tailed outputs: the Gaussian tail decays exponentially, and Lipschitz continuity prevents the decoder from amplifying rare events from the latent space input to sufficiently overcome this decay. We provide both a theoretical characterization of this limitation and a controlled empirical demonstration using synthetic Pareto data across a grid of tail indices \alpha \in \{2, 3, 5, 30\} and dimensions d \in \{1, 5, 10\}. As a solution, we replace the Gaussian decoder with a Phase-Type (PH) distribution based on Markov chains, while keeping the encoder, latent space, and training procedure identical. PH distributions allow for arbitrarily precise approximations of any positive-valued distribution, including heavy-tailed families. Experiments showed that the PH-based model reduces tail Kolmogorov-Smirnov distance by up to 6× and extreme quantile error by up to 10× compared to the Gaussian baseline for heavy-tailed data. These results demonstrate that integrating Markov chain-based distributions into the decoder of a generative model constitutes a principled and practically effective solution to the heavy-tail generation problem.
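The structural limitation described above can be made concrete with the standard Gaussian concentration inequality for Lipschitz functions. The display below is a textbook result offered as a plausible reading of the argument, not the paper's own theorem; the symbols f, L, Z, k, and x_m are notation introduced here for illustration.

```latex
% Z ~ N(0, I_k) is the latent variable; f : R^k -> R is L-Lipschitz
% (for example, one coordinate of the decoder mean).
\[
  \Pr\bigl( f(Z) - \mathbb{E} f(Z) \ge t \bigr)
  \;\le\; \exp\!\left( -\frac{t^{2}}{2L^{2}} \right)
  \qquad \text{for all } t \ge 0 .
\]
% Pareto target with scale x_m > 0 and tail index \alpha:
\[
  \Pr( X > t ) \;=\; \left( \frac{x_m}{t} \right)^{\alpha}
  \qquad \text{for } t \ge x_m .
\]
```

The first tail shrinks like exp(-t²/(2L²)) while the second shrinks only polynomially, so for any fixed L and \alpha the Lipschitz image of a Gaussian eventually places far too little mass in the extremes. Adding Gaussian observation noise to the decoder output does not change this, since the sum remains sub-Gaussian.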

[AI-182] Cross-Subject Intracranial EEG Reconstruction from Scalp Recordings Using Multi-Scale Cross-Attention Transformers

Link: https://arxiv.org/abs/2605.18897
Authors: Tien-Dat Pham, Xuan-The Tran
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Intracranial EEG (iEEG) provides high-fidelity neural recordings essential for clinical and brain-computer interface applications, but acquiring these signals requires invasive surgery. While recent studies have attempted to estimate iEEG from non-invasive scalp EEG, most rely on patient-specific models, creating a circular dependency: if surgery is required to collect training data, the non-invasive model offers limited practical benefit. In this study, we address the challenge of cross-subject iEEG reconstruction by predicting intracranial signals for unseen patients using models trained on other individuals. We propose CAST (Cross-Attention Spatial-Temporal Transformer), a machine learning framework that translates scalp EEG into multi-channel iEEG waveforms through a two-stage transfer learning strategy. First, a temporal encoder extracts multi-scale neural representations at three different resolutions. Then, because electrode placements vary substantially across patients, a channel-aware decoder is calibrated using only a few minutes of data from the target subject. We evaluated the proposed method using leave-one-subject-out cross-validation on two public datasets comprising 1,282 iEEG channels. Experimental results demonstrate that CAST reconstructs cortical signals located near the scalp surface substantially better than deep subcortical activity. In highly observable sensorimotor regions, the model achieved peak correlations of up to r=0.864 in the precentral gyrus. Furthermore, with a channel selection strategy, CAST obtained a mean correlation of r=0.545 on viable subjects, outperforming previous within-subject baselines. These findings indicate that cortical iEEG signals can be reconstructed for unseen subjects from scalp EEG without extensive patient-specific training, and that only a brief calibration phase is sufficient to adapt the model to new hardware configurations.

[AI-183] A Nonlinear Complexity Index for Wearable PPG Cardiovascular Stability: Multiscale Validation, Systematic Evaluation Correction, and Bayesian Parameter Optimization

Link: https://arxiv.org/abs/2605.18802
Authors: Timothy Oladunni, Farouk Ganiyu Adewumi
Affiliation: unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cardiovascular stability estimation from wearable photoplethysmography (PPG) requires a principled nonlinear framework, yet major gaps persist in heuristic parameter selection and evaluation protocols that inflate reported performance. We introduce a Stability-Constrained Cardiovascular Stability Index (SCSI) grounded in Cardiac Stability Theory and validate it across 176,742 segments from four heterogeneous PPG datasets at three temporal scales. Cross-dataset analysis demonstrates a large Kruskal-Wallis effect size ($\eta^2$ = 0.351, p < 0.001), strong cross-scale consistency (kappa 0.97), and significant correlation with respiratory rate across 53 ICU records (Spearman r = 0.346, p = 0.011). We identify three evaluation artifacts that inflate heuristic AUC from a true baseline of 0.573 to 0.752: segment-level cross-validation leakage, test-set normalization leakage, and pooled-AUC overweighting that conceals per-patient failure. Correcting these artifacts and applying Bayesian optimization over 15 joint parameters yields SCSI with cross-validation AUC of 0.720. On 18 held-out records, SCSI achieves pooled AUC of 0.757 (95% CI: 0.686-0.828) and negative predictive value of 0.966 for tachypnea screening, while per-record AUC of 0.497 +/- 0.207 is disclosed for transparency. External validation on 42 elective-surgery records yields AUC of 0.621, confirming cross-population generalization. Ablation analysis identifies the nonlinear complexity module as the dominant component. A sparse three-component architecture is proposed as the minimal deployable configuration. The corrected protocol provides a reproducible benchmark for future wearable cardiovascular stability indices.
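Two of the evaluation artifacts named in this abstract, segment-level cross-validation leakage and test-set normalization leakage, can be illustrated with a small sketch. This is not the authors' protocol: the helper names `split_by_patient` and `normalize_with_train_stats` and the toy data are assumptions made purely for illustration. The idea is that all segments from one patient stay on the same side of the split, and that normalization statistics come from training segments only.

```python
from statistics import mean, pstdev


def split_by_patient(patient_ids, test_patients):
    """Return (train_idx, test_idx) so that no patient appears on both sides."""
    test_patients = set(test_patients)
    train_idx = [i for i, pid in enumerate(patient_ids) if pid not in test_patients]
    test_idx = [i for i, pid in enumerate(patient_ids) if pid in test_patients]
    return train_idx, test_idx


def normalize_with_train_stats(values, train_idx):
    """Z-score every value using the mean and std of the training segments only."""
    train_values = [values[i] for i in train_idx]
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a zero spread
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: five segments recorded from three patients.
patient_ids = ["a", "a", "b", "c", "c"]
feature = [1.0, 3.0, 2.0, 10.0, 12.0]
train_idx, test_idx = split_by_patient(patient_ids, {"c"})
normalized = normalize_with_train_stats(feature, train_idx)
```

Splitting on segments instead of patients would let segments from patient "c" appear in both sets, which is the kind of leakage the abstract reports as inflating AUC.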

[AI-184] Features have life history. And we should care

Link: https://arxiv.org/abs/2605.18789
Authors: Philipp Stecher, Sandro Radovanović, Vlasta Sikimić, Reinhard Kahle
Affiliation: unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 21 pages, 7 figures

Abstract:Features in language models have life history: they emerge, persist, and die during training, yet the importance of that history remains largely unexplored. We find evidence of a persistent representational backbone, which we identify in Pythia-160M and -410M as the carrier scaffold: $\sim 50$ sparse features with stable life histories, around which the model’s representational structure organises. It has four properties. (i) It assembles early: features emerge, die, and reorganise $\sim 40\times$ faster in the first 1% of training than afterwards, and the scaffold is already largely fixed by then. (ii) It is load-bearing: joint cross-layer ablation identifies the carriers as far more load-bearing than any count-matched non-scaffold population, a gap invisible to per-firing single-feature methods. (iii) Function precedes direction: which features will become carriers is already predictable from training-onset firing patterns alone, correctly distinguishing future carriers from non-carriers in 4 of 5 cases, before the geometry has settled. (iv) It seeds subsequent development: by the end of training, scaffold carriers have recruited 64% of all active features into the scaffold hierarchy. Life history is consistent with a two-phase account of training: selection appears to largely determine the scaffold in the first 1%; the remaining 99% appears to calibrate geometry around a substrate already set.

[AI-185] The Insurability Frontier of AI Risk: Mapping Threats to Affirmative Coverage, Silent Exposures, and Exclusions

Link: https://arxiv.org/abs/2605.18784
Authors: Alex Leung, Rex Zhang, Ervin Ling, Kentaroh Toyoda, SiewMei Loh
Affiliation: unknown
Categories: Risk Management (q-fin.RM); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); General Economics (econ.GN)
Comments:

Abstract:The rapid diffusion of agentic AI has created a new coverage problem for commercial insurance: some AI-mediated losses are now affirmatively insured, some create silent-AI exposure under legacy cyber, technology errors-and-omissions (E&O), directors-and-officers (D&O), employment practices liability (EPLI), crime, and media policies, and others are being actively excluded. This paper maps that emerging boundary by coding 55 AI threat classes against 26 insurance products, endorsements, and exclusion regimes using public carrier materials and OWASP/MITRE threat catalogs. We identify a four-tier insurability frontier: affirmatively insured perils, silent-AI exposures, actively excluded perils, and perils outside conventional private insurance structures. Our coding measures publicly claimed positioning rather than executed contract wording; the headline statistics describe what carriers publicly state about coverage, not what would be paid in any specific claim. Three patterns emerge. First, affirmative AI coverage is beginning to differentiate by primary risk emphasis: public materials often position Munich Re around model performance and drift, Armilla and parts of the Lloyd’s market around hallucination and broader AI liability, Tokio Marine Kiln and CFC around IP and technology E&O concerns, Apollo ibott around emerging autonomous system liability, and Coalition around deepfake and AI-enabled cyber response. Second, legacy lines retain silent-AI exposure where AI is an instrumentality rather than the legal cause of loss. Third, foundation model concentration is the clearest genuinely novel insurability frontier because upstream model failure can correlate losses across many cedents at once; the relevant market design question is which insurability constraint each candidate structure relaxes, not merely which systemic risk template exists.

Machine Learning

[LG-0] When Does Model Collapse Occur in Structured Interactive Learning?

Link: https://arxiv.org/abs/2605.20151
Authors: Yuchen Wu, Kangjie Zhou, Weijie Su
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments: 57 pages, 12 figures

Abstract:The proliferation of generative artificial intelligence has given rise to an interactive learning environment, where model parameters are continuously updated using not only data generated by natural processes, but also synthetic outputs produced by other models. This paradigm introduces two major challenges: (1) training data are no longer drawn exclusively from the target population, undermining a core assumption of classical statistical learning, and (2) model training processes become inherently correlated, as models interact with one another through repeated exposure to each other’s synthetic outputs in a potentially complex manner. Establishing reliable statistical inference in such structured interactive learning environments therefore remains an important open problem. In particular, there is growing concern about model collapse, a phenomenon in which the performance of generative models progressively degrades as they are trained on synthetic data produced by earlier model generations. Prior work on model collapse primarily focuses on a single model trained on its own output, failing to capture model performance in multi-model interactive settings. In this work, we fill this gap by investigating the performance of generative models in an interactive learning environment with general interaction patterns. In particular, we formalize model interactions using directed graphs and show that the occurrence of model collapse depends critically on the topology of the interaction graph. We further derive an explicit necessary and sufficient condition characterizing when model collapse occurs, and establish finite-sample results for linear regression and asymptotic guarantees for general M-estimators. We support our theoretical findings through extensive numerical experiments. 

[LG-1] TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

Link: https://arxiv.org/abs/2605.20134
Authors: Zhen Xiong, Shang-Ling Hsu, Cyrus Shahabi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Learning generalizable trajectory representations from raw GPS traces remains difficult because the data is continuous, noisy, and irregularly sampled. Spatial tokenization is also challenging: fine grids yield sparse cells with weak embeddings, while coarse grids merge heterogeneous movement patterns into the same token. We present TrajTok, a trajectory encoder with a simple pretraining recipe for transferable trajectory embeddings. TrajTok first learns a multi-resolution hexagonal cell partition from the spatial distribution of GPS points, converting noisy GPS sequences into discrete cell tokens. To capture both geometry and kinematics, it uses a factorized transformer encoder with early per-modality self-attention blocks, cross-attention fusion layers, and spatiotemporal rotary position embeddings, ST-RoPE, to encode where and when each token occurs. TrajTok is pretrained with masked-token modeling that recovers both geometric structure and kinematic patterns from partial trajectory observations. On the Porto dataset, a frozen TrajTok encoder with lightweight task adapters achieves strong performance across trajectory similarity search, classification, estimated time of arrival, and full travel-time regression, outperforming multiple task-specific methods. The same frozen encoder supports both geometry-dominated and kinematics-dominated tasks, suggesting that TrajTok learns transferable trajectory structure rather than task-specific shortcuts. These results indicate that learned multi-resolution spatial tokenization combined with masked-token pretraining is a promising direction for general-purpose trajectory foundation models.

[LG-2] Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

Link: https://arxiv.org/abs/2605.20105
Authors: Valentina Njaradi, Clémentine Dominé, Rachel Swanson, Marco Mondelli, Andrew Saxe
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.

[LG-3] Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

Link: https://arxiv.org/abs/2605.20074
Authors: Thien Le, Melanie Weber
Categories: Machine Learning (cs.LG)
Comments: 22 pages

Abstract:Distillation transfers knowledge from a large model trained on broad data to a smaller, more efficient model suitable for deployment. In structured prediction settings, prior knowledge about the task can guide the choice of a target architecture that is algorithmically aligned with the underlying problem. Building on recent learning-theoretic analyses of decision-tree (DT) distillation (Boix-Adsera, 2024), we study when distillation succeeds for combinatorial optimization tasks. We focus on the case where the target model is a graph neural network whose architecture is aligned with a dynamic programming (DP) algorithm for the task. Assuming that the source model is sufficiently rich, formalized through the linear representation hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024), we show that the distillation problem can be solved efficiently in the complexity parameters of the DP transition function, represented as a DT. Our results provide a rigorous sufficient condition for successful distillation in the flavour of algorithmic alignment.

[LG-4] Smooth Partial Lotteries for Stable Randomized Selection

Link: https://arxiv.org/abs/2605.20069
Authors: Alexander Goldberg, Giulia Fanti, Nihar B. Shah
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
Comments:

Abstract:Competitive selection processes, from scientific funding to admissions and hiring, use evaluations to score candidates, and eventually choose a subset of them based on those scores. Recently, many organizations have adopted partial lotteries, which randomize selection based on evaluation scores. However, existing lottery designs are inherently unstable, as a small change to a single candidate’s score can cause large shifts in their selection probabilities. This instability undermines a key goal of lotteries: reducing the influence of fine-grained score distinctions near the decision boundary. We propose smoothness as a design principle for partial lotteries, formalizing it as a Lipschitz condition on the mapping from review scores over candidates to selection probabilities. We introduce the Clipped Linear Lottery, a simple mechanism in which selection probabilities scale linearly with estimated quality between an upper threshold, above which we always accept, and a lower threshold, below which we always reject. We prove that the Clipped Linear Lottery’s worst-case regret matches a lower bound for any smooth selection rule up to a factor of (1 - k/n) , where k/n is the acceptance rate. We compare smooth selection to other stability notions like Individual Fairness and Differential Privacy, showing that the Clipped Linear Lottery achieves a better smoothness-regret tradeoff than alternatives. Experiments on real peer review data from ICLR 2025, NeurIPS 2024, and the Swiss National Science Foundation demonstrate that existing lottery designs are highly unstable in practice even under perturbations to a single score. Our experiments also confirm the tightness of our theoretical analysis and show that our proposed Clipped Linear Lottery achieves a better smoothness-utility tradeoff than alternatives in practice.
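The clipped linear rule described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `clipped_linear_probs` is hypothetical, the two thresholds are taken as given inputs, and the sketch does not show how the paper chooses them (for example, to meet a target acceptance rate k/n) or how it samples a final set from the probabilities.

```python
def clipped_linear_probs(scores, lower, upper):
    """Selection probability per score: 0 below `lower`, 1 above `upper`,
    and linear in the score between the two thresholds."""
    if upper <= lower:
        raise ValueError("upper threshold must exceed lower threshold")
    width = upper - lower
    return [min(1.0, max(0.0, (s - lower) / width)) for s in scores]


# Hypothetical review scores on a 1-10 scale with thresholds 4 and 8.
probs = clipped_linear_probs([1, 5, 7, 9], lower=4, upper=8)
# -> [0.0, 0.25, 0.75, 1.0]
```

Under this sketch, changing one score by d changes that candidate's probability by at most d / (upper - lower), which is the kind of Lipschitz-style smoothness the abstract refers to.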

[LG-5] Active Context Selection Improves Simple Regret in Contextual Bandits

Link: https://arxiv.org/abs/2605.20040
Authors: Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, while remaining instance-dependent with respect to the context distribution vector $p$. Akin to experimental design problems where the population of interest is fixed but the sampled subpopulation can be controlled, we allow the learner to actively choose which context to sample from. For a known $p$, we characterize tight regret rates: passive sampling, where contexts are randomly revealed, achieves regret of order $\sqrt{n/T}\,\lVert p \rVert_{1/2}$, whereas active sampling with allocation $q_j \propto p_j^{2/3}$ achieves the tight rate $\sqrt{n/T}\,\lVert p \rVert_{2/3}$. The resulting improvement can be as large as $\Theta(k^{1/4})$, where $k$ is the number of contexts. We further extend the analysis to budgeted active sampling, characterize the corresponding tight rate, and identify when a limited active budget suffices to recover the fully active rate. When $p$ is unknown, we propose the Explore-Explore-Then-Commit (EETC) algorithm, which optimally balances estimating the context distribution and the time to switch to active allocation, such that for large horizons, it matches the known-$p$ active rate up to constants. Experiments on synthetic and real-world data support our theoretical findings.
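The active allocation mentioned in this abstract, where the sampling share of context j is proportional to p_j raised to the power 2/3, is easy to compute for a known context distribution. The sketch below only normalizes those weights; the function name `active_allocation` is an assumption, and nothing here reproduces the paper's regret analysis or the EETC algorithm.

```python
def active_allocation(p, exponent=2 / 3):
    """Return sampling shares q with q_j proportional to p_j ** exponent."""
    if any(pj < 0 for pj in p):
        raise ValueError("context probabilities must be non-negative")
    weights = [pj ** exponent for pj in p]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one context must have positive probability")
    return [w / total for w in weights]


# Hypothetical skewed context distribution over two subpopulations.
q = active_allocation([0.9, 0.1])
```

Relative to passive sampling, where shares equal p itself, this allocation shifts samples towards rare contexts: in the example the second context receives a share larger than its population weight of 0.1.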

[LG-6] D3-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market

Link: https://arxiv.org/abs/2605.20036
Authors: Taijie Chen, Rui Su, Siyuan Feng, Laoming Zhang, Hongyang Zhang, Haijiao Wang, Zhaofeng Ma, Jintao Ke
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 14 figures

Abstract:Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (Rides) and gross merchandise value (GMV), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet these requirements, we introduce D$^3$-Subsidy (Dynamic Driver-side Diffusion-based Subsidy), a hierarchical diffusion-based framework for deployable city-wide subsidy control. To bridge the train-inference gap, D$^3$-Subsidy employs a prefix-conditioned diffusion model that samples plausible future trajectories from immutable historical observations, ensuring the training protocol aligns with the fixed-history nature of online deployment. These generated plans are then decoded by a context-conditioned inverse module into low-dimensional city-level control signals. For scalable execution, we bridge the gap between city-level planning and fine-grained dispatch via a Lagrangian-dual-derived mapping, which embeds subsidy-rate caps directly into order-driver incentives without iterative optimization. Additionally, a multi-city pretraining strategy with parameter-efficient fine-tuning enables robust transfer across heterogeneous cities. Extensive offline evaluations demonstrate that D$^3$-Subsidy improves Rides and GMV while enhancing cap compliance, and a real-world A/B test confirms significant uplift while keeping budget-related violation metrics within operational thresholds.

[LG-7] CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection IJCAI2026

Link: https://arxiv.org/abs/2605.20032
Authors: Junjun Pan, Yixin Liu, Yu Zheng, Lianhua Chi, Alan Wee-Chung Liew, Shirui Pan
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Accepted by IJCAI 2026

Abstract:Text-attributed graph fraud detection (TAGFD) plays a critical role in preventing fraudulent activities on online social and e-commerce platforms. However, to evade detection, fraudsters continuously evolve their camouflaging strategies by deliberately mimicking textual responses of benign users, thereby concealing their malicious purposes. This phenomenon, referred to as semantic camouflage, fundamentally undermines commonly relied assumptions on how structural and attribute cues can be exploited to identify fraudsters, and makes it difficult to spot fraudsters with unsupervised TAGFD. To bridge the gaps, we propose a Case-Adaptive Multi-cue Expert fRAmework (CAMERA) for unsupervised TAGFD. CAMERA employs an ego-decoupled mixture-of-experts architecture, where each expert specializes in modeling a distinct type of fraud-indicative cue. A context-informed gating model is introduced to jointly consider the ego node representation and its local neighborhood context for adaptive integration of cues learned by different experts. Furthermore, CAMERA leverages the inherent rarity of fraudsters to support unsupervised one-class learning with expert-level objectives that encourage modeling dominant benign patterns, thereby enabling reliable unsupervised detection of camouflaged fraudsters. Experiments on 4 challenging datasets show that CAMERA consistently outperforms competitors, showing its effectiveness against semantically camouflaged fraudsters. Code available at this https URL

[LG-8] Take It or Leave It: Intent-Controlled Partial Optimal Transport

Link: https://arxiv.org/abs/2605.20030
Authors: Salil Parth Tripathi, Bertrand Chapron, Fabrice Collard, Nicolas Courty, Ronan Fablet
Categories: Machine Learning (cs.LG)
Comments:

Abstract:While optimal transport (OT) enforces a rigid constraint by requiring two measures to be matched exactly, partial optimal transport relaxes this requirement by allowing mass to remain unmatched through a global budget, scalar rebate, or uniform rejection rule. However, many applications call for more structured, pointwise rejection mechanisms, where the decision to leave mass unmatched depends on side-specific reliability, support geometry, or external information about which components should participate in the comparison. We introduce intent-controlled partial optimal transport (IC-POT), a targeted generalization of partial transport that replaces the global rejection paradigm with pointwise rejection costs over both measures. We show that the resulting optimization problem admits a dual interpretation in terms of local acceptance thresholds and can be solved by recasting it as a balanced Kantorovich OT problem on an augmented support. Beyond theoretical analysis, we demonstrate the practical relevance of IC-POT in settings where rejection is driven by side information. In positive-unlabeled learning and open-partial domain adaptation, incorporating pointwise rejection rules that encode statistical structure improves fixed baseline pipelines. Finally, we motivate the use of IC-POT with a practical geophysical case: multi-modal satellite ocean measurements, for which physical and sensor priors naturally inform the rejection mechanism and define the retrieved comparable signal information.

[LG-9] Training-Free Bayesian Filtering with Generative Emulators

Link: https://arxiv.org/abs/2605.20028
Authors: Thomas Savary, François Rozet, Gilles Louppe
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: Accepted as a spotlight paper at the International Conference on Machine Learning 2026

Abstract:Bayesian filtering is a well-known problem that aims to estimate plausible states of a dynamical system from observations. Among existing approaches to solve this problem, particle filters are theoretically exact for non-linear dynamics and observations, but suffer from poor scalability in high dimensions. In this work, we show that diffusion-based emulators of dynamical systems can be used to implement, without additional training, an optimal variant of particle filters that has remained largely unexplored due to implementation challenges with classical numerical solvers. Experiments on nonlinear chaotic systems, including atmospheric dynamics, demonstrate that the proposed approach successfully scales particle filtering to high-dimensional settings.

[LG-10] Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates

Link: https://arxiv.org/abs/2605.20005
Authors: Parjanya Prajakta Prashant, Jiongli Zhu, Aldan Creo, Babak Salimi
Categories: Machine Learning (cs.LG)
Comments: 25 pages

Abstract:Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.
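The abstract states that per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. One way to read that bound, offered here purely as a hypothetical sketch and not as FINCH's actual schedule (which the abstract does not specify), is to cap that product at a fixed budget. The function name `loss_adaptive_lr` and the parameters `budget` and `max_lr` are assumptions introduced for illustration.

```python
import math


def loss_adaptive_lr(loss, budget, max_lr):
    """Pick a learning rate so that lr * sqrt(loss) stays at or below `budget`,
    never exceeding `max_lr`. Hypothetical rule, not the published schedule."""
    if loss <= 0:
        return max_lr
    return min(max_lr, budget / math.sqrt(loss))


# High-loss batches get a smaller step; the rate grows as the loss falls.
lr_high_loss = loss_adaptive_lr(loss=4.0, budget=1e-4, max_lr=1e-3)
lr_low_loss = loss_adaptive_lr(loss=0.01, budget=1e-4, max_lr=1e-3)
```

The returned value would be passed to whatever optimizer is in use before each step; the fine-tuning objective itself is left unchanged, matching the abstract's description.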

[LG-11] Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

Link: https://arxiv.org/abs/2605.19969
Authors: Sayan Biswas, Antoine Boutet, Davide Frey, Romaric Gaudel, Rachid Guerraoui, Maxime Jacovella, Anne-Marie Kermarrec, Dimitri Lerévérend, François Taïani, Martijn de Vos
Categories: Machine Learning (cs.LG)
Comments: 41 pages, 10 figures

Abstract:Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus, honest nodes locally analyze received model updates to identify potential backdoor triggers. Nodes then collectively share their triggers with their neighbors and use a structural similarity metric to separate true backdoors from false alarms induced by data heterogeneity. A key insight is that false positive triggers exhibit inconsistencies across participants while true positive ones show consistent patterns. Model updates that fail this collaborative test are rejected, and persistently malicious senders are eventually evicted. We provide the first theoretical convergence guarantees for a DL-specific backdoor detection mechanism, showing that filtering out suspicious model updates with high probability preserves a convergence rate comparable to standard DL. We implement and evaluate Argus on three standard datasets and against three state-of-the-art baselines. Across settings, Argus reduces attack success rates by up to 90 points compared to no defense, while preserving model utility within 5 percentage points of an omniscient oracle. Furthermore, the effectiveness of Argus compared to baselines improves as data heterogeneity increases.

[LG-12] Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

Link: https://arxiv.org/abs/2605.19965
Authors: Bariscan Bozkurt, Efe Ali Gorguner, Francesco Innocenti, Rafal Bogacz
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Blind source separation (BSS) is a natural framework for studying how latent causes may be recovered from sensory mixtures, but deriving online and biologically plausible algorithms for structured (i.e., constrained to known domains) and potentially correlated sources remains challenging. Recent work has derived neural networks for BSS from maximization of an entropy measure, yet its online implementations involve complex and nonlocal recurrent dynamics. Motivated by this perspective, we propose Predictive Entropy Maximization, which achieves competitive performance in BSS, using only local weight updates. The method employs a close approximation of an entropy measure, yielding an objective function with easily interpretable components. Minimizing this objective leads to a predictive neural architecture in which feedforward synapses follow an error-driven rule (that can be realized through dendritic mechanisms), lateral inhibitory connections are learned with local Hebbian plasticity, and source-domain constraints are enforced through simple output nonlinearities. We derive explicit spectral bounds on the surrogate error, characterizing when the approximation is accurate. Empirically, Predictive Entropy Maximization remains robust under increasing source correlation and observation noise, outperforms biologically plausible algorithms that rely on stronger independence or decorrelation assumptions, and remains competitive with exact determinant- and correlative-information-based baselines. These results show how local plasticity and adaptive lateral inhibition can emerge from maximizing a regularized second-order entropy over structured source domains. Our implementation code is available at this https URL.

[LG-13] Learning Orthonormal Bases for Function Spaces

Link: https://arxiv.org/abs/2605.19959
Authors: Hamidreza Kamkari, Mohammad Sina Nabizadeh, Justin Solomon
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA)
Comments:

Abstract:Infinite-dimensional orthonormal basis expansions play a central role in representing and computing with function spaces due to their favorable linear algebraic properties. However, common bases such as Fourier or wavelets are fixed and do not adapt to the structure of a given problem or dataset. In this paper, we aim to represent these bases with neural networks and optimize them. Our key idea is that any target infinite-dimensional orthonormal basis can be viewed either as a point on the Lie manifold of the orthogonal group, or equivalently, as the endpoint of a continuous path on that manifold that connects a reference basis, e.g. Fourier, to that target. Paths on the Lie manifold satisfy ordinary differential equations (ODEs) governed by skew-adjoint integral operators. Using neural networks to define finite-rank generators of such ODEs allows us to parameterize and optimize orthonormal bases in function space. While relying on finite-rank generators to model infinite operators might seem restrictive, we prove a universality result: even with a rank-2 generator, the integrated solutions of the ODE are dense in the orthogonal group under the appropriate operator topology. In other words, for any target orthonormal basis, there exists a path originating from a reference basis and driven by finite-rank generators that gets arbitrarily close to that target basis. We demonstrate the flexibility of our framework by transforming the Fourier basis into the principal components of a functional dataset, eigenfunctions of linear operators, or dynamic modes of energy-preserving physical simulations.

[LG-14] Exploiting Non-Negativity in DAG Structure Learning

Link: https://arxiv.org/abs/2605.19947
Authors: Samuel Rey, Madeline Navarro, Gonzalo Mateos
Categories: Machine Learning (cs.LG)

Abstract:This work addresses the problem of learning directed acyclic graphs (DAGs) from nodal observations generated by a linear structural equation model. DAG learning is a central task in signal processing, machine learning, and causal inference, but it remains challenging because acyclicity is a global combinatorial property. Continuous acyclicity constraints have led to important algorithmic advances by replacing the discrete DAG constraint with smooth equality constraints. However, existing formulations still involve difficult non-convex optimization landscapes and may suffer from degenerate first-order optimality conditions. Here, we restrict attention to DAGs with non-negative edge weights and exploit this additional structure to obtain a simpler characterization of acyclicity. Building on this characterization, we formulate a regularized non-negative DAG learning problem and develop an algorithm based on the method of multipliers. We further analyze the benign optimization landscape induced by non-negativity. In the population regime, we show that the true DAG is the unique global minimizer of the proposed augmented-Lagrangian formulation; moreover, the landscape contains no spurious interior stationary points, and the true DAG is the only acyclic KKT point. Numerical experiments on synthetic and real-world data show that the proposed method improves over state-of-the-art continuous DAG-learning alternatives.
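The abstract does not spell out the paper's acyclicity characterization, so the following is only a minimal sketch of why non-negativity can help, under our own assumptions. For a non-negative weighted adjacency matrix W, tr(W^k) sums the weights of all closed walks of length k, and no term can cancel another. The graph is therefore acyclic exactly when the sum of tr(W^k) for k = 1..d is zero. The helper names `matmul` and `cycle_score` are hypothetical.

```python
def matmul(a, b):
    """Plain-Python product of two square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cycle_score(w):
    """Return sum_{k=1..d} tr(W^k) for a non-negative matrix W.

    Illustrative assumption, not the paper's formulation: with non-negative
    weights every closed walk contributes a non-negative amount, so the
    score is zero if and only if the graph has no directed cycle.
    """
    if any(x < 0 for row in w for x in row):
        raise ValueError("cycle_score assumes non-negative edge weights")
    n = len(w)
    power = [row[:] for row in w]  # holds W^k, starting at k = 1
    total = 0.0
    for _ in range(n):
        total += sum(power[i][i] for i in range(n))  # add tr(W^k)
        power = matmul(power, w)
    return total


dag = [[0, 2, 0], [0, 0, 3], [0, 0, 0]]      # 0 -> 1 -> 2, no cycle
cyclic = [[0, 2, 0], [0, 0, 3], [1, 0, 0]]   # adds the edge 2 -> 0
print(cycle_score(dag), cycle_score(cyclic))
```

With signed weights the same score could vanish through cancellation even when cycles exist, which is one intuition for why the non-negative case admits a simpler test.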

[LG-15] JAXenstein: Accelerated Benchmarking for First-Person Environments

Link: https://arxiv.org/abs/2605.19926
Authors: Ruo Yu Tao, George Konidaris
Categories: Machine Learning (cs.LG)
Comments: Main paper: 5 pages, supplementary material: 3 pages

Abstract: The progression of reinforcement learning algorithms has been driven by challenging benchmarks. The rate at which a researcher can iterate on a problem setting directly impacts the speed of algorithm development. Modern machine learning has produced tools that allow for fast and scalable algorithm development, like the JAX library. With the availability of these tools, a serious bottleneck in algorithm development is the availability of large and complex domains for experimentation. Most notably, the JAX reinforcement learning ecosystem does not have any benchmarks that test visual first-person tasks; these domains are crucial for testing both exploration and an agent’s ability to overcome partial observability. We introduce JAXenstein: an open-source JAX-based benchmark that implements the Wolfenstein 3D rendering engine for fast and scalable experimentation in visual first-person tasks. JAXenstein is several times faster than comparable vision-based benchmarks, and is easily extensible to more complex first-person domains.

[LG-16] Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding

Link: https://arxiv.org/abs/2605.19902
Authors: Shuo Zhang, Rongqi Hong, Huifeng Zhang, Jian K. Liu
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: Accepted by ISBRA2026

Abstract:Predicting protein-ligand binding affinity remains intractable for multi-domain proteins, where inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduced HCLBind, a self-supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single-domain proteins and global conformational geometry through inter-domain rotation in multi-domain complexes. Our hybrid architecture integrates a domain-gated graph attention network and cross-modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at this https URL.

[LG-17] Fast Tensorization of Neural Networks via Slice-wise Feature Distillation

Link: https://arxiv.org/abs/2605.19842
Authors: Safa Hamreras, Sukhbinder Singh, Román Orús
Categories: Machine Learning (cs.LG)

Abstract:We propose a scalable tensorization framework for neural network compression based on slice-wise feature distillation. Unlike conventional tensor decomposition methods that rely on costly global finetuning, our approach decomposes the network into slices consisting of either individual layers or blocks (e.g., convolutional layers or MLPs), or small groups of consecutive layers, and tensorizes each slice independently to reproduce the intermediate representations of the original pretrained model. This modular strategy improves accuracy recovery, reduces data requirements, and enables efficient parallel optimization. Experiments on ResNet-34 show significant gains over conventional global tensorization, achieving near-lossless compression at moderate compression rates with faster optimization. Results on GPT-2 XL further demonstrate the scalability of the method and its applicability to large-scale models, particularly in distributed settings.

[LG-18] Set-Valued Policy Learning

Link: https://arxiv.org/abs/2605.19830
Authors: Laura Fuentes-Vicente, Mathieu Even, Gaëlle Dormion, Antoine Chambaz, Uri Shalit, Julie Josse
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract: Conventional treatment policies map patient covariates to a single recommended intervention in order to maximize expected clinical outcomes. Although a rich body of causal inference methods has been developed to estimate such policies, point-valued recommendations can be highly sensitive to estimation uncertainty, model specification, and finite-sample variability, while typically providing little guidance about how confident one should be in the recommended action. In this work, we propose a set-valued policy learning paradigm for the multiple-treatment setting, in which policies output a set of plausible treatments rather than a single recommendation. This formulation enables intrinsic uncertainty quantification, with the size of the predicted set reflecting the degree of decision ambiguity. We extend the learning-to-defer framework to multiple treatments via a novel Greatest Lower Bound method, and introduce conformal policy learning, which bridges the gap between unobserved ground-truth optimal treatments and estimated optimal treatment rules. Drawing on insights from the noisy-label literature, we develop a randomness-injection approach that guarantees marginal coverage without requiring assumptions on underlying black-box optimal treatment rules. Through experiments on synthetic data and a real-world application to In-Vitro Fertilization (IVF), we demonstrate that our methods produce robust and actionable policies that naturally incorporate clinical considerations while effectively balancing performance and reliability.

[LG-19] General Lower Bounds for Differentially Private Federated Learning with Arbitrary Public-Transcript Interactions

Link: https://arxiv.org/abs/2605.19813
Authors: Yicheng Li
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract: We prove a general lower bound for differentially private federated learning protocols with arbitrary public-transcript interactions. The protocol may use any number of adaptive rounds, and each client’s local samples may be reused across rounds. For parameter estimation under squared \ell_2 loss, we establish a federated van Trees lower bound for every estimator satisfying a total clientwise sample-level zero-concentrated differential privacy (zCDP) constraint. The main technical ingredient is a privacy-information contraction inequality for complete public transcripts. We illustrate the bound through applications to mean estimation, linear regression, and nonparametric regression.

[LG-20] LionMuon: Alternating Spectral and Sign Descent for Efficient Training

Link: https://arxiv.org/abs/2605.19811
Authors: Arman Bolatov, Artem Riabinin, Nikita Kornilov, Andrey Veprikov, Samuel Horváth, Martin Takáč, Aleksandr Beznosikov
Categories: Machine Learning (cs.LG)
Comments: 38 pages, 13 figures, 4 tables

Abstract:In large-scale optimization, the cheapness and effectiveness of update steps are the most crucial factors for a successful optimizer. Sign-based optimizers like Lion or Signum produce cheap per-step updates, whereas Muon’s spectral matrix-sign update gives a much stronger direction at a substantially higher per-step cost. In this work, we propose LionMuon, which retains the effectiveness of Muon steps while considerably cutting the averaged iteration cost, similar to sign-based methods. It alternates between Lion’s and Muon’s updates on a fixed period P, sharing a single dual-EMA momentum buffer between them. The optimizer state memory therefore matches Lion and is exactly half of AdamW’s. A simpler single-EMA variant, SignMuon, by itself already outperforms pure Muon. At P = 2, LionMuon Pareto-dominates Muon, Lion, Signum, and AdamW on every dataset and architecture we tested at 124M model size, reaching lower validation loss at lower compute, and the same advantage persists at 355M and 720M scale. On the theory side, we prove sharp complexity bounds under heavy-tailed noise which are governed by period-averaged smoothness and noise that interpolate between Muon’s and Lion’s constants. These bounds predict the compute-optimal period and the conditions under which LionMuon outruns Muon and Lion. Code: this https URL
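The alternation between a cheap sign step and a costlier spectral step can be sketched as follows. This is a rough illustration under our own assumptions, not the authors' algorithm. It keeps a single EMA momentum buffer rather than the dual-EMA buffer the abstract describes. It approximates the matrix sign with a plain cubic Newton-Schulz iteration rather than whatever orthogonalization the paper uses. It also places the spectral step at `t % period == 0`, which is an arbitrary choice. All function names and the defaults for `lr` and `beta` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product for small list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def sign_direction(m):
    """Sign-descent direction: the elementwise sign of the momentum."""
    return [[(x > 0) - (x < 0) for x in row] for row in m]


def spectral_direction(m, steps=30):
    """Approximate the matrix sign U V^T of m via cubic Newton-Schulz."""
    norm = math.sqrt(sum(x * x for row in m for x in row)) or 1.0
    x = [[v / norm for v in row] for row in m]  # singular values are now at most 1
    for _ in range(steps):
        xt = [list(col) for col in zip(*x)]
        cubic = matmul(matmul(x, xt), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every nonzero singular value toward 1
        x = [[1.5 * a - 0.5 * c for a, c in zip(rx, rc)] for rx, rc in zip(x, cubic)]
    return x


def alternating_step(w, grad, mom, t, period=2, lr=0.1, beta=0.9):
    """Update the shared momentum, then take a spectral or a sign step."""
    mom = [[beta * m + (1 - beta) * g for m, g in zip(rm, rg)]
           for rm, rg in zip(mom, grad)]
    if t % period == 0:
        direction = spectral_direction(mom)  # expensive step, once per period
    else:
        direction = sign_direction(mom)      # cheap step otherwise
    w = [[wi - lr * di for wi, di in zip(rw, rd)] for rw, rd in zip(w, direction)]
    return w, mom
```

Because both step types read the same momentum buffer, the optimizer state in this sketch is one matrix per parameter. That mirrors the memory argument in the abstract but not the exact update rules.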

[LG-21] B-cos GNNs: Faithful Explanations through Dynamic Linearity

Link: https://arxiv.org/abs/2605.19778
Authors: Joschka Groß, Mohammad Shaique Solanki, Verena Wolf
Categories: Machine Learning (cs.LG)

Abstract:We introduce B-cos GNNs, an inherently explainable class of graph neural networks whose predictions decompose exactly into per-node, per-feature contributions via a single input-dependent linear map. B-cos GNNs use linear (sum-based) aggregation and replace non-linear message and update functions with B-cos transforms. This induces meaningful, task-specific weight-input alignment that is directly accessible through the model’s dynamic linearity. Instance-level explanations follow from a single forward and backward pass, requiring no auxiliary explainer, modified learning objective, or perturbation procedure. Instantiated as a GIN, our approach trades small losses in predictive accuracy for state-of-the-art explainability across diverse synthetic and real-world benchmarks, producing explanations orders of magnitude faster than post-hoc baselines.

[LG-22] MSAlign: Aligning Molecule and Mass Spectra Foundation Models for Metabolite Identification

Link: https://arxiv.org/abs/2605.19752
Authors: Paul Krzakala, Gabriel Melo, Camille Lançon, Charlotte Laclau, Rémi Flamary, Etienne Thévenot, Florence d’Alché-Buc
Categories: Machine Learning (cs.LG)

Abstract:Accurately identifying metabolites i.e. small molecules from mass spectrometry data remains a core challenge in metabolomics, with broad applications in drug discovery, environmental analysis, and clinical research. We address the Molecule Retrieval task, which consists in recovering the chemical structure of a metabolite from its MS/MS spectrum given a set of candidate molecules. While the recent release of benchmark datasets such as MassSpecGym and Spectraverse has considerably accelerated the development of novel machine learning approaches, the complexity of data preprocessing pipelines and the lack of unified implementations make methods and results difficult to reproduce and compare. We make three contributions. First, we propose a unified framework encompassing recent approaches based on representation alignment and contrastive learning. Second, we introduce MSAlign, inspired by multimodal alignment in vision-language models, which learns a shared representation space by aligning two frozen foundation models (DreaMS for mass spectra and ChemBERTa for molecules) through lightweight MLP projections trained with a candidate-based contrastive objective. MSAlign is simple to implement, fast to train and consistently outperforms existing approaches across all benchmarks. Third, we investigate a long-standing evaluation problem: data splitting strategies in molecule retrieval implicitly trade off data leakage against domain shift. We formalize this tension by introducing a quantitative measure of distribution shift, and use it to evaluate splitting strategies in existing benchmarks. All datasets, splits, candidate sets, and a unified implementation of MSAlign and baselines are publicly released to support reproducible research.

[LG-23] Graph Neural Networks for Community Detection in Graph Signal Analysis

Link: https://arxiv.org/abs/2605.19733
Authors: Roberto Cavoretto, Alessandra De Rossi, Enrico Montini
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract: Community detection is a central problem in graph analysis, with applications ranging from network science to graph signal processing. In recent years, Graph Neural Networks (GNNs) have emerged as effective tools for learning low-dimensional representations of graph-structured data and have shown strong performance in clustering tasks, particularly on large and high-dimensional graphs. This paper investigates the use of GNN-based community detection within a graph signal interpolation framework. After reviewing the main classes of GNN architectures for community detection according to a standard taxonomy, we integrate the resulting graph communities into a Partition of Unity Method (PUM) for interpolation with Graph Basis Functions (GBFs). In this approach, GNN-derived communities are used to construct local subdomains on which GBF interpolants are computed and subsequently combined into a global approximation. Numerical experiments on benchmark datasets, including geometric and urban network examples, demonstrate that the proposed combination of GNN-based clustering and GBF-PUM interpolation yields accurate signal reconstructions. The results indicate that deep learning-based community detection can provide effective graph partitions for localized interpolation schemes, supporting its use in scalable graph signal analysis.

[LG-24] Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2605.19698
Authors: Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma, Songze Li
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Preprint. 18 pages

Abstract: Text-to-image diffusion models are increasingly developed through open-source reuse and repeated downstream fine-tuning, where reused checkpoints are difficult to verify and thus more susceptible to hidden backdoor behaviors. In such ecosystems, a single pretrained model may be sequentially adapted and redistributed by multiple independent parties, allowing multiple concept-specific trigger-target associations to accumulate in the same model. When these associations coexist, semantic conflicts can be amplified in the shared representation space, leading to cross-concept entanglement and degraded generation quality. Notably, instead of strengthening the attack, such accumulation can destabilize previously injected behaviors and reduce attack reliability. In this work, we systematically investigate backdoor attacks under this interference-prone setting and propose Hydra, a unified framework for robust and controlled multi-concept backdoor injection under cumulative and decentralized reuse. Our core insight is that stable backdoor injection under large-scale multi-concept settings requires explicitly constraining trigger semantics while coordinating cross-task interactions during optimization. Specifically, Hydra performs evolutionary trigger search in the text encoder space to identify triggers that are semantically aligned with their target concepts while remaining stable across other injected concepts. It further combines multi-task fine-tuning with trigger-clean regularization to improve training stability under dense multi-concept injection. Extensive experiments across multiple diffusion backbones under rigorous multi-concept settings show that Hydra maintains effective backdoor activation while preserving clean generation fidelity and image quality. For instance, across 8 attackers and 500 concept pairs, Hydra maintains ~95% ASR and strong clean generation.

[LG-25] Agentic Discovery of Cryomicroneedle Formulations

Link: https://arxiv.org/abs/2605.19677
Authors: Hao Li, Lifu Du, Nurul Hameed, Shemonti Saha Authai, Zlata Stefanovic, Chenjie Xu
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Abstract:Cryomicroneedles offer a route to minimally invasive intradermal delivery of living cells, but their cryogenic formulations must reconcile cell protection with constraints on toxicity and device fabrication. Here we report an AI-assisted, closed-loop workflow for cryomicroneedle cryoprotectant discovery that combines literature curation, Gaussian-process surrogate modelling, Bayesian optimization, and sequential wet-lab validation. A curated dataset of 198 mesenchymal stem-cell cryopreservation formulations from 42 studies was converted into 21 ingredient features and used to train an uncertainty-aware literature prior. This model captured moderate structure in the literature data but failed prospectively, motivating iterative wet-lab correction. Across ten validation iterations and 106 wet-lab observations, the model progressively adapted to cryomicroneedle-specific outcomes: batch RMSE decreased from 41.21 to 6.86 percentage points, later-stage rank correlations became consistently positive, and the cumulative wet-lab predicted-versus-measured summary reached R^2 = 0.942 . The best validated formulation achieved 95.15% post-thaw viability with low DMSO, ectoin, ethylene glycol, and fetal bovine serum. However, high viability alone did not ensure intact cryomicroneedle formation, highlighting the need for future multi-objective optimization. These results demonstrate that agent-assisted computational infrastructure can make data-efficient formulation discovery more accessible to labs with minimal data expertise in-house. Project code is available at this https URL.

[LG-26] Inferring Sensitive Attributes from Knowledge Graph Embeddings: Attack and Defense Strategies

Link: https://arxiv.org/abs/2605.19644
Authors: Yasmine Hayder (PETSCRAFT)
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Knowledge Graphs (KGs) are a powerful representation of linked data, offering flexibility, semantic richness, and support for knowledge enrichment and reasoning. They help data owners organize and exploit heterogeneous data to provide insightful services (e.g., recommendations), yet real-world KGs are often incomplete, hiding true facts or missing valuable insights. Knowledge graph embedding techniques are commonly used to infer valuable missing information. However, reasoning over KGs can inadvertently expose sensitive user information, even when such data is not explicitly stored. In this work, we investigate the privacy risks associated with KGE-based reasoning, focusing on attribute inference attacks where adversaries attempt to deduce sensitive user attributes from seemingly non-sensitive outputs. We propose and evaluate a framework that mitigates these privacy risks by applying post processing sanitization techniques to KGE outputs. Preliminary results demonstrate the effectiveness of these attacks on the outputs of KGE models, and explore the trade-off between recommendation quality and privacy protection when applying randomization based approaches, highlighting the need to experiment with more advanced techniques in future work to address this issue.

[LG-27] Optimal Reconstruction from Linear Queries (COLT 2026)

Link: https://arxiv.org/abs/2605.19625
Authors: Yuval Filmus, Shay Moran, Elizaveta Nesterova
Categories: Machine Learning (cs.LG)
Comments: Accepted to COLT 2026. 46 pages, 4 figures

Abstract: We study the problem of reconstructing an unknown point in \mathbb{R}^d from approximate linear queries. This setting arises naturally in applications ranging from low-dimensional remote sensing and signal recovery to high-dimensional data analysis and privacy-sensitive inference. Our main goal is to characterize the optimal reconstruction error as a function of the number of queries T , the ambient dimension d , and the noise parameter \delta . We first analyze the limit T \to \infty and show that the optimal reconstruction error converges to the explicit value \sqrt{2d/(d+1)} \delta , which plays a role analogous to the Bayes optimal error in supervised learning. When the dimension is fixed, we show that the excess error above this limit decays doubly exponentially fast as T \to \infty , a rate that is significantly faster than those typically encountered in learning curves. When the dimension grows, we show that a number of queries on the order of \exp(d) is necessary and sufficient to achieve vanishing excess error. Finally, we introduce and analyze an improper variant of the reconstruction problem. From a technical perspective, our main contribution is a generalization of Jung’s theorem (1901). The classical theorem bounds the maximum possible radius of a set of diameter 1 and characterizes extremal bodies. Our generalization provides a robust variant that characterizes near-extremal bodies and is proved via geometric and dynamical arguments exploiting symmetry and Lie group actions.
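As a small worked example, the limiting error can be evaluated numerically. The snippet reads the limit in the abstract as sqrt(2d/(d+1)) * delta. This matches the classical Jung bound, under which a set of diameter D in R^d fits in a ball of radius D * sqrt(d / (2(d+1))), evaluated at D = 2 * delta. Taking the diameter to be 2 * delta is our interpretation rather than a statement from the paper, and the function name `limiting_error` is hypothetical.

```python
import math


def limiting_error(d, delta):
    """Evaluate sqrt(2d / (d + 1)) * delta, the T -> infinity limit as we read it."""
    if d < 1 or delta < 0:
        raise ValueError("expected dimension d >= 1 and noise level delta >= 0")
    return math.sqrt(2 * d / (d + 1)) * delta


# The value equals delta in one dimension and increases toward sqrt(2) * delta.
for d in (1, 2, 10, 1000):
    print(d, limiting_error(d, 1.0))
```

Under this reading the irreducible error is never worse than sqrt(2) * delta, regardless of the ambient dimension.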

[LG-28] A Family of Divergence Measures for Evaluating the Reconstruction Quality of Explainable Ensemble Trees

Link: https://arxiv.org/abs/2605.19618
Authors: Massimo Aria, Agostino Gnasso, Carmela Iorio
Categories: Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Validating interpretable surrogate models for ensemble learners requires measuring agreement between the ensemble’s internal representation and its surrogate approximation, rather than mere association. Correlation-based approaches are scale-invariant and fail to detect systematic discrepancies in co-occurrence structure. We propose a statistical framework grounded in the agreement-association distinction, centered on the normalized Loss of Interpretability (nLoI). Rooted in the Cressie-Read power divergence family with lambda equal to 2, the nLoI admits a closed-form decomposition into within-node and between-node components, providing a unique diagnostic capability to identify precisely where and why reconstruction fails. The framework incorporates four complementary measures capturing distinct structural facets of approximation quality. A unified permutation testing procedure delivers valid inference for all measures within a single resampling pass. Theoretical properties, including boundedness and symmetry, are established for each metric. Monte Carlo simulations and empirical evaluations confirm exact Type I error control and demonstrate that these measures detect reconstruction fidelity gradients invisible to correlation-based alternatives. The framework is developed and illustrated in the context of Explainable Ensemble Trees (E2Tree), and empirical evaluation on three benchmark datasets illustrates the practical utility of the framework.

[LG-29] Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments

Link: https://arxiv.org/abs/2605.19589
Authors: Takshak Shende, Viktor Popov
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments: 40 pages, 12 figures

Abstract:Dental aerosol procedures produce sub-50 micrometre nuclei that can remain airborne for long periods in enclosed clinics, creating pathways for airborne pathogen transmission. Reynolds-Averaged Navier-Stokes (RANS) simulations with Euler-Lagrange particle tracking capture this transport accurately but require very long run times per scenario, which precludes real-time clinical decision support in 3D. We present the Eulerian-Lagrangian Graph Interaction Network (ELGIN), a physics-informed graph surrogate that jointly predicts carrier-flow dynamics on the OpenFOAM polyhedral mesh and the per-parcel motion of the polydisperse spray cloud. ELGIN couples a multi-head Graph Transformer with Jacobi-preconditioned learnable pressure projection and a turbulence-closure head to a sigmoid-gated Lagrangian Interaction Network through differentiable inverse-distance mesh-parcel coupling, and advances parcels with a symplectic Stormer-Verlet integrator. A four-stage physics-informed curriculum stabilises 260-step autoregressive rollouts without gradient explosion. A parameter sweep with foam-extend 4.1 OpenFOAM reactingParcelFoam across clinically relevant ventilation rates and handpiece spray speeds provides CFD ground truth. This article reports a single-case demonstration in which both ELGIN and a Lagrangian-only baseline (M0) are trained and evaluated on Sweep_Case_03 of a twenty-case sweep; full 16/2/2 retraining is in progress and will replace all reported metrics. On this case, ELGIN tracks the foam-extend particle cloud much more closely than M0: mean parcel displacement error falls from 19.56% to 16.20% of room width and cloud radius-of-gyration error from 9.85% to 6.58%. A 26-second rollout completes in ~64 s on a 4 GB GPU, approximately 37x faster than the foam-extend reference pipeline, toward per-appointment infection-risk screening once the multi-case checkpoint is in place.
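For readers unfamiliar with the symplectic Stormer-Verlet integrator mentioned above, here is a minimal sketch of its velocity form. It is applied to a one-dimensional unit harmonic oscillator, not to the drag-laden parcel dynamics of the paper. The function name `verlet_step`, the test force, and the step size are illustrative assumptions.

```python
def verlet_step(x, v, accel, dt):
    """Advance position x and velocity v by one velocity-Verlet step."""
    v_half = v + 0.5 * dt * accel(x)           # half kick with the old force
    x_new = x + dt * v_half                    # full drift
    v_new = v_half + 0.5 * dt * accel(x_new)   # half kick with the new force
    return x_new, v_new


# Unit harmonic oscillator: acceleration = -x, exact energy 0.5 * (x^2 + v^2).
x, v = 1.0, 0.0
for _ in range(10_000):
    x, v = verlet_step(x, v, lambda q: -q, 0.01)
print(0.5 * (x * x + v * v))  # stays close to the initial energy of 0.5
```

The energy error of this scheme stays bounded over long rollouts instead of drifting, which is the usual reason to prefer a symplectic integrator for many-step trajectories.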

[LG-30] Online Market Making and the Value of Observing the Order Book (COLT 2026)

Link: https://arxiv.org/abs/2605.19584
Authors: Davide Maran, Marcello Restelli
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted at COLT2026

Abstract: We study an online market-making problem in which a learner sequentially posts bid and ask prices for a single asset while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, we introduce an action-dependent feedback model inspired by real limit order books: when a trade occurs, the trader’s valuation remains hidden, whereas when no trade occurs, informative feedback about supply and demand is revealed. We show that this additional information fundamentally changes the learnability of the problem. In the stochastic setting with i.i.d. market prices, we propose an elimination-based algorithm that achieves O(\sqrt{T}) regret with high probability, without requiring any smoothness assumptions on the distribution of trader valuations. We then extend this result to a broad class of mean-reverting price processes by considering both local, autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, we establish high-probability O(\sqrt{T}) regret bounds, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting with oblivious prices, we design an explore-then-perturb algorithm that guarantees O(T^{2/3}) regret in expectation. Our results quantify the value of observing the order book in online market making and demonstrate that even limited, action-dependent feedback can substantially improve regret guarantees compared to standard bandit feedback models.

[LG-31] Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

Link: https://arxiv.org/abs/2605.19562
Authors: Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings

Abstract:This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.

[LG-32] Provable Fairness Repair for Deep Neural Networks

Link: https://arxiv.org/abs/2605.19549
Authors: Jianan Ma, Jingyi Wang, Qi Xuan, Zhen Wang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 7 tables. Full version of the paper accepted by ASE 2025

Abstract: Deep neural networks (DNNs) are suffering from ethical issues such as individual discrimination. In response, extensive NN repair techniques have been developed to adjust models and mitigate such undesired behaviors. However, existing fairness repair methods are typically data-centric, which often lack provable guarantees and generalization to unseen samples. To overcome these limitations, we propose ProF, a novel fairness repair framework with provable guarantees. The key intuition of ProF is to leverage interval bound propagation (a widely used NN verification technique) to soundly capture model outputs over the whole set S(\mathbf{x}) around a biased sample \mathbf{x} . The derived bounds are utilized to guide fairness repair which encourages the model to produce consistent outputs on S(\mathbf{x}) . Specifically, we integrate fairness constraints and model modifications into a unified constraint-solving formulation, which can be transformed to a Mixed-Integer Linear Programming (MILP) problem solvable by off-the-shelf solvers. The solution to the MILP problem effectively induces a repaired model with guaranteed fairness over the whole set S(\mathbf{x}) . We evaluate ProF on four widely used benchmark datasets and demonstrate that it achieves provable fairness repair, with generalization of up to 95.93% on full datasets and 93.16% on the entire input space. Notably, ProF can be easily configured to support multiple sensitive attributes and more practical fairness definitions, while providing provable repair guarantees and delivering around 90% fairness improvement. Our code is available at this https URL.
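Interval bound propagation, the verification primitive the abstract builds on, can be sketched in a few lines. The snippet is a generic textbook-style illustration, not ProF itself. It pushes an axis-aligned input box through one affine layer and one ReLU and returns sound lower and upper bounds on the outputs. The function names and the toy weights are hypothetical.

```python
def affine_bounds(lower, upper, weights, bias):
    """Bound y = W x + b over the box lower <= x <= upper."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l   # a non-negative weight is minimized at the lower end
                hi += w * u
            else:
                lo += w * u   # a negative weight is minimized at the upper end
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


lo, hi = affine_bounds([-1.0, 0.0], [1.0, 2.0],
                       [[1.0, -1.0], [2.0, 0.0]], [0.0, 1.0])
print(lo, hi)
print(relu_bounds(lo, hi))
```

Every point in the input box is guaranteed to map inside the returned output box. That kind of set-level statement is what a repair procedure can then constrain, here presumably over a neighborhood S(x) of a sample.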

[LG-33] The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

Link: https://arxiv.org/abs/2605.19537
Authors: David Pape,Jonathan Evertz,Lea Schönherr
Categories: Machine Learning (cs.LG)

Abstract:Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and this http URL, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLMs and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.

[LG-34] A dynamical systems view of training generative models and the memorization phenomenon

Link: https://arxiv.org/abs/2605.19483
Authors: Siva Athreya,Chiranjib Bhattacharya,Vivek S. Borkar
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.

[LG-35] Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

Link: https://arxiv.org/abs/2605.19458
Authors: Tom Jacobs,Guido Montufar
Categories: Machine Learning (cs.LG)
Comments: 36 pages, 14 figures

Abstract:We study the max-margin solutions reached by mirror flow in deep neural networks with homogeneous activation functions. Extending classical results on gradient flow, we derive a novel balance equation for mirror flow from convex duality, enabling a characterization of the horizon function governing the induced margin. We further establish max-margin characterizations together with convergence rates and norm growth estimates. Finally, we support our theory through experiments on synthetic datasets and standard vision tasks. Concretely, we show that: (1) distinct non-homogeneous mirror maps can induce the same max-margin solution; (2) convergence can be extremely slow, including exponentially slow regimes; and (3) although all considered mirror maps exhibit feature learning, they can produce markedly different representations, ranging from sparse to dense neuron activations. Together, these results provide a unified perspective on sparse and dense feature learning in homogeneous neural networks, highlighting how mirror maps shape both optimization dynamics and the geometry of the learned classifiers.

[LG-36] TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics

Link: https://arxiv.org/abs/2605.19403
Authors: Alexander Kyuroson,Denis Kleyko,Marcus Liwicki
Categories: Machine Learning (cs.LG)

Abstract:The recent Continuous Thought Machine (CTM) architecture decouples internal computation from external inputs via neural dynamics, but relies on multi-layer perceptrons without stability guarantees. We propose to model neural dynamics using asymmetric Excitatory-Inhibitory (E-I) networks, which can be stabilized via principles from network theory and can be expressed as energy-based systems optimized through a game-theoretic loss. Building on this perspective, we introduce Temporal Inhibitory-Excitatory Dynamic Engine (TIDE), a neuro-inspired architecture that computes internal representations through neural dynamics stabilized by incorporating the Wilson-Cowan dynamics and lateral inhibition. TIDE balances biological realism by, for instance, using Hierarchical Receptive Fields and enforcing Dale’s principle to ensure a realistic 80:20 E-I balance ratio with an end-to-end trainable architecture. The aim of this paper is to introduce a new architecture that brings neuro-inspired learning to the forefront. We present proofs of convergence, stability, and complexity bounds, along with empirical ablation studies. Overall, TIDE surpasses CTM with under 50% of the training time and improves top-1 accuracy by an average of +1.65% on ImageNet under various perturbations.

[LG-37] Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

Link: https://arxiv.org/abs/2605.19392
Authors: Yi Feng,Weiming Ou,Xiao Wang
Categories: Machine Learning (cs.LG)

Abstract:The remarkable success of Adam in training neural networks has naturally led to the widespread use of its descent-ascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.

[LG-38] Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

Link: https://arxiv.org/abs/2605.19366
Authors: Jimeng Shi
Categories: Machine Learning (cs.LG)
Comments: 161 pages

Abstract:Environmental science plays a pivotal role in safeguarding ecosystems, a domain driven by large-scale, heterogeneous data. In the big data era, artificial intelligence (AI) has emerged as a transformative tool for learning patterns and supporting decision-making. This dissertation develops AI-based approaches tailored to complex environmental science problems to achieve Environmental Intelligence, studying three specific challenges. First, we focus on flood prediction and management in coastal river systems. Conventional physics-based models are computationally intensive, limiting real-time application. To overcome this, we propose a deep learning (DL)-based model, WaLeF, for water level forecasting, and a forecast-informed DL model, FIDLAr, to manage water levels. Evaluated in a flood-prone coastal system in South Florida characterized by extreme rainfall and sea level fluctuations, FIDLAr outperforms baselines in accuracy and efficiency while providing interpretable outputs. Second, we target global weather prediction, which is challenged by massive data scale. Traditional physics methods are deterministic and computationally heavy. We propose CoDiCast, a conditional diffusion model tailored for probabilistic weather forecasting. Adapted from generative AI for predictive tasks, experiments show CoDiCast achieves accurate, efficient forecasts with explicit uncertainty quantification. Lastly, we address scientific question-answering in environmental science. When answering in-domain questions, large language models (LLMs) often suffer from hallucinations due to out-of-date or limited knowledge. While retrieval-augmented generation (RAG) retrieves domain-specific knowledge, existing methods trade off accuracy, efficiency, or explainability. We propose Hypercube-RAG, built on a structured text cube framework, which successfully exhibits all three properties simultaneously.

[LG-39] CompoSE: Compositional Synthesis and Editing of 3D Shapes via Part-Aware Control

Link: https://arxiv.org/abs/2605.19350
Authors: Habib Slim,Shariq Farooq Bhat,Mohamed Elhoseiny,Yifan Wang,Mike Roberts
Categories: Graphics (cs.GR); Machine Learning (cs.LG)

Abstract:Creating and editing high-quality 3D content remains a central challenge in computer graphics. We address this challenge by introducing CompoSE, a novel method for Compositional Synthesis and Editing of 3D shapes via part-aware control. Our method takes as input a set of coarse geometric primitives (e.g., bounding boxes) that represent distinct object parts arranged in a particular spatial configuration, and synthesizes as output part-separated 3D objects that support localized granular (i.e., compositional) editing of individual parts. The key insight that enables our method is our use of a diffusion transformer architecture that alternates between processing each part locally and aggregating contextual information across parts globally, and features a novel conditioning technique that ensures strong adherence to the user’s input. Importantly, our method learns to infer part semantics and symmetries directly from the user’s coarse layout guidance, and does not require part-level text prompts. We demonstrate that our method enables powerful part-level editing capabilities, including context-aware substitution, addition, deletion, and style-preserving resizing operations. We show through extensive experiments that our method significantly outperforms existing approaches on guided synthesis, as measured by objective metrics and LLM-based evaluations.

[LG-40] What Makes a Representation Good for Single-Cell Perturbation Prediction? ICML2026

Link: https://arxiv.org/abs/2605.19343
Authors: Wenkang Jiang,Yuhang Liu,Yichao Cai,Erdun Gao,Jiayi Dong,Ehsan Abbasnejad,Lina Yao,Javen Qinfeng Shi
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Single-cell perturbation modeling is fundamental for understanding and predicting cellular responses to genetic perturbations. However, existing approaches, from causal representation learning to foundation models, often struggle with an overlooked challenge: gene expression is dominated by perturbation-invariant information, while perturbation-specific signals are intrinsically sparse. As a result, learned representations either entangle invariant and perturbation-specific information, leading to spurious and non-generalizable predictors, or suppress perturbation-specific signals altogether, rendering them ineffective for prediction. To address this, we propose PerturbedVAE, a general framework designed to resolve this signal imbalance. The framework explicitly separates perturbation-specific information from dominant invariant structure and recovers causal representations to effectively utilize such information for prediction. We further provide an identifiability analysis that characterizes the conditions under which sparse perturbation effects can be reliably recovered, thereby clarifying how the framework can be concretely specified under such conditions. Empirically, PerturbedVAE achieves state-of-the-art performance on a widely used benchmark across multiple evaluation settings, yielding significant gains on out-of-distribution combinatorial predictions and uncovering interpretable perturbation-response programs.

[LG-41] An Exterior Method for Nonnegative Matrix Factorization ICML2026

Link: https://arxiv.org/abs/2605.19325
Authors: Qiujing Lu,Tonmoy Monsoor,Ehsan Ebrahimzadeh,Kartik Sharma,Vwani Roychowdhury
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Nonnegative matrix factorization (NMF) seeks a low-rank approximation X \approx UV^T with nonnegative factors and is commonly solved using interior methods that enforce feasibility throughout optimization. We show that such constraint-driven approaches can impede progress in the nonconvex landscape, leading to slow convergence or convergence to suboptimal stationary points. We propose an exterior framework for NMF (eNMF) that separates low-rank approximation from nonnegativity enforcement. Our method initializes from the optimal unconstrained factorization and introduces a rotation procedure that maps unconstrained factors to an exterior point closest to the nonnegative orthant. This viewpoint yields an algorithmic framework in which simple iterative updates converge to KKT-satisfying stationary points on the boundary of the positive orthant. The exterior formulation also enables a geometric interpretation of NMF solutions, clarifying equivalence classes of factorizations under permutation and orthogonal transformations. An intriguing numerical result, involving 400 NMF experiments across both real and synthetic datasets, shows that in 99% of cases, different algorithms tend to converge towards equivalent factor matrices. We benchmark eNMF against 9 state-of-the-art NMF algorithms with 9 initialization schemes across 3 real-world and 2 synthetic datasets. eNMF consistently outperforms all 81 competitors, achieving up to 30% lower reconstruction error under equal-time settings and up to 150% speedup under equal-error settings. The downstream experiments further demonstrate substantial performance gains in audio processing and recommendation tasks, corroborating the practical benefits of the proposed exterior optimization framework. Code is available at this https URL

[LG-42] BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

Link: https://arxiv.org/abs/2605.19324
Authors: Siddharth Viswanath,Panayiotis Ketonis,Chen Liu,Michael Perlmutter,Dhananjay Bhaskar,Smita Krishnaswamy
Categories: Machine Learning (cs.LG)

Abstract:Efficient neural network models that generate brain-like dynamic activity can be a valuable resource for generating synthetic data, analyzing differences in brain transients under conditions such as testing perturbation activity or inferring the underlying generative dynamics. However, large language models (LLMs) or standard recurrent neural networks (RNNs) ignore the anatomical organization and therefore do not produce components that align with brain regions. On the other hand, graph-based networks often have very simple message passing rules that are not sufficiently expressive for brain-like dynamics. To address this, we introduce BrainDyn, a sheaf neural ordinary differential equation (neural ODE) model for continuous-time dynamics on structured brain graphs. BrainDyn encodes the recent activity history of each brain region using a long short-term memory (LSTM) model over a sliding temporal window to produce hidden states, or stalks, that are projected through learnable restriction maps into edge-specific shared spaces. Discrepancies between neighboring nodes in these shared spaces are characterized by a sheaf Laplacian that can facilitate message passing between neuronal units. The output of these messages is then fed to a neural ODE that governs the continuous-time evolution of neuronal activity. We evaluated BrainDyn on resting-state fMRI (PNC dataset), scalp EEG with focal epilepsy (TUSZ dataset), and simulated activity from the NEST spiking network simulator. BrainDyn achieves strong forecasting ability across modalities, and the resulting representations support downstream tasks including in silico perturbation prediction.

[LG-43] An Objective Performance Evaluation of the LSTM Networks in Time Series Classification

Link: https://arxiv.org/abs/2605.19311
Authors: Sooraj Sunil,Balakumar Balasingam
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: Accepted in 2026 29th International Conference on Information Fusion

Abstract:The rapid adoption of deep learning has increasingly led to data-driven models replacing classical model-based algorithms, even in domains governed by well-understood physical laws. While data-driven models, such as long short-term memory (LSTM) networks, have become a popular choice for time-series analysis, their performance relative to model-based approaches in structured environments is rarely evaluated objectively. This paper presents a performance evaluation framework comparing an LSTM classifier against a model-based expectation maximization (EM) classifier for binary time-series classification. The evaluation is conducted on two scalar linear Gaussian state space models differing only in their noise statistics, where the Kalman filter likelihood ratio test with true parameters serves as a reference for the best achievable classification this http URL Monte Carlo simulations, the classifiers are evaluated across three axes: task difficulty, controlled by the separation in process or measurement noise between the two models; sequence length; and training dataset size. The results show that the EM classifier, which exploits the known model structure, performs strongly when the data conform to the assumed model class. The LSTM classifier requires a larger separation in noise statistics to achieve reliable classification, and its performance saturates below the reference classifier when the models differ only in measurement noise, regardless of sequence length or training dataset size.
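
The reference classifier described in this abstract, a Kalman filter likelihood ratio test between two scalar linear Gaussian state-space models, can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: the model form x_k = a x_{k-1} + w_k, z_k = x_k + v_k, the known initial state x_0 = 0, and all parameter values are hypothetical.

```python
import math
import random


def kalman_loglik(z, a, q, r, x0=0.0, p0=0.0):
    """Log-likelihood of observations z under x_k = a*x_{k-1} + w_k, z_k = x_k + v_k,
    with process noise variance q and measurement noise variance r."""
    x, p, ll = x0, p0, 0.0
    for zk in z:
        # Predict the state and its variance one step ahead.
        x = a * x
        p = a * a * p + q
        # Innovation and its variance give the Gaussian predictive density of z_k.
        s = p + r
        nu = zk - x
        ll += -0.5 * (math.log(2.0 * math.pi * s) + nu * nu / s)
        # Measurement update with the Kalman gain.
        k = p / s
        x += k * nu
        p *= 1.0 - k
    return ll


def simulate(n, a, q, r, rng):
    """Draw n observations from the model, starting from x_0 = 0."""
    x, z = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, math.sqrt(q))
        z.append(x + rng.gauss(0.0, math.sqrt(r)))
    return z


def classify(z, model0, model1):
    """Likelihood ratio test with true parameters: return 1 if model1 is more likely."""
    return int(kalman_loglik(z, *model1) > kalman_loglik(z, *model0))


if __name__ == "__main__":
    rng = random.Random(0)
    model0 = (0.9, 0.1, 0.5)  # (a, q, r): hypothetical, low process noise
    model1 = (0.9, 2.0, 0.5)  # hypothetical, high process noise
    print(classify(simulate(200, *model1, rng), model0, model1))
```

Because the test uses the true parameters of both models, it serves as a reference against which learned classifiers can be compared, which is how the abstract uses it.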

[LG-44] A Two-Phase Adaptive Balanced Penalty Method for Controllable Pareto Front Learning under Split Feasibility Conditions

Link: https://arxiv.org/abs/2605.19306
Authors: Nguyen Viet Hoang,Dung D. Le,Tran Ngoc Thang
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 36 pages, 18 figures, 12 tables. Submitted to Neural Networks (Elsevier)

Abstract:We address the open problem of training hypernetworks for Controllable Pareto Front Learning (CPFL) under split feasibility conditions with rigorous theoretical guarantees. We reformulate the constrained Pareto problem as a Bi-Level Scalarized Split Problem (BSSP) and propose the Adaptive Balanced Penalty (ABP) algorithm, whose three gradient components – optimality, set feasibility, and image feasibility – are blended through an adaptive indicator driven by a computable lower bound. Using a novel convex surrogate technique, we prove full-sequence convergence under standard convexity and Robbins-Monro step-size assumptions. The ABP penalty structure is then translated into a two-phase, feasibility-first training strategy for Hyper-MLP and HyperTrans architectures (ABP-HyperNet). To evaluate constrained CPFL, we introduce the Expected Feasible Hypervolume (EFHV), which jointly captures solution quality and constraint satisfaction. Experiments on five multi-objective benchmarks validate the ABP solver against ground truth, while three multi-task learning datasets demonstrate that ABP-HyperNet achieves up to 2.3x higher EFHV than unconstrained baselines by raising feasibility from 36-49% to 87-100%.

[LG-45] Cross-Paradigm Knowledge Distillation: A Comprehensive Study of Bidirectional Transfer Between Random Forests and Deep Neural Networks for Big Data Applications

Link: https://arxiv.org/abs/2605.19299
Authors: Mahdi Naser Moghadasi
Categories: Machine Learning (cs.LG)

Abstract:The exponential growth of big data has intensified the need for efficient and interpretable machine learning models that can handle diverse data characteristics while maintaining computational efficiency. Knowledge distillation has primarily focused on neural network-to-neural network transfer, leaving cross-paradigm knowledge transfer largely unexplored. This paper presents the first comprehensive study of bidirectional knowledge distillation between Random Forests (RF) and Deep Neural Networks (DNN), addressing critical gaps in ensemble learning and model compression for big data applications. We propose novel methodologies including progressive multi-stage distillation, multi-teacher ensemble distillation from diverse tree models, and uncertainty-aware cross-paradigm transfer mechanisms. Through 144 comprehensive experiments across 6 diverse datasets encompassing classification and regression tasks, we demonstrate that bidirectional RF-DL distillation achieves competitive performance while providing complementary benefits: interpretability from tree models and expressiveness from neural networks. Our results show that multi-teacher ensemble distillation consistently outperforms traditional approaches, with NN-COMPACT achieving 98.13% classification accuracy and NN-WIDE reaching 92.6% R^2 score in regression tasks. The proposed framework enables deployment flexibility in big data environments, allowing optimal model selection based on computational constraints and interpretability requirements. This work establishes a new research direction in cross-paradigm knowledge transfer with significant implications for interpretable AI and scalable model deployment in resource-constrained big data systems.

[LG-46] Domain-Adaptive Communication-Rate Optimization for Sim-to-Real Humanoid-Robot Wireless XR Teleoperation

Link: https://arxiv.org/abs/2605.19293
Authors: Caolu Xu,Zhiyong Chen,Meixia Tao,Li Song,Feng Yang,Wenjun Zhang
Categories: Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to IEEE journal

Abstract:Wireless extended reality (XR) teleoperation provides embodied interaction capability for collecting humanoid robot demonstrations, but its large-scale adoption is restricted by the overhead of high-frequency motion transmission. This paper develops a system framework that integrates sampling, transmission, interpolation, and reconstruction and formulates a communication-rate optimization that aims to minimize the communication energy while maintaining the reconstruction accuracy of robot motion trajectories through dimension-wise sampling-rate control. Since acquiring real-time feedback from physical robots is limited by hardware costs, it is necessary to solve the problem through simulator interaction with offline real-domain data correction. To guide sim-to-real adaptation, we provide a PAC-Bayes generalization characterization that reveals the effects of latent density-ratio estimation, finite-sample deviation, and encoder bias. Building on this analysis, we propose a proximal policy optimization (PPO) method with density-ratio weighting and trust-region regularization. Experiments on a public humanoid teleoperation dataset show that the proposed method improves the tradeoff between reconstruction error and communication energy consumption under sim-to-real distribution shift. We further analyze the effectiveness of the proposed algorithm across various wireless channels and dynamic motion trajectories.

[LG-47] Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

Link: https://arxiv.org/abs/2605.19282
Authors: Chongyu Fan,Gaowen Liu,Mingyi Hong,Ramana Rao Kompella,Sijia Liu
Categories: Machine Learning (cs.LG)

Abstract:Muon is a matrix-aware optimizer that leverages Newton-Schulz (NS) iterations to enforce spectral gradient orthogonalization by driving all singular values of the momentum matrix toward 1. While this uniform spectral whitening enhances exploration and outperforms AdamW in LLM pretraining, we show it could lead to fundamental limitations beyond pretraining in two regimes: (i) cross-modality vision-language-action (VLA) training, where inherently low-rank action-module gradients cause amplification of noisy tail directions, and (ii) reinforcement learning with verifiable rewards (RLVR), where low-SNR gradients and the need to preserve per-head specialization from prior training make whitening unstable. To address these challenges, we propose Pion, a drop-in replacement for Muon that preserves its computational efficiency while replacing uniform spectral whitening with a two-stage Promotion+Suppression mechanism, which we call the high-pass NS iteration. This design induces a sharp spectral high-pass effect, anchoring dominant singular values at 1 while suppressing noisy tail components toward 0, with controllable filter strength. To preserve pretrained per-head heterogeneity, Pion also supports a per-head mode that applies updates independently across attention heads via a simple reshape, at no extra cost. In VLA training on LIBERO and LIBERO-Plus, Pion consistently outperforms both baselines across \ell_1-regression (VLA-Adapter) and flow-matching (VLANeXt) architectures, e.g., reaching 100% success rate on LIBERO Object after 1,500 training steps with VLA-Adapter, vs. 97.0% for Muon and only 32.2% for AdamW. The advantage of Pion further extends to a real Franka Research 3 robot with a \pi_{0.5} backbone under the DROID setup on three grasp-and-place tasks. In RLVR post-training on Qwen3-1.7B/4B with GRPO and GMPO, Pion also outperforms AdamW on MATH and GSM8K while Muon collapses to zero.
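
The "uniform spectral whitening" that this abstract attributes to Newton-Schulz iterations can be illustrated with the classical cubic iteration X ← 1.5 X − 0.5 X X^T X, which pushes every nonzero singular value toward 1. The sketch below is an assumption-laden illustration only: it uses the textbook cubic polynomial rather than whatever coefficients Muon uses in practice, it does not implement Pion's high-pass variant, and the input matrix is hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps=30):
    """Approximate the orthogonal polar factor of G with the cubic Newton-Schulz iteration.

    Each step maps a singular value s to 1.5*s - 0.5*s**3, which converges to 1
    for any s in (0, sqrt(3)); singular vectors are left unchanged.
    """
    # Dividing by the Frobenius norm puts all singular values in (0, 1].
    norm = sum(v * v for row in G for v in row) ** 0.5
    X = [[v / norm for v in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


if __name__ == "__main__":
    G = [[3.0, 0.0, 1.0], [1.0, 2.0, 0.0]]  # hypothetical full-rank "momentum" matrix
    X = newton_schulz(G)
    # For a full-row-rank input, X @ X.T should be close to the identity.
    print(matmul(X, transpose(X)))
```

Because every singular value is driven to the same target, small and noisy directions are amplified as much as dominant ones, which is the behavior the abstract identifies as problematic for low-rank or low-SNR gradients.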

[LG-48] CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Link: https://arxiv.org/abs/2605.19269
Authors: Han Guo,Jack Zhang,Arjun Menon,Driss Guessous,Vijay Thakkar,Yoon Kim,Tri Dao
Categories: Machine Learning (cs.LG)

Abstract:Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.

[LG-49] From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models

Link: https://arxiv.org/abs/2605.19263
Authors: Jianan Yang,Yiran Wang,Shuai Li,Fujun Cao,Xuefei Yan,Junmin Liu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 23 pages, 15 figures

Abstract:Physics-informed neural networks (PINNs) offer a mesh-free framework for solving partial differential equations (PDEs), yet training often suffers from gradient pathologies, spectral bias, and poor convergence, especially for problems with strong nonlinearity, sharp gradients, or multiscale features. We propose the Curriculum-Guided Gaussian Mixture Physics-Informed Neural Network (CGMPINN), which integrates Gaussian mixture modeling with dynamic curriculum learning. Specifically, a GMM is periodically fitted to the PDE residual distribution to quantify spatially varying learning difficulty. A smooth curriculum schedule progressively shifts training focus from easy to harder regions, while precision-based variance modulation suppresses unreliable clusters during early optimization. This dual curriculum is governed by a shared curriculum parameter and can be combined with self-adaptive loss balancing. We further establish theoretical guarantees, including sublinear convergence of the gradient norm for the induced time-varying loss, uniform equivalence between the curriculum-weighted and standard PDE losses, and a generalization bound with an explicit weighting-induced bias characterization. Experiments on six benchmark PDEs spanning elliptic, parabolic, hyperbolic, advection-dominated, and nonlinear reaction-diffusion types show that CGMPINN consistently achieves the lowest relative L_2 and maximum absolute errors among all compared methods, reducing relative L_2 error by up to 97.8% over the standard PINN at comparable cost. Our code is publicly available at this https URL.

[LG-50] Backdooring Masked Diffusion Language Models

Link: https://arxiv.org/abs/2605.19262
Authors: Daniel Yiming Cao,Chengzhong Wang,Sheng-Yen Chou,Chengyu Huang,Pin-Yu Chen,Shengwei An
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Masked diffusion language models (MDLMs) are emerging as a compelling new paradigm for text generation, but their training-time security remains largely unexplored. Existing backdoor attacks on Gaussian diffusion models or autoregressive language models do not directly apply to MDLMs because MDLMs rely on discrete state corruption and iterative denoising rather than continuous noising or left-to-right prediction. In this work, we present the first systematic study of training-time backdoor attacks on MDLMs. We propose SHADOWMASK, a backdoor attack that modifies the MDLM forward corruption process by replacing the standard all-mask terminal distribution with a trigger-mask mixture prior. This creates a dedicated denoising pathway from trigger-corrupted states to attacker-specified targets while preserving clean denoising behavior. We further provide a principled mathematical formulation by defining the backdoored forward process, deriving the reverse-time posterior, and obtaining the continuous-time training objective. Evaluations on DiT-based MDLM and LLaDA-8B-Instruct across WikiText-103, OpenWebText, and Alpaca show that SHADOWMASK achieves near-100% attack success, substantially outperforms standard data poisoning, largely preserves clean utility, remains effective under full-model and parameter-efficient fine-tuning, and is robust against representative defenses.

[LG-51] Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting ICML2026

Link: https://arxiv.org/abs/2605.19249
Authors: Liu Chong,Yingjie Zhou,Hao Li,Pengyang Wang,Qingsong Wen,Ce Zhu
Categories: Machine Learning (cs.LG)
Comments: Accepted to ICML 2026. 18 pages, 6 figures

Abstract:Time-series forecasting is critical in various scenarios, such as energy, transportation, and public health. However, most existing forecasters rely primarily on one-way inference, i.e., mapping history to target, and overlook the structural information provided by a revised natural chain ("history (model input) – target (ground-truth output) – post-target continuation"). The post-target continuation records how trajectories evolve after the target, which can help stabilize forecasting, but it is not observable at inference time. In this work, we aim to obtain an approximate proxy of the post-target continuation for the current input, providing structural knowledge for bidirectional forecasting. This idea is instantiated as KUP-BI (Knowledge Utilization Paradigm with Bidirectional Inspiration), a new time-series modeling paradigm that distills continuation-style knowledge (as an approximate post-target continuation proxy) from a train-only historical library and integrates it into standard forecasting backbones. The input stream and the continuation-proxy stream are fused via a lightweight feature-level gating module. This design does not introduce information beyond what is already contained in the training trajectories; instead, it provides a structured inductive bias that helps backbones exploit typical continuation patterns rather than relying solely on parametric extrapolation. Experimental results on six public datasets show that KUP-BI consistently improves the forecasting performance of state-of-the-art models, with small additional overhead.

[LG-52] GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning

Link: https://arxiv.org/abs/2605.19235
Authors: Zhiyuan Fan,Gabriele Farina
Categories: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)

Abstract:Competitive multi-agent reinforcement learning in imperfect-information games requires agents to act under partial observability and against adversarial opponents, necessitating stochastic policies. While self-play reinforcement learning with Proximal Policy Optimization (PPO) has achieved strong empirical success, its standard advantage estimator, generalized advantage estimation, suffers from additional variance due to the sampling of stochastic future actions. This variance is amplified in equilibrium self-play because of the stochastic nature of the equilibrium policy and persists even when the critic is exact. We address this bottleneck by introducing Q-boosting, a variance-reduced advantage estimator based on a centralized action-value critic, and propose Variance-Reduced Policy Optimization (VRPO), incorporating this new estimator. The algorithm replaces sampled multi-step backups with a multi-step Expected SARSA(\lambda) trace, computing policy expectations at each step to average out action-sampling noise, while retaining PPO’s clipped objective and on-policy actor updates. Empirically, VRPO consistently achieves strong performance from mid-sized to large-scale games including Dou Dizhu and Heads-Up No-Limit Texas Hold’em.
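
The variance argument in this abstract can be illustrated with a one-step backup. The sketch below is not the paper's VRPO or its multi-step trace; it only contrasts a sampled bootstrap target with an expected one, assuming a hypothetical tabular critic `q_next` over next-state actions and a known next-state policy `policy_next`.

```python
def sampled_target(reward, gamma, q_next, next_action):
    """SARSA-style target: bootstraps from the single action that was sampled."""
    return reward + gamma * q_next[next_action]


def expected_target(reward, gamma, q_next, policy_next):
    """Expected-SARSA-style target: averages the critic over the policy,
    so the target no longer depends on which next action was sampled."""
    return reward + gamma * sum(p * q for p, q in zip(policy_next, q_next))


# Hypothetical two-action example with a uniform (stochastic) policy.
q_next = [1.0, -1.0]
policy_next = [0.5, 0.5]
targets_by_action = [sampled_target(0.0, 1.0, q_next, a) for a in range(2)]
averaged = expected_target(0.0, 1.0, q_next, policy_next)
```

Here the sampled target takes the value 1.0 or -1.0 depending on the drawn action, while the expected target is the same for every draw, even though the critic is identical in both cases.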

[LG-53] Quantum Machine Learning for Cyber-Physical Anomaly Detection in Unmanned Aerial Vehicles: A Leakage-Free Evaluation with Proxy-Audited Feature Sets

Link: https://arxiv.org/abs/2605.19233
Authors: Carlos A. Durán Paredes,Javier E. León Calderón,Nicolás Sánchez Perea,German Darío Díaz,Camilo Segura Quintero
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 10 pages, 7 figures, 1 table; open Qiskit 2.x implementation available at this https URL

Abstract:Unmanned aerial vehicles (UAVs) are cyber-physical systems whose attack surface spans networked avionics and on-board sensor fusion: a compromised GPS or battery module can mimic a benign mission segment and evade naive anomaly detectors. We present a leakage-free evaluation of quantum machine learning for UAV anomaly detection on the multi-sensor TLM:UAV benchmark. Three contributions support the study. (i) A group-aware temporal protocol (B2) partitions the dataset into ten contiguous TimeUS blocks and evaluates over ten seeds, eliminating the inflation produced by random stratified splits that mix neighbouring samples. (ii) A three-mode feature audit (full/loose/strict) quantifies how much accuracy stems from instantaneous physical signals versus contextual proxies (cumulative energy, battery state, GPS trajectory). (iii) A hybrid XGBoost + Data Reuploading (DRU) classifier is benchmarked against five paired non-linear controls (raw, PCA, polynomial-2, random-RBF, and an untrained DRU map) under identical budgets. The standalone DRU does not consistently match the strongest classical baseline across seeds; however, the trained-DRU hybrid is the only model whose mean F1 macro shifts upward from full to strict (+0.05), a directional signal that the per-seed standard deviations prevent from being interpreted as a statistically established difference. The trained-DRU hybrid also records the lowest mean false-alarm rate under proxy-free evaluation, subject to the inter-seed variance reported. We frame this as an incremental, reproducible quantum-enhanced hybrid benefit, and provide an open Qiskit 2.x implementation as a benchmark for cybersecurity analytics in NISQ-era aerospace systems.
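
A rough sketch of what a contiguous-block split looks like, as opposed to a random stratified split. This is not the authors' B2 protocol: the function names, the number of held-out blocks, and the way a seed selects them are all assumptions made for illustration.

```python
import random


def contiguous_blocks(timestamps, n_blocks=10):
    """Sort sample indices by time and cut them into n_blocks contiguous chunks."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n = len(order)
    bounds = [n * b // n_blocks for b in range(n_blocks + 1)]
    return [order[bounds[b]:bounds[b + 1]] for b in range(n_blocks)]


def blocked_split(timestamps, n_blocks=10, n_test_blocks=2, seed=0):
    """Hold out whole blocks, so temporally adjacent samples never
    end up on opposite sides of the train/test boundary within a block."""
    blocks = contiguous_blocks(timestamps, n_blocks)
    rng = random.Random(seed)
    test_ids = set(rng.sample(range(n_blocks), n_test_blocks))  # assumed selection rule
    train = [i for b, blk in enumerate(blocks) if b not in test_ids for i in blk]
    test = [i for b, blk in enumerate(blocks) if b in test_ids for i in blk]
    return train, test


# Hypothetical usage: 100 samples with increasing timestamps.
train_idx, test_idx = blocked_split(list(range(100)), seed=3)
```

Repeating the split over several seeds, as the abstract describes, would then vary which blocks are held out rather than which individual samples are.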

[LG-54] DeRegiME: Deep Regime Mixtures for Probabilistic Forecasting under Distribution Shift

Link: https://arxiv.org/abs/2605.19231
Authors: Kieran Wood,Stefan Zohren,Stephen J. Roberts
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We introduce DeRegiME – Deep Regime Mixture of Experts – a direct multi-horizon probabilistic forecaster that separates latent uncertainty regimes from the underlying signal and softly assigns each forecast location to learned recurring regimes using a sparse variational Gaussian process (GP) whose nonstationary regime-mixing kernel and Student-t likelihood combine per-regime sub-kernels and noise processes via a shared gate. This yields a single sparse-GP posterior, not a mixture of GP experts. DeRegiME addresses a key limitation of neural forecasters: point forecasts discard residual uncertainty, and probabilistic heads – whether single marginals, uninterpreted mixtures, quantile sets, or diffusion samples – rarely expose the regime structure of the residual. Yet distribution shift in noisy heteroskedastic time series may be abrupt, gradual, or horizon-dependent and often appears in residual uncertainty rather than the conditional mean. DeRegiME yields an interpretable mean-residual-noise decomposition with a direct-sum feature-space representation that anchors regimes as clusters of residual similarity whose transitions surface as implicit changepoints. The effective number of regimes is pruned by the stick-breaking gate. We prove kernel validity and predictive-density propriety, and across ten benchmarks and three encoder grids DeRegiME improves negative log predictive density (NLPD) by 20.3% over the strongest encoder-matched baseline, a DeepAR/GluonTS-style dynamic Student-t head, with parallel gains on CRPS (3.0%) and MSE (4.7%). Improvements are consistent across all datasets, which span abrupt, gradual, and seasonal shifts.

[LG-55] Sequential Consensus for Multi-Agent LLM Debates: A Wald-SPRT compute governor with calibration-based failure detection

Link: https://arxiv.org/abs/2605.19193
Authors: Andrea Morandi
Categories: Machine Learning (cs.LG)

Abstract:Multi-agent LLM debate improves factuality and reasoning, but most recipes pick a fixed round count, over-spending on easy items and under-spending on hard ones. We adapt Wald’s Sequential Probability Ratio Test (SPRT) as a plug-in compute governor for LLM debates. After each round, an LLM judge emits a [0,1] consensus score on the latest agent positions; a Wald monitor accumulates the log-likelihood ratio of “useful convergence” vs “not yet useful” under a Beta likelihood family, and stops when either boundary is crossed or returns a capped best-effort outcome at R_max. Under i.i.d. assumptions the rule inherits SPRT type-I/type-II error guarantees; in deployment the calibration itself is the more important object, since it estimates whether the judge score actually separates useful from unhelpful convergence in a given domain. We evaluate two tracks: (i) a Monte-Carlo study under calibrated Beta models characterising working curves, error rates, capping behaviour, and sensitivity; and (ii) a real-LLM evaluation on 200 attempted MMLU and 200 attempted GSM8K items with three heterogeneous agents (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a claude-opus-4-6 judge, using disjoint 40-item calibration subsets. On GSM8K the rule stops in 1.01 average rounds (4.06 LLM calls) at 97.0% accuracy vs 99.0% for fixed-5 debate at 15 calls: a 3.7x call reduction at -2pp accuracy. On MMLU the calibrated KL collapses to about 0 and the rule caps on 99.5% of items at 2.1x cost. The takeaway is not that SPRT makes debate more accurate, but that a classical sequential test serves as a cheap compute-control and failure-detection layer for multi-agent LLM systems.
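
The stopping rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the Beta parameters for the two hypotheses (`h1`, `h0`), the error rates, and the return labels are hypothetical placeholders, whereas the paper calibrates its likelihoods on held-out items.

```python
import math


def beta_logpdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def sprt_debate(scores, h1=(8.0, 2.0), h0=(2.0, 8.0),
                alpha=0.05, beta=0.05, r_max=5):
    """Accumulate the log-likelihood ratio of H1 ("useful convergence")
    against H0 ("not yet useful") over per-round judge scores.

    Returns (decision, rounds_used). Stops at Wald's boundaries, or
    reports "capped" once r_max rounds have been consumed.
    """
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0
    rounds = 0
    for s in scores[:r_max]:
        rounds += 1
        s = min(max(s, 1e-6), 1 - 1e-6)    # keep the score strictly inside (0, 1)
        llr += beta_logpdf(s, *h1) - beta_logpdf(s, *h0)
        if llr >= upper:
            return "converged", rounds
        if llr <= lower:
            return "not_useful", rounds
    return "capped", rounds


# Hypothetical judge scores for three debates.
easy = sprt_debate([0.9])
ambiguous = sprt_debate([0.5] * 10)
slow = sprt_debate([0.6, 0.6])
```

With these placeholder parameters a single high score stops the debate after one round, while uninformative scores near 0.5 leave the ratio between the boundaries until the cap, which mirrors the capping behaviour the abstract reports on MMLU.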

[LG-56] A Heuristic Approach for Performance Tuning in RL-based Quadrotor Control via Reward Design and Termination Conditions

Link: https://arxiv.org/abs/2605.19166
Authors: Fausto Mauricio Lagos Suarez,Akshit Saradagi,Vidya Sumathy,George Nikolakopoulos
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted in the 34th Mediterranean Conference on Control and Automation

Abstract:Reinforcement learning (RL)-based quadrotor control policies have achieved impressive performance in tasks such as fast navigation in cluttered environments and drone racing, where the focus is on speed and agility. However, in several applications, such as infrastructure inspection, it is critical to achieve precise, controlled maneuvers with tunable performance. In this article, we present a novel heuristic approach to achieve tunable performance in RL-based Quadrotor control through reward design and termination conditions. We present a novel reward structure containing dual bandwidth exponentials that achieves a baseline critically damped response in setpoint tracking, with low steady-state errors. When trained with a Proximal Policy Optimization (PPO) algorithm, in conjunction with episode truncation conditions, the desired performance is achieved in 6 million time steps in a sample-efficient manner. In order to tune the performance about the baseline behavior, we present intuitive heuristic rules to adjust the reward weights and exponential coefficients to achieve faster (acrobatic-like) and slower (inspection-like) settling time performance, while retaining the baseline critically damped response and approximately 2% steady-state error. We evaluate the three RL policies (baseline, acrobatic, and inspection) across 100 trials and show accurate and tunable performance in position and yaw tracking from random initial conditions, thereby demonstrating the effectiveness of the proposed heuristic approach.

[LG-57] PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

Link: https://arxiv.org/abs/2605.19145
Authors: Srijith Nair,Atilla Eryilmaz,Jia (Kevin) Liu
Categories: Machine Learning (cs.LG)
Comments: 25 pages, 4 figures, 4 algorithms

Abstract:In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footprint of \mathcal{O}(d^2) for models with d parameters.

[LG-58] Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing

Link: https://arxiv.org/abs/2605.19135
Authors: Manal Benhamza,Marianne Clausel,Myriam Tami
Categories: Machine Learning (cs.LG)

Abstract:Causal representation learning (CRL) seeks to uncover meaningful latent variables and their corresponding causal structure from high-dimensional observational data. Identifiability is a crucial property in CRL, as it ensures the recovery of the mechanisms behind the data generation process, and hence the interpretability and robustness of the representation. Proving identifiability in CRL is intrinsically difficult, and we address in this work an even more challenging setting: multimodality. We consider multimodal observed data with a latent partially shared structure. Each modality is generated, through nonlinear mixing functions, from a specific subset of causal latent variables. Under flexible assumptions and without imposing any parametric distribution on the latent variables, we establish component-wise identifiability guarantees for the causal latent representation. Our identifiability results, furthermore, apply to the undercomplete scenario where we have, for each modality, more observed than latent variables. To instantiate our theoretical analysis, we introduce a Wasserstein-based module to recover the partially shared latent structure. Due to its differentiability, the latter can be easily integrated into all types of architecture, only requiring minimal changes. Extensive experiments on synthetic and realistic datasets validate the superiority of our approach over SOTA methods.

[LG-59] CLIC: Contextual Language-Informed Cardiac Pathology Classification ICLR2026

Link: https://arxiv.org/abs/2605.19132
Authors: Giovani D. Lucafo,Rafael da Costa Silva,João Lucas Luz Lima Sarcinelli,Andre Guarnier De Mitri,Diego Furtado Silva
Categories: Machine Learning (cs.LG)
Comments: 6 pages, 2 figures, accepted at the ICLR 2026 Workshop on Time Series in the Age of Large Models (TSALM)

Abstract:The electrocardiogram (ECG) is the gold standard for non-invasive diagnosis of cardiac pathologies and is a fundamental pillar of cardiovascular medicine. Recent progress in deep learning has led to the development of robust automated classifiers that achieve high performance by processing raw physiological signals. However, in clinical practice, diagnosis is rarely based solely on the signal. Cardiologists commonly support their interpretation with the patient’s characteristics and the specific data-acquisition context. Despite this, most current algorithms remain restricted to signal-only analysis, failing to integrate technical metadata and demographic variables. This paper proposes Contextual Language-Informed Cardiac pathology classification (CLIC), a multimodal framework that significantly enhances diagnostic precision by encoding these variables through natural language. We demonstrate that translating patient-level contextual data into descriptive text provides an informative anchor that helps the model disambiguate complex physiological patterns. We further investigate the use of Large Language Models to synthesize richer clinical descriptions and observe that, while these generated texts remain competitive, controlled template-based contextual clinical text leads to consistent improvements in downstream classification performance.

[LG-60] Performance Monitoring of Proton Exchange Membrane Water Electrolyzer by Transformers-Based Machine Learning Model

Link: https://arxiv.org/abs/2605.19107
Authors: Bingqing Chen,Ivan Batalov,Qiu Chen,Weiqi Ji,Lei Cheng
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Green hydrogen plays an essential role in decarbonization, with capacity projected to scale to 560 GW by 2030 (vs. 1.39 GW in 2023) in net-zero settings. Proton exchange membrane (PEM) electrolysis is one of the most promising technology routes to green hydrogen production, and real-time system health monitoring of PEM electrolyzers is essential for their scalable deployment. In lab settings, performance degradation can be characterized through electrochemical testing protocols by periodic pauses of normal operation. Such interruption is not practical for full-scale stack deployments, limiting system operators’ ability to make real-time assessments of state-of-health (SoH). We present a machine learning (ML) framework that performs virtual electrochemical characterization during normal operation. The method uses an encoder-decoder transformer, conditioned on operational data, to reconstruct characterization outputs, focusing here on polarization curves. Inspired by patch-based sequence tokenization, we segment the inputs into patches and encode them to form meaningful tokens, which substantially improves learning efficiency. Across four longitudinal runs, lasting up to 478 hours on different test cells and loading cycles, the model accurately reconstructed polarization curves and achieved 10x reduction in mean squared error (MSE) compared to a vanilla transformer. This proof-of-concept demonstrates that ML models can enable continuous performance monitoring for PEM electrolyzers and that the encoder captures meaningful latent representations of SoH, opening up opportunities to derive interpretable indicators in future work.

[LG-61] Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Link: https://arxiv.org/abs/2605.19101
Authors: Yanru Wu,Jianning Wang,Chongxin Gan,Yang Li
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30–40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.

[LG-62] Chessformer: A Unified Architecture for Chess Modeling

Link: https://arxiv.org/abs/2605.19091
Authors: Daniel Monroe,George Eilender,Philip Chalmers,Zhenwei Tang,Ashton Anderson
Categories: Machine Learning (cs.LG)
Comments: International Conference in Learning Representations (2026)

Abstract:Chess has long served as a canonical testbed for artificial intelligence, but modeling approaches for its central tasks have diverged. Maximizing playing strength, predicting human play, and enabling interpretability are typically solved with disparate architectures, and these designs are often misaligned with the geometry of the domain. This raises the natural question of whether these objectives require separate modeling paradigms, or if there exists a single architecture that supports them simultaneously. We introduce Chessformer, a unified architecture that advances the state of the art on all three central goals in chess modeling. Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) that adapts to domain-specific geometry, and predicts actions with an attention-based source-destination policy head. We evaluate Chessformer on each front. First, we develop Maia-3, a family of models for human move prediction that reaches 57.1% move-matching accuracy, significantly surpassing the previous state of the art with fewer than a quarter of the parameters. Second, we integrate Chessformer into Leela Chess Zero, a leading open-source engine, adding over 100 Elo of playing strength and resulting in tournament victories over Stockfish in major computer chess competitions. Third, we show that Chessformer’s square-token design makes attention patterns and activations directly attributable to board squares, enabling granular interpretability analyses that prior architectures do not naturally support. More broadly, our results demonstrate that aligning a model’s tokenization, positional encoding, and output design with the underlying structure of a domain can yield simultaneous gains in performance, human compatibility, and interpretability.

[LG-63] The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows

Link: https://arxiv.org/abs/2605.19076
Authors: Bipin Tiwari,Muhammad Abid,Omer San
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:Inferring unknown initial states in shock-dominated compressible flows from sparse and noisy measurements is a challenging ill-posed inverse problem due to nonlinear wave interactions and limited sensing. In this work, we develop a non-intrusive reduced-order modeling framework for efficient Bayesian initial-state inversion with uncertainty quantification. The framework combines a convolutional autoencoder with a learned latent-space forward operator. The autoencoder compresses high-dimensional flow fields into a compact nonlinear latent representation, while the forward operator predicts final-time latent states from encoded initial conditions. This AE-ROM surrogate enables rapid forward evaluations and is embedded within a No-U-Turn Sampler (NUTS) for posterior exploration. The framework is demonstrated using 500 high-fidelity Sod shock tube simulations generated through Latin hypercube sampling and solved using a fifth-order WENO scheme. The inverse problem seeks to recover unknown left and right density and pressure states from sparse noisy observations of final-time density and pressure fields. Results show that the AE-ROM accurately reconstructs key shock-tube structures, including the rarefaction wave, contact discontinuity, and shock front. A latent dimension of 32 provides an effective balance between reconstruction accuracy and reduced-space compactness, while 250 training simulations are sufficient for accurate reconstruction. Increasing observation density significantly contracts posterior uncertainty, reducing the mean posterior standard deviation by approximately 78% for density and 76% for pressure. Overall, the proposed framework provides a computationally efficient and uncertainty-aware approach for inverse analysis of shock-dominated flows, with potential extensions to multidimensional compressible-flow and digital-twin applications.

[LG-64] Mapping Uncharted Symmetries: Machine Discovery in Combinatorics

Link: https://arxiv.org/abs/2605.19063
Authors: Eugenio Cainelli,Lorenzo Luccioli,Alessandro Iraci,Michele D’Adderio,Giovanni Paolini
Categories: Machine Learning (cs.LG)
Comments: 20 pages

Abstract:Inspired by long-standing open problems in algebraic combinatorics, we show that modern machine learning can meaningfully contribute to verifiable mathematical discoveries. In particular, we focus on the construction of simple mathematical functions under exact distributional constraints, a setting we formalize as Simple Learning Under Rigid Proportions (SLURP). We tackle this problem by introducing two methods: MapSeek-Functional, which models the desired function by alternating pseudo-labeling and supervised training steps; and MapSeek-Symbolic, designed to directly produce symbolic formulas. We successfully apply both methods to a research problem in algebraic combinatorics, discovering a new combinatorial interpretation of the q,t-Narayana polynomials arising from representation theory. To our knowledge, this is the first such interpretation based on noncrossing partitions. Using one discovered statistic, we find a combinatorial proof of the symmetry of these polynomials in a previously unsolved case. To streamline verification and reproducibility, we release all code, including a formalization of all the mathematical discoveries of this paper in Lean 4.

[LG-65] Generative Pseudo-Force Fields for Molecular Generation

Link: https://arxiv.org/abs/2605.19050
Authors: Stefaan Simon Pierre Hessmann,Khaled Kahouli,Stefan Gugler,Michael Plainer,Frank Noé,Klaus-Robert Müller,Niklas Wolf Andreas Gebauer
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Quantitative Methods (q-bio.QM)

Abstract:Generating stable molecular conformations typically forces a tradeoff between the physical realism of energy-based relaxation and the sampling efficiency of data-driven generative models. While machine learning force fields (MLFFs) can sample stable conformations by relaxing molecular geometries according to physical forces, they require costly ab-initio training data. Conversely, diffusion models (DMs) learn from equilibrium data alone but are dependent on noise schedules and time-step conditioning. In this work, we propose generative pseudo-force fields (GPFFs) to bridge these paradigms by training an MLFF on a quadratic pseudo-potential energy surface relative to reference equilibrium structures. Because no ab-initio calculations are required for the perturbed geometries, non-equilibrium training data can be generated on the fly by perturbing the equilibria with Gaussian noise. We show that GPFFs constitute a time-step-agnostic variant of variance-exploding DMs: the score comes from the predicted pseudo-forces, but because force magnitudes implicitly encode the noise level, no time-step conditioning is needed. Our GPFF can hence be used as a drop-in replacement in standard diffusion sampling (ancestral, Heun) but also facilitates more efficient, adaptive variants and an MLFF-inspired direct denoising scheme. Our proposed sampling algorithms support arbitrary structural priors and geometric constraints. On QM9, GPFF has 100% validity at 256 neural function evaluations (NFE) and over 50% at just 6 NFE, outperforming diffusion baselines across all samplers. Combined with custom priors, we showcase the fast and accurate generation process of our method in a molecular editor for a drug design setting, where a molecule is generated in real time.
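
To make the quadratic pseudo-potential idea concrete, here is a small sketch under stated assumptions: a unit-stiffness energy E(x) = 0.5 * ||x - x_eq||^2, so the pseudo-force is simply x_eq - x. The function names are hypothetical, and the relaxation loop uses the exact pseudo-force as a stand-in for the trained network, which in the paper has to predict forces without access to x_eq.

```python
import random


def perturb(x_eq, sigma, rng):
    """Generate a non-equilibrium geometry by adding Gaussian noise to each coordinate."""
    return [[c + rng.gauss(0.0, sigma) for c in atom] for atom in x_eq]


def pseudo_force(x, x_eq):
    """Training label: F = -dE/dx for E = 0.5 * ||x - x_eq||^2, i.e. F = x_eq - x."""
    return [[e - c for c, e in zip(atom, atom_eq)] for atom, atom_eq in zip(x, x_eq)]


def relax(x, force_fn, step=0.5, n_steps=20):
    """Plain gradient-descent relaxation along the (pseudo-)forces."""
    for _ in range(n_steps):
        forces = force_fn(x)
        x = [[c + step * f for c, f in zip(atom, fa)] for atom, fa in zip(x, forces)]
    return x


# Hypothetical two-atom "molecule" with 3D coordinates.
x_eq = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
rng = random.Random(0)
x_noisy = perturb(x_eq, sigma=0.3, rng=rng)
labels = pseudo_force(x_noisy, x_eq)          # (geometry, force) training pair
x_relaxed = relax(x_noisy, lambda x: pseudo_force(x, x_eq))
```

Because the label is just the negative displacement, its magnitude scales with the noise level `sigma`, which is the property the abstract uses to drop time-step conditioning.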

[LG-66] Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

Link: https://arxiv.org/abs/2605.19038
Authors: Lorenzo Bonin,Francesco Giacomarra,Luca Bortolussi,Jyotirmoy V. Deshmukh,Francesca Cairoli
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes – a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.

[LG-67] Learning When to Adapt

Link: https://arxiv.org/abs/2605.19028
Authors: Ali Zindari,Xiaowen Jiang,Rotem Mulayoff,Sebastian U. Stich
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method, yet its learned correction is static: the same low-rank update is applied to every input. This input-agnostic approach creates an inevitable compromise between adapting to the fine-tuning distribution and preserving pre-trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input-Sensitive LoRA), which augments LoRA modules with lightweight input-dependent gates over individual rank-one components. The gating mechanism is designed to preserve the pre-trained model’s behavior by default, while training learns to activate selected components that reduce the fine-tuning loss. DISeL adds only a small number of parameters and preserves the low-rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine-tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine-tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine-tuning, giving insight into where task-specific adaptation is concentrated. Code available at this https URL .
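
A sketch of what input-dependent gating over rank-one LoRA components could look like. The abstract does not specify the gate, so the sigmoid gate with per-component weights `gate_w` and biases `gate_b` below is an assumption, as are all function names; only the overall shape (frozen weight plus gated sum of rank-one updates) follows the description.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, x):
    return [dot(row, x) for row in M]


def gated_lora_forward(x, W0, A, B, gate_w, gate_b):
    """Compute y = W0 x + sum_i g_i(x) * (a_i . x) * b_i.

    W0: frozen d_out x d_in weight; A: r x d_in (rows a_i);
    B: d_out x r (columns b_i); g_i(x) = sigmoid(gate_w[i] . x + gate_b[i]).
    """
    base = matvec(W0, x)
    proj = matvec(A, x)                                   # a_i . x per component
    gates = [1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
             for w, b in zip(gate_w, gate_b)]
    gated = [g * p for g, p in zip(gates, proj)]
    delta = matvec(B, gated)                              # sum_i g_i (a_i . x) b_i
    return [u + v for u, v in zip(base, delta)], gates


# Hypothetical rank-1 adapter on a 2x2 identity layer.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]
B = [[1.0], [0.0]]
x = [2.0, 3.0]
y_closed, _ = gated_lora_forward(x, W0, A, B, gate_w=[[0.0, 0.0]], gate_b=[-30.0])
y_open, _ = gated_lora_forward(x, W0, A, B, gate_w=[[0.0, 0.0]], gate_b=[30.0])
```

With a strongly negative gate bias the layer reproduces the frozen model's output almost exactly, and with an open gate it reduces to an ordinary LoRA update; a learned `gate_w` would let that choice depend on the input.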

[LG-68] Deep Neural Sheaf Diffusion ICML2026

Link: https://arxiv.org/abs/2605.19021
Authors: Remi Bourgerie,Sarunas Girdzijauskas,Viktoria Fodor
Categories: Machine Learning (cs.LG)
Comments: Under review at GFM@ICML2026

Abstract:Deep Graph Neural Networks (GNNs) are essential for capturing complex dependencies in graph-structured data. However, scaling GNNs to depth remains challenging, as stacking layers leads to representation collapse and diminishing sensitivity due to repeated aggregation. While Neural Sheaf Diffusion (NSD) provides strong theoretical guarantees against such collapse, these guarantees do not translate to practice: as depth increases, the disagreement signal of the sheaf Laplacian vanishes, limiting the contribution of deeper layers. We identify mechanisms that hinder NSD effectiveness at depth and propose Deep Neural Sheaf Diffusion (DNSD), which replaces the sheaf Laplacian with a sheaf adjacency operator to maintain informative signals across layers. This is complemented by normalization, odd nonlinearities, and gating. To provide a principled explanation of the expected performance improvement, we contrast sheaf diffusion to graph attention mechanisms, highlighting that DNSD replaces scalar attention scores with matrix-valued edge functions and normalizes node representations rather than attention scores. We demonstrate empirically that DNSD effectively utilizes deep aggregation in graph tasks, outperforming GNN and NSD baselines by up to 30 pp in accuracy on synthetic long-range datasets, and consistently outperforming them on real-world benchmarks. These results position sheaf-based architectures as a promising building block for graph foundation models by supporting effective deep architectures.

[LG-69] LoRA vs. Full Fine-Tuning: A Theoretical Perspective

Link: https://arxiv.org/abs/2605.19018
Authors: Ali Zindari, Rotem Mulayoff, Sebastian U. Stich
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:Fine-tuning adapts a pre-trained model to downstream tasks using a small amount of labeled data. Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that reduces memory and computation costs while often achieving performance close to full fine-tuning. Despite its widespread use, the theoretical behavior of LoRA is not yet well understood. In this paper, we study LoRA in a simple linear regression setting and compare its excess risk with that of full fine-tuning. Our analysis identifies regimes in which LoRA achieves lower excess risk than full fine-tuning in both overdetermined and underdetermined settings. Specifically, our theory predicts that LoRA can outperform full fine-tuning when the difference between the pretraining and the downstream tasks is effectively low-rank. We further show how the choice of LoRA rank affects generalization performance, explaining why using a very small rank can improve test accuracy in certain settings, even though it limits model expressivity. Finally, we support our theoretical results with experiments on practical tasks, suggesting that the identified tradeoffs and insights extend beyond linear regression.

[LG-70] SAGA: A Sequence-Adaptive Generative Architecture for Multi-Horizon Probabilistic Forecasting with Adaptive Temporal Conformal Prediction

Link: https://arxiv.org/abs/2605.19014
Authors: Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Hafize Gonca Cömert
Categories: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
Comments: 14 pages, 3 figures, 12 tables, 5 appendices, 45 references. Submitted to IEEE TPAMI. Source code at this https URL (archived: doi: https://doi.org/10.5281/zenodo.20260366 ). Synthetic equivalent dataset: doi: https://doi.org/10.5281/zenodo.20260287 . Empirical work conducted on the Swedish LISA register via SCB MONA (project SCB-MONA-2026-147); ethical approval Swedish Ethical Review Authority 2026-04127-01

Abstract:Microsimulation models used by ministries of finance and central banks rely on parametric processes for lifetime earnings that capture only first and second moments of the conditional distribution and miss long-range nonlinear structure. We propose SAGA, a decoder-only transformer for irregular tabular panel sequences, paired with a split conformal calibration wrapper that delivers individual-level prediction intervals with finite-sample marginal coverage guarantees. Trained on the longitudinal Swedish LISA register over 1990 to 2022, comprising 2,143,817 individuals and 61,284,903 person-years, the model forecasts annual labor earnings at horizons of one to thirty years and aggregates them by Monte Carlo into present-discounted lifetime earnings distributions. Against the canonical Guvenen, Karahan, Ozkan, and Song parametric process and tabular and recurrent baselines, SAGA reduces continuous ranked probability score by 31.9 percent at the ten-year horizon and mean absolute error by 37.7 percent at the twenty-year horizon. Conformal intervals achieve nominal coverage to within 0.4 percentage points marginally and within 2.4 percentage points on the worst-case demographic subgroup. The reconstructed lifetime earnings Gini coefficient is 0.327 against the partially observed truth of 0.341 and the GKOS estimate of 0.378. Model weights, calibration tables, and a synthetic equivalent dataset are released for replication outside the protected SCB MONA environment.
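
The split conformal wrapper mentioned in this abstract rests on a standard construction that can be sketched in a few lines. This is the textbook absolute-residual version, not the paper's adaptive temporal variant; the function name `split_conformal_halfwidth` and the toy residuals are illustrative assumptions.

```python
import math

def split_conformal_halfwidth(cal_residuals, alpha):
    """Return q such that [pred - q, pred + q] has marginal coverage >= 1 - alpha.

    cal_residuals are absolute errors |y - prediction| on a held-out calibration
    set; the guarantee assumes calibration and test points are exchangeable.
    """
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))    # rank of the conformal quantile
    if k > n:
        return math.inf                     # too few calibration points for this alpha
    return sorted(cal_residuals)[k - 1]

# Toy calibration residuals (made-up numbers, for illustration only).
residuals = [0.5, 1.2, 0.3, 2.0, 0.8, 1.5, 0.1, 0.9, 1.1, 0.7]
q = split_conformal_halfwidth(residuals, alpha=0.2)   # 9th smallest of 10 -> 1.5
interval = (10.0 - q, 10.0 + q)                       # interval around a point forecast of 10.0
```

The `(n + 1)` correction in the rank is what gives the finite-sample guarantee; using the plain empirical quantile would only be valid asymptotically.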

[LG-71] Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

Link: https://arxiv.org/abs/2605.18999
Authors: Yury Demidovich, Abhishek Chakraborty, Grigory Malinovsky, Angelia Nedić, Peter Richtárik
Categories: Machine Learning (cs.LG)

Abstract:Muon and related normalized optimizers decouple the choice of update direction from the choice of step scale, but their practical performance remains sensitive to the scale of the normalized step. We study adaptive scaling rules for Muon in general norm geometries and develop three complementary algorithms. For smooth non-convex objectives, we introduce Distance-Adaptive Muon, whose trust-region radius is set from the radius explored by the trajectory, and prove a stationarity guarantee under a bounded-trajectory assumption. We then turn to star-convex objectives, a tractable model of the favorable global geometry often used to reason about the empirical loss landscapes of deep neural networks, where objective-gap guarantees are possible. In this setting, we first introduce Scale-Calibrated Muon, which keeps Muon’s exponential moving average but sets the step length from a local descent certificate computed from the current gradient and momentum. For this method, we prove a last-iterate O(1/T) objective-gap bound under a bounded initial sublevel-set assumption, where the corresponding radius parameter appears only in the analysis and not in the algorithm. Finally, we develop Distance-Free Muon, a recentered trust-region method that uses a scalar distance certificate and a majorized one-dimensional search to select the trust-region radius without requiring the unknown distance from the initialization to a global minimizer. Experiments on Transformer language modeling (GPT-124M/WikiText-103) and image classification (ViT-Tiny/CIFAR-100) show that the proposed adaptive scaling rules reduce sensitivity to manual scale tuning and match or improve tuned fixed-scale Muon baselines under the tested budgets.

[LG-72] TabQL: In-Context Q-Learning with Tabular Foundation Models

Link: https://arxiv.org/abs/2605.18979
Authors: Qisai Liu, Zhanhong Jiang, Timilehin Ayanlade, Ashutosh Kumar Nirala, Yang Li, Aditya Balu, Soumik Sarkar
Categories: Machine Learning (cs.LG)

Abstract:We propose Tabular Q-Learning (TabQL), a reinforcement learning framework that replaces the conventional parametric Q-network in Deep Q-Learning (DQN) with a tabular foundation model endowed with in-context learning capabilities. The key idea is to represent Q-values through a sequence-to-sequence foundation model operating over a tabularized representation of state-action-Q-value tuples, enabling rapid adaptation from limited online interaction by conditioning on recent experience. TabQL departs from classical DQN by leveraging (i) zero- or few-shot Q-value inference via in-context updates, and (ii) a warm-up phase using standard DQN to bootstrap high-quality context. In particular, to enhance context quality, new transitions are generated by executing actions output by TabQL with predicted Q values from DQN. We formalize TabQL, analyze its convergence and sample complexity under mild assumptions, and show that TabQL interpolates between vanilla Q-learning and DQN with in-context learning. Our analysis demonstrates that TabQL achieves improved efficiency compared to DQN by amortizing Bellman updates through in-context learning. Extensive numerical experiments on several benchmarks demonstrate the effectiveness of the proposed TabQL.

[LG-73] A Geometric Analysis of Sign-Magnitude Asymmetry in a ReLU RMSNorm Block under Ternary Quantization

Link: https://arxiv.org/abs/2605.18933
Authors: Lei Dong
Categories: Machine Learning (cs.LG)
Comments: 53 pages, 2 figures, 21 tables, 7 appendices

Abstract:Pre-norm Transformers with RMSNorm tolerate ternary \{-1,0,+1\} weight quantization with surprisingly small loss (Ma et al., 2024). We give a geometric explanation via sign-magnitude decomposition of weight perturbations. In a two-layer ReLU + RMSNorm model with i.i.d. Gaussian weights, sign-flips produce \pi/(\pi-2) \approx 2.75 times more transverse output energy than sign-preserving magnitude perturbations of equal Frobenius norm, as the flip rate p \to 0 (Theorem 3). The mechanism: ReLU creates a hidden-space directional asymmetry between the two perturbation types, which RMSNorm’s transverse-projection Fréchet derivative selectively exposes. Sign-quantization error is itself a sign-preserving perturbation with angular alignment \cos^2 \to 2/\pi (Theorem 4); its post-ReLU radial fraction ( 0.365 ) matches the pre-ReLU value 1-2/\pi within 0.4% , so ReLU is approximately transparent to ternary error. Multi-layer compounding of the 2.75\times factor is not experimentally supported; the gap to real-model sign sensitivity arises from outlier features violating delocalization. For an input dimension with amplitude \alpha , a single sign-flip produces post-ReLU energy amplified by R \approx n\alpha^2 relative to a delocalized entry. On TinyLlama-1.1B, at linear response ( p \leq 0.5% ), count-matched NLL leverage stabilizes at \sim 10\times \approx n\mathbb{E}[\alpha^2] , matching the per-entry theory; the all-column NLL ratio of 5.0\times falls within R_{\mathrm{col}} \leq 19 ( 67\times PPL gap reflects metric nonlinearity). Measured outlier \alpha at layer 12 (median 0.024 , max 0.26 ) confirms heavy-tailed concentration. The Bussgang constant 2/\pi , RMSNorm geometry, and ReLU half-space structure together explain sign-magnitude asymmetry in pre-norm models, with R \propto n\alpha^2 accounting for real-model deviations.
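
The constants quoted in this abstract can be checked numerically. The sketch below uses one plausible reading of the 2/\pi alignment, namely the standard fact that an i.i.d. Gaussian vector w and its sign vector sign(w) have squared cosine similarity approaching 2/\pi as the dimension grows. It does not reproduce the paper's theorems, and the function name `sign_quantization_cos2` is an illustrative assumption.

```python
import math
import random

def sign_quantization_cos2(n, seed=0):
    """Squared cosine between a Gaussian vector w and sign(w), by Monte Carlo."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    dot = sum(abs(v) for v in w)                  # <w, sign(w)> = sum of |w_i|
    norm_w = math.sqrt(sum(v * v for v in w))     # ||w||; note ||sign(w)|| = sqrt(n)
    cos = dot / (norm_w * math.sqrt(n))
    return cos * cos

estimate = sign_quantization_cos2(200_000)
bussgang = 2 / math.pi                    # ~0.6366, the limit of cos^2
radial_complement = 1 - 2 / math.pi       # ~0.3634, the pre-ReLU value quoted above
flip_ratio = math.pi / (math.pi - 2)      # ~2.75, the sign-flip energy ratio quoted above
print(estimate, bussgang, radial_complement, flip_ratio)
```

The limit follows from the law of large numbers: the cosine tends to E|w| / sqrt(E[w^2]) = sqrt(2/\pi) for a standard Gaussian entry.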

[LG-74] Descriptive versus Regulatory Uncertainty in Bounded Predictive Systems

Link: https://arxiv.org/abs/2605.18909
Authors: Ahmed Gamal Eldin
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Any system that models the world under finite representational capacity must compress; any compression entails a prior; and the prior is the system’s bias. What has not been established is whether uncertainty participates in the dynamics governing future behavior, or merely describes the output distribution without consequence. We introduce a structural distinction between descriptive uncertainty, which does not recursively modulate the system’s policy, and regulatory uncertainty, which directly enters the optimization landscape and drives persistent adaptive restructuring. We prove formally that current transformer architectures are confined to descriptive uncertainty at inference. We ground this in thermodynamics via Landauer’s principle: for uncertainty to be regulatory, epistemic error must cost real energy; in a decoupled system, hallucinations and correct derivations dissipate identical energy. We test this empirically across three locally-deployed language models (3B, 8B, 70B parameters). Token-level Shannon entropy is statistically invariant across tasks spanning pattern retrieval, causal operator application, and out-of-distribution causal generalization in all three models (all pairwise p = 0.568; within-model ranges 0.011-0.028 nats), while task accuracy varies substantially across the same conditions (0%-100%). Entropy and accuracy are orthogonal. The decoupling is scale-invariant: larger models achieve higher accuracy but identical entropy flatness. This structural incapacity is not resolvable by additional parameters or training data. Genuine epistemic grounding requires physical coupling between thermodynamic substrate state and information processing cost.

[LG-75] Variational Diffusion Channel Decoder

Link: https://arxiv.org/abs/2605.18902
Authors: Chengwei Zhang, Yifan Du, Siyu Liao
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:Neural channel decoders, as a data-driven channel decoding strategy, have shown very promising improvements in error-correcting capability over classical methods. However, the success of these deep learning-based decoders comes at the cost of drastically increased model storage and computational complexity, hindering their practical adoption in real-world time-sensitive and resource-sensitive communication and storage systems. To address this challenge, we propose an efficient variational diffusion model-based channel decoder, which effectively integrates the domain-specific belief propagation process into the modern diffusion model. By reaping the low-cost benefits of belief propagation and the strong learning capability of diffusion models, our proposed neural decoder simultaneously achieves very low cost and high error-correcting performance. Experimental results show that, compared with state-of-the-art neural channel decoders, our model provides a feasible solution for practical deployment by achieving the best decoding performance with significantly reduced computational cost and model size.

[LG-76] A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

Link: https://arxiv.org/abs/2605.18898
Authors: Tiexin Ding
Categories: Machine Learning (cs.LG)
Comments: 27 pages, 14 figures. Companion library npm-weibull-py and benchmark database available at this https URL

Abstract:We apply the Weibull distribution – a two-parameter family from extreme-value theory – as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o – the Transmission Class – fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k – the Selection Class – depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia’s merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at this https URL .
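
A probability-plot fit of the kind this abstract describes can be sketched as follows. This is not the npm-weibull-py API: the function name `weibull_shape_middle80`, the plotting position (i + 0.5)/n, and the ordinary least-squares fit are assumptions, and the authors' exact protocol may differ. The sketch fits the line ln(-ln(1 - F)) = k ln x - k ln(lambda) over the middle 80% of the sorted magnitudes.

```python
import math
import random

def weibull_shape_middle80(values):
    """Least-squares fit of the Weibull probability plot over the middle 80%."""
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    pts = []
    for i, x in enumerate(xs):
        p = (i + 0.5) / n                      # plotting position (an assumption)
        if 0.1 <= p <= 0.9:                    # keep the middle 80% only
            pts.append((math.log(x), math.log(-math.log(1.0 - p))))
    mx = sum(a for a, _ in pts) / len(pts)
    my = sum(b for _, b in pts) / len(pts)
    sxy = sum((a - mx) * (b - my) for a, b in pts)
    sxx = sum((a - mx) ** 2 for a, _ in pts)
    k = sxy / sxx                              # slope = shape parameter k
    lam = math.exp(mx - my / k)                # intercept = -k * ln(lambda)
    return k, lam

rng = random.Random(0)
# Magnitudes of i.i.d. Gaussian "weights", i.e. a half-normal sample.
half_normal = [abs(rng.gauss(0.0, 1.0)) for _ in range(50_000)]
k_init, lam_init = weibull_shape_middle80(half_normal)

# Sanity check on data that really is Weibull (scale 1.0, shape 2.0).
weibull_sample = [rng.weibullvariate(1.0, 2.0) for _ in range(50_000)]
k_check, lam_check = weibull_shape_middle80(weibull_sample)
```

On the half-normal sample the fitted k should land near the initialization anchor the abstract quotes, although the exact value depends on the plotting position and fitting choices assumed here.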

[LG-77] Position: Graph Condensation Needs a Reset – Move Beyond Full-dataset Training and Model-Dependence

Link: https://arxiv.org/abs/2605.18893
Authors: Mridul Gupta, Samyak Jain, Vansh Ramani, Hariprasad Kodamana, Sayan Ranu
Categories: Machine Learning (cs.LG)

Abstract:Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation – the task of generating a smaller synthetic graph that retains the performance of models trained on the original – has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community’s reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental – they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.

[LG-78] Emergence of a Flow-Assisted Casting Strategy for Olfactory Navigation via Memory-Augmented Reinforcement Learning

Link: https://arxiv.org/abs/2605.18881
Authors: Changxu Zhao, Dongxiao Zhao, Xin Bian, Gaojin Li
Categories: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)

Abstract:In dynamic flow fields, various animals exhibit remarkable odor search capabilities despite relying on stochastic detections. Interestingly, there exists an optimal time window for integrating these detections that maximizes search efficiency. To understand the underlying mechanism, we investigate the navigation performance of Reinforcement Learning (RL) agents in unsteady flows under varying memory lengths and flow conditions. Without any predefined models, the agents develop a flow-assisted casting strategy and adaptively adjust both the geometry of their search trajectories and the concentration threshold for initiating casting to maximize the success rate. The agent’s average speed toward the odor source exhibits a non-monotonic dependence on memory length, which can be explained by the “sector-search” model.

[LG-79] Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows

Link: https://arxiv.org/abs/2605.18870
Authors: Alex Massucco, Leonardo Del Grande, Marcello Carioni, Christoff Brune, Carola-Bibiane Schönlieb
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Functional Analysis (math.FA)

Abstract:In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. However, from a theoretical point of view, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions. In this paper, we reduce this gap by modelling the data flow in multi-headed transformer architectures as time-dependent gradient flows for a suitable interaction energy capturing the design of the attention mechanism. The explicit dependence on time allows us to consider different weights for each head and for each layer, without imposing constraints on the initialization method. Moreover, we prove that, under a suitable integrability assumption on the evolution of the weights, each element of the \omega -limit set of the gradient flows is a stationary point of the interaction energy at a limiting weight distribution. Finally, we analyse the stability of the gradient flows considering perturbations of both the initial data and the weights. Specifically, on the one hand, we study the robustness of the proposed models with respect to noisy inputs, establishing a continuous dependence of the gradient flows on the initial data and uniqueness of the flows. On the other hand, we prove the \Gamma -convergence of the perturbed interaction energy to the unperturbed one, leading to the convergence of the corresponding gradient flows. We complement these theoretical results with numerical experiments that confirm the predicted energy-dissipation identity and clarify the asymptotic behavior of the dynamics in both the autonomous-like (Ornstein–Uhlenbeck) and the genuinely non-autonomous (oscillating-weights) regimes.

[LG-80] Evaluating Memory Condensation Strategies for Coding Agents in Data-Driven Scientific Discovery

Link: https://arxiv.org/abs/2605.18854
Authors: Renuka Chintalapati, Sid Raskar, Anurag Acharya, Jared Willard, Patrick Emami, Sameera Horawalavithana
Categories: Machine Learning (cs.LG)

Abstract:Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed, from simple sliding windows to LLM-generated summaries, no systematic comparison exists to guide strategy selection, especially in scientific discovery tasks. We evaluate eight memory condensation strategies using GPT-4o on sixty DiscoveryBench tasks spanning six scientific domains (480 total evaluations). We find that no condenser significantly alters hypothesis quality, while LLM-based condensers increase token costs by 24-94 percent, and masking tool-call outputs achieves an 8.6 percent net savings. We also observe that the optimal condenser for data-driven scientific discovery varies by scientific domain and task length.

[LG-81] STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

Link: https://arxiv.org/abs/2605.18851
Authors: Junjie Zhang, Guozheng Ma, Shunyu Liu, Zetian Hu, Yongcheng Jing, Ting-En Lin, Yongbin Li, Dacheng Tao
Categories: Machine Learning (cs.LG)

Abstract:Recent advances in Reinforcement Learning (RL) have underscored its potential for incentivizing reasoning capabilities of Large Language Models (LLMs). However, existing step-level efforts suffer from costly annotations that limit domain coverage, while scalar scores further impose an information bottleneck, offering insufficient semantic bandwidth to improve intermediate decisions. Alternative language-critique approaches, which rely on frozen or external critics, provide richer textual feedback but lack the scalability needed for sustained policy improvement. In this work, we propose language-driven stepwise trajectory redirection, termed STRIDE, a novel training framework that shifts process supervision from scalar rewards to learnable stepwise language feedback. Specifically, we co-train a generator and a generative verifier using only outcome-based rewards, eliminating external annotations, while delivering sustained policy improvement through jointly aligned verifier training. The verifier’s stepwise language critiques explicitly localize and explain failures, enabling the generator to redirect reasoning trajectories at intermediate steps toward alternative decisions. The trajectory redirection design guarantees harmless policy improvement, even under noisy or suboptimal verifier feedback. Experiments on diverse reasoning benchmarks show that STRIDE significantly outperforms state-of-the-art baselines and achieves breakthroughs on zero-pass-rate problems where scalar methods yield no learning signal in our ablation studies, demonstrating the effectiveness of learnable stepwise language feedback for enhancing LLM reasoning.

[LG-82] TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

Link: https://arxiv.org/abs/2605.18843
Authors: Zeyu Zhang, Bradly C. Stadie
Categories: Machine Learning (cs.LG)
Comments: 9 pages in main context

Abstract:Backtesting large language models on historical events requires reasoning exclusively from information available before a specified cutoff date. Yet models routinely leak post-cutoff knowledge from pre-training into their reasoning, inflating apparent accuracy and undermining evaluation validity. Prompt-based constraints fail when suppressed content is causally related to the prediction, and knowledge unlearning cannot address this problem because temporal compliance is instance-specific: the same fact may be legitimate evidence for one cutoff date and a violation for another. Rather than erasing knowledge, the model must learn temporal discipline: selecting evidence conditioned on each instance’s cutoff date. We propose TEMPO (Temporal Enforcement via Mode-separated Policy Optimization), which trains this discipline via two contributions: (1) a two-mode reward where a leakage mode drives post-cutoff claims to zero as a hard prerequisite before a performance mode optimizes task performance; and (2) a GRPO-based training pipeline that enables the model to discover temporally valid reasoning strategies. We prove that training monotonically decreases leakage, converges to the leak-free optimum, and improves task performance once compliance is achieved. On three prediction tasks and two models, TEMPO reduces leakage from 2-13% to 0.6-3.7% across all conditions, with task performance improving 6-13% where strong pre-cutoff signals exist and maintained where the prediction task is inherently difficult from valid information alone.

[LG-83] Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

Link: https://arxiv.org/abs/2605.18842
Authors: Timofey Tomashevskiy
Categories: Machine Learning (cs.LG)
Comments: Preprint version

Abstract:Safe reinforcement learning in nonstationary environments requires safety mechanisms that adapt as environmental conditions change. Standard safe reinforcement learning methods often assume fixed constraints or stable environmental conditions, which can become inadequate under distribution shift. We propose LILAC+, a framework for safe continual reinforcement learning under nonstationarity that combines three adaptive safety mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state safety enforcement. Context-based constraints adjust safety requirements using inferred and predicted environmental context. Adaptation-speed constraints tighten safety requirements when the rate of environmental change exceeds the agent’s ability to adapt safely. Budget-to-state enforcement converts cumulative safety requirements into local state-level control constraints that can be enforced at decision time. Together, these mechanisms provide a unified approach for proactive and reactive safety adaptation in continual reinforcement learning. We evaluate the framework in simulated driving environments under stationary, seen nonstationary, and unseen nonstationary conditions. The results show that adaptive safety constraints substantially reduce safety violations under distribution shift while maintaining competitive task performance compared with unconstrained and fixed-constraint baselines. These findings suggest that safe continual reinforcement learning requires adaptive constraint mechanisms that respond not only to current state information but also to predicted environmental context, adaptation demand, and remaining safety budget.

[LG-84] From Cumulative Constraints to Adaptive Runtime Safety Control for Nonstationary Reinforcement Learning

Link: https://arxiv.org/abs/2605.18841
Authors: Timofey Tomashevskiy
Categories: Machine Learning (cs.LG)
Comments: 13 pages. Preprint version

Abstract:Safety in reinforcement learning is often specified through cumulative cost constraints, but these trajectory-level guarantees do not directly prevent unsafe individual decisions, especially under nonstationarity. In continual and nonstationary settings, the difficulty is amplified because the risk associated with the same action can vary across contexts, while a fixed state-level threshold may be either too conservative or too weak. We propose Constraint Projection Safety Shield (CPSS), a runtime mechanism that converts a cumulative safety budget into adaptive state-level control constraints during execution. CPSS tracks the remaining safety budget, projects it into a time-varying admissible risk threshold, and filters policy actions whose predicted safety cost exceeds the active threshold. The threshold is adjusted online using contextual signals so that enforcement becomes stricter in more demanding or rapidly changing regimes and less restrictive when the available safety budget is sufficient. We analyze the resulting shielded policy and show that the mechanism guarantees per-state threshold satisfaction for executed actions, induces finite-horizon cumulative cost bounds, and yields a performance degradation bound in terms of intervention frequency and per-step reward distortion. We evaluate CPSS in nonstationary highway merging scenarios using highway-env. Across multiple seeds, CPSS substantially reduces proximity-based safety violations and increases separation margins while intervening selectively rather than dominating the learned policy. These results support adaptive budget-to-threshold projection as a practical way to transform cumulative safety specifications into effective local safety control for continual reinforcement learning systems.
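
The budget-to-threshold idea in this abstract can be illustrated with a toy sketch. It is not the CPSS algorithm: the even-spread projection rule, the `strictness` factor standing in for the contextual signal, the fallback to the lowest-cost action, and all names and numbers are assumptions made for illustration.

```python
def admissible_threshold(remaining_budget, steps_left, strictness=1.0):
    """Project a cumulative safety budget onto a per-step risk threshold.

    Hypothetical rule: spread the remaining budget evenly over the remaining
    steps; strictness > 1 tightens the threshold in more demanding contexts.
    """
    if steps_left <= 0:
        return 0.0
    return max(remaining_budget, 0.0) / (steps_left * strictness)

def shield(action_costs, proposed, threshold):
    """Keep the policy's action if its predicted cost is admissible.

    Otherwise fall back to the lowest-cost action. This simple fallback does
    not by itself guarantee the threshold is met if every action exceeds it.
    """
    if action_costs[proposed] <= threshold:
        return proposed
    return min(action_costs, key=action_costs.get)

budget, horizon = 1.0, 10
# Predicted per-step safety costs for each action (made-up values).
costs = {"merge_now": 0.4, "wait": 0.05, "slow_down": 0.02}

tau = admissible_threshold(budget, horizon)                    # 1.0 / 10 = 0.1
chosen = shield(costs, "merge_now", tau)                       # 0.4 > 0.1, so the shield intervenes
kept = shield(costs, "wait", tau)                              # 0.05 <= 0.1, so the policy's action stands
tau_strict = admissible_threshold(budget, horizon, strictness=4.0)   # tighter threshold
```

After each step the executed action's cost would be subtracted from the remaining budget, so the threshold loosens or tightens over the episode depending on how much budget has been spent.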

[LG-85] StampFormer: A Physics-Guided Material-Geometry-Coupled Multimodal Model for Rapid Prediction of Physical Fields in Sheet Metal Stamping

Link: https://arxiv.org/abs/2605.18835
Authors: Jiajie Luo, Mohamed Mohamed, Osama Hassan, Haosu Zhou, Yingxue Zhao, Haoran Li, Xinrun Li, Zhutao Shao, Yang Long, Nan Li, Jichun Li
Categories: Machine Learning (cs.LG)

Abstract:Traditional sheet metal forming relies on time-consuming and expensive Finite Element Analysis (FEA) for design validation, a process that significantly prolongs design cycles. While surrogate models offer faster iteration, current approaches have limitations: scalar-based methods cannot capture comprehensive field-based FEA results, while existing image-based models often ignore the critical role of material properties by focusing solely on geometry. To address this gap, we develop a physics-guided deep learning framework, namely StampFormer, which simultaneously uses component geometry and material stress-strain responses to predict FEA outcomes. The StampFormer framework uses three core components to process data. A Material-Augmented Geometric Network (MAGN) first fuses geometric and material data. This information is then integrated at various levels by a Hierarchical Material Embedding Injection Unit (HMEIU) before being processed by the primary network backbone, an adapted Swin-UNet. We evaluated our model on the stamping of a crossmember panel with two simulation datasets for steel and aluminium panels, and results demonstrate that StampFormer provides high-fidelity predictions of critical physical fields - including thinning, major strain, minor strain, plastic strain, and displacement - in under a second. Compared with ground truth FEA, our model achieved an average relative error of less than 8.5% on the four 2D fields and a mean squared error of less than 1.2 mm^2 for the 3D displacement field. In summary, we introduce a practical and efficient framework that integrates multimodal information, namely geometry and material properties, to provide fast and accurate predictions, enabling designers to perform real-time manufacturability assessments.

[LG-86] In-Context Learning Operates as Concept Subspace Learning

Link: https://arxiv.org/abs/2605.18830
Authors: Wei Tang, Xinyan Jiang, Fakhri Karray, Lijie Hu
Categories: Machine Learning (cs.LG)

Abstract:Regression and Bayesian accounts of in-context learning (ICL) explain how demonstrations can induce predictors, while mechanistic analyses often identify compact activation directions that steer prompted behavior. However, it remains unclear whether structured demonstrations induce low-dimensional concept inference. We study this question through a concept-subspace view of ICL, in which tasks vary only along intrinsic concept coordinates, although inputs are observed in a high-dimensional ambient space. For ridge and least-squares ICL proxies, prediction decomposes exactly into concept-coordinate regression and off-subspace leakage. Under block-diagonal or near-block-diagonal covariance assumptions, the leading estimation and nuisance-sensitivity terms scale with the dimension of the concept subspace, while residual effects are controlled by cross-subspace coupling. This separation gives a mechanistic prediction: recoverable task information should concentrate in a low-dimensional, task-aligned activation subspace. On CounterFact-derived multi-relation prompts with Llama-3-8B, a 68–73-dimensional subspace of the 4096-dimensional residual stream restores 78.8% of the clean–corrupted accuracy gap, whereas patching the complementary subspace restores 0%. Concept swaps redirect predictions toward injected relations, while random and cross-task matched-rank controls are largely ineffective. Additional experiments on Qwen2.5-7B and a controlled cross-lingual rule task show the same qualitative pattern. These results support concept subspaces as compact, task-aligned mediators of recoverable ICL behavior in structured task families, without implying full-circuit recovery.

[LG-87] Lossless Anti-Distillation Sampling

Link: https://arxiv.org/abs/2605.18829
Authors: Zibo Diao, Jingchu Gai, Xinyue Ai, Zhang Zhang, Zhenyu He, Di He
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Frontier commercial generative models face a growing threat from distillation, whereby a distiller harvests generated responses and trains a competing model of its own at drastically lower cost. Existing defenses either rely on modifying the model's outputs, thereby sacrificing response quality for benign users, or on behavioral detection methods, which can be readily circumvented by distributing queries across multiple accounts. In this work, we propose Lossless Anti-Distillation Sampling (LADS), a novel sampling scheme specifically designed to counter multi-account distillation while maintaining a lossless experience for benign users. Concretely, LADS derives the randomness underlying each generation from a private seed determined by the semantic content of the query and the number of times the user has queried the model. By construction, every benign user receives a response independently sampled from the original model at each visit, and thus experiences no distortion. In contrast, for a distiller, different accounts share latent randomness whenever their queries fall in the same semantic bucket. As a result, the harvested data becomes correlated, potentially reducing sample diversity and degrading generalization. Using uniform convergence theory, we show that LADS provably degrades the convergence rate of the distiller's generalization gap relative to standard i.i.d. sampling in both unconditional and conditional generation settings. Experiments on image generation, mathematical reasoning, and code generation confirm that LADS substantially degrades the performance of distilled students while preserving exact statistical fidelity for individual users.
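
The seeding idea in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `SECRET_KEY`, the crude word-set `semantic_bucket` function, the use of HMAC-SHA256, and counting visits per (user, bucket) pair.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"provider-private-key"  # hypothetical provider-side secret


def semantic_bucket(query: str) -> str:
    """Crude stand-in for a semantic bucket: the sorted set of lowercase words."""
    return " ".join(sorted(set(query.lower().split())))


def derive_seed(query: str, visit_count: int) -> int:
    """Derive a private seed from the query's bucket and a visit counter."""
    message = f"{semantic_bucket(query)}|{visit_count}".encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


class BucketSeededSampler:
    """Toy sampler whose randomness depends only on (bucket, visit count)."""

    def __init__(self):
        self.visits = {}  # (user, bucket) -> number of earlier queries

    def sample(self, user: str, query: str, vocab: list) -> list:
        bucket = semantic_bucket(query)
        count = self.visits.get((user, bucket), 0)
        self.visits[(user, bucket)] = count + 1
        rng = random.Random(derive_seed(query, count))
        # Stand-in for decoding: draw five tokens with the seeded generator.
        return [rng.choice(vocab) for _ in range(5)]
```

In this toy version, one user who repeats a query gets a fresh seed on each visit, while two accounts issuing their first query in the same bucket receive identical draws, which is the correlation the abstract describes.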

[LG-88] Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

Link: https://arxiv.org/abs/2605.18825
Authors: Shaoke Fang, Ziang Li, Wenfei Wu, Jiatong Ji, Qingsong Liu, Ruizhi Pu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Prefix caching is a key optimization in Large Language Model (LLM) serving, reusing attention Key-Value (KV) states across requests with shared prompt prefixes to reduce expensive prefill computation. However, its benefit depends critically on the eviction policy as GPU memory is scarce, and existing policies such as LRU largely treat cached blocks uniformly. This view ignores a fundamental property of LLM prompts: not all tokens are equally worth caching. We show that different token types within a prompt, including system prompts, user queries, tool outputs, model responses, and chain-of-thought reasoning, exhibit up to 756x variation in reuse rates, yet no existing eviction policy exploits this signal. In this paper, we present SAECache (Semantic-Adaptive Eviction for prefix caches), a semantic-adaptive prefix cache eviction policy that addresses this gap through three innovations: (1) a multi-queue architecture that routes KV blocks to task-specific queues with tailored priority metrics, capturing both session reuse in multi-turn requests and structural reuse in templated single-turn requests; (2) a semantic-aware token weighting mechanism that learns the reuse value of different token types online through eviction feedback; and (3) a fully adaptive online learning schema for all parameter updates, including log-normal timing parameters, position decay power, queue weights, and meta-parameters, which eliminates manual tuning and enables automatic adaptation to deployment-specific workload characteristics. Through extensive evaluation across heterogeneous workloads, we demonstrate that SAECache achieves 1.4x-2.7x TTFT improvement over production-style baselines, while fixed-parameter alternatives can degrade by up to 2.7x under workload mismatch – a failure mode our adaptive approach avoids entirely.
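
To make the idea of type-aware eviction with feedback concrete, here is a toy sketch. It is not SAECache: the scoring rule (type weight divided by age), the ghost list of evicted keys, and the additive weight update are all assumptions chosen for brevity.

```python
class TypeWeightedCache:
    """Toy prefix-cache eviction: score = type weight / (1 + age)."""

    def __init__(self, capacity: int, lr: float = 1.0):
        self.capacity = capacity
        self.lr = lr          # hypothetical step size for weight updates
        self.blocks = {}      # key -> (token_type, last_access_time)
        self.weights = {}     # token_type -> learned weight (default 1.0)
        self.ghosts = {}      # evicted key -> token_type (unbounded here)
        self.clock = 0

    def _score(self, key):
        token_type, last_access = self.blocks[key]
        age = self.clock - last_access
        return self.weights.get(token_type, 1.0) / (1 + age)

    def access(self, key, token_type) -> bool:
        """Return True on a hit; on a miss, insert and evict if needed."""
        self.clock += 1
        if key in self.blocks:
            self.blocks[key] = (token_type, self.clock)
            return True
        if key in self.ghosts:
            # The block was evicted and then requested again:
            # treat this as feedback that its type deserves more weight.
            evicted_type = self.ghosts.pop(key)
            self.weights[evicted_type] = self.weights.get(evicted_type, 1.0) + self.lr
        if len(self.blocks) >= self.capacity:
            victim = min(self.blocks, key=self._score)
            self.ghosts[victim] = self.blocks.pop(victim)[0]
        self.blocks[key] = (token_type, self.clock)
        return False
```

With a capacity of two blocks, the access sequence `sys, u1, u2, sys, u3, u4` first evicts the system-prompt block, then raises the `system` weight when that block is requested again, so later evictions fall on user blocks even though plain LRU would evict `sys` at the last step.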

[LG-89] Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin

Link: https://arxiv.org/abs/2605.18823
Authors: Yongjie Fu, Qi Gao, Mahshid Ghasemi Dehkordi, Gil Zussman, Xuan Di
Categories: Machine Learning (cs.LG)

Abstract:Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system at urban intersections enabled by a tightly coupled physical-digital twin framework. Built upon the COSMOS city-scale wireless testbed in New York City, the proposed system integrates camera and ultra-wideband (UWB), edge-cloud computing, predictive trajectory modeling, and MQTT-based communication to deliver real-time safety alerts to vulnerable road users (VRUs). The system is evaluated through both field deployment and virtual reality (VR) experiments. Results demonstrate high warning generation accuracy, localization accuracy, efficient end-to-end latency under different model configurations, and significant reductions in user response time when warnings are issued. The proposed DT framework provides a scalable, modular, and generalizable solution for real-time multi-pedestrian safety enhancement at complex urban intersections.

[LG-90] Quantum Adversarial Machine Learning: From Classical Adaptations to Quantum-Native Methods

Link: https://arxiv.org/abs/2605.18821
Authors: Roozbeh Razavi-Far, Mohammad Meymani, Erfan Mahmoudinia, Dorsa Vazirzade, Peyman Paknezhad, Fateme Ghasemi, Saeed Saravani, Somayeh Nikkhoo, Kimia Haghjooei
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)

Abstract:Machine learning has revolutionized numerous industrial domains. Despite recent advances, machine learning models remain vulnerable to adversarial threats. Adversarial machine learning is a field that studies these vulnerabilities to build robust machine learning models. Quantum machine learning is an interdisciplinary field that bridges quantum computing and classical machine learning. While quantum machine learning shows potential to outperform classical machine learning in complex tasks such as regression, classification, and generative modeling, it remains vulnerable to adversarial attacks. Given the recent advancements in quantum computing and machine learning, the quantum adversarial machine learning field has emerged to study the vulnerabilities of quantum machine learning, possible attacks, and novel quantum-enhanced defense strategies. In this survey, we provide a detailed overview of quantum adversarial machine learning and explore the existing attacks and countermeasures. We also review the theoretical underpinnings of this area, emerging trends, and critical challenges.

[LG-91] Efficient Conditioning, Why Pseudo Observation Batch Bayesian Optimization Works, When It Does Not

Link: https://arxiv.org/abs/2605.18819
Authors: Kumbha Nagaswetha, Rabi Pathak
Categories: Machine Learning (cs.LG)

Abstract:Constant Liar (CL), Kriging Believer (KB), and fantasy models are widely used for batch selection in parallel Bayesian Optimization, yet a unified theory explaining their effectiveness and the conditions under which they fail has been lacking. We identify efficient conditioning as the key surrogate property: the ability to update predictions in closed form when data is augmented. We prove that Gaussian Processes satisfy this requirement, producing provably distinct batch points with separation of order l, and that this holds for any acquisition function monotonically non-decreasing in posterior uncertainty (EI, UCB, PI), with qualitatively similar behavior for Thompson Sampling. We unify CL, KB, and fantasy models as instances of a single conditioning mechanism differing only in the lie value distribution, and draw quantitative connections to Local Penalization (LP) and qualitative connections to Determinantal Point Processes (DPPs). To disentangle model structure from optimizer randomness, we introduce the Structural Diversity Diagnostic (SDD), a reusable methodology for testing surrogate compatibility. Experiments on Hartmann 6D, Ackley 8D, Levy 10D, and SVM hyperparameter tuning validate all theoretical predictions: the implicit penalty of CL or KB matches or outperforms explicit LP; greedy conditioning achieves convergence on par with joint qEI; efficient conditioning extends to Multiquadric RBF networks; and parametric surrogates produce degenerate batches even when fully retrained (random forests), while neural networks regain diversity only at 15x the wall-clock cost of GP conditioning. Robustness is confirmed across multiple initial datasets and under observation noise.
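
The conditioning mechanism behind Constant Liar can be sketched in a few lines of NumPy. The setup below is an assumption for illustration only: a zero-mean GP with a unit-variance RBF kernel, a lower-confidence-bound acquisition for minimization, the largest observed value as the lie, and a full refit per pick (a rank-one Cholesky update would be the cheaper closed-form route).

```python
import numpy as np


def rbf(a, b, lengthscale=0.2):
    """Unit-variance RBF kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def gp_posterior(x_train, y_train, x_test, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = rbf(x_train, x_train) + jitter * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v ** 2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def constant_liar_batch(x_train, y_train, grid, batch_size=3, beta=2.0):
    """Greedy batch selection: condition on a fake value after each pick."""
    lie = y_train.max()  # pessimistic lie for a minimization problem
    x_aug, y_aug = x_train.copy(), y_train.copy()
    batch = []
    for _ in range(batch_size):
        mean, std = gp_posterior(x_aug, y_aug, grid)
        pick = grid[np.argmin(mean - beta * std)]  # lower confidence bound
        batch.append(float(pick))
        # Pseudo-observation: uncertainty collapses around the pick,
        # so the next argmin lands somewhere else.
        x_aug = np.append(x_aug, pick)
        y_aug = np.append(y_aug, lie)
    return batch
```

For example, with observations of `(x - 0.3) ** 2` at `x = 0, 0.5, 1` and a 101-point grid on `[0, 1]`, the three selected points are distinct because each pseudo-observation removes the uncertainty bonus at the previous pick.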

[LG-92] Multi-Token Residual Prediction

Link: https://arxiv.org/abs/2605.18817
Authors: Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu
Categories: Machine Learning (cs.LG)

Abstract:Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone’s hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality–speed tradeoff; and speculative decoding, which verifies MRP’s proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales across reasoning and code generation benchmarks demonstrate up to 1.42x lossless speedup in SGLang.

[LG-93] DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training ICME

Link: https://arxiv.org/abs/2605.18815
Authors: Yuanqing Wang, Yuchen Zhang, Hao Lin, Junhao Hu, Chunyang Zhu, Quanlu Zhang, Boxun Li, Guohao Dai, Zhi Yang, Daning Cheng, Yunquan Zhang, Yu Wang
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: GitHub Repo: this https URL

Abstract:Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub-second, online reconfiguration across arbitrary multi-dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transitions into manageable geometric intersections. On top of VPS, a state routing-and-transition layer executes rank-local transfers under a memory-aware, deadlock-free schedule, and an Elastic Device Manager overlaps new-world construction with ongoing training to mask topology-change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state-of-the-art checkpoint-based and elastic systems by up to three orders of magnitude while preserving correctness.
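
The notion of reducing a layout transition to geometric intersections can be shown in one dimension. This sketch is a simplification made up for illustration, not DynaTrain's VPS: parameters are a flat index range, each layout is an even split into contiguous shards, and the function names are hypothetical.

```python
def shard_ranges(total: int, parts: int):
    """Split [0, total) into `parts` contiguous, near-equal half-open ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for rank in range(parts):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def transfer_plan(total: int, old_parts: int, new_parts: int):
    """List (src_rank, dst_rank, lo, hi) for every overlapping index range."""
    plan = []
    for src, (a0, a1) in enumerate(shard_ranges(total, old_parts)):
        for dst, (b0, b1) in enumerate(shard_ranges(total, new_parts)):
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:  # non-empty intersection means data must move
                plan.append((src, dst, lo, hi))
    return plan


print(transfer_plan(12, 2, 3))
# [(0, 0, 0, 4), (0, 1, 4, 6), (1, 1, 6, 8), (1, 2, 8, 12)]
```

Each tuple is a slice that the source rank already holds and the destination rank needs, so the plan is fully determined by the two layouts.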

[LG-94] How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

Link: https://arxiv.org/abs/2605.18814
Authors: Junwei Deng, Pingbang Hu, Suliang Jin, Hao Lu, Jiachen T. Wang, Shichang Zhang, Jiaqi W. Ma
Categories: Machine Learning (cs.LG)

Abstract:Trajectory-based data attribution methods estimate the influence of training samples on model predictions by unrolling the training trajectory. They are widely used in applications such as data selection, data valuation, and model diagnosis, but there is a lack of comprehensive error analysis of these methods, raising concerns about method faithfulness and hindering reliable deployment. In this work, we provide the first systematic analysis of error sources in trajectory-based data attribution, together with concrete remedies to mitigate them and practical guidelines for downstream use. We organize the total error into three categories, config-level, algorithm-level, and system-level. We make three contributions. First, we identify optimizer mismatch as the dominant config-level error: existing methods derive their attribution under the assumption of SGD, even for models trained with the modern de facto optimizer AdamW. We propose AdamW-influence to fully account for AdamW’s optimization dynamics, yielding improvements from 10% to over 300% in Spearman correlation between estimated and ground-truth influence across four settings spanning MLP, CNN, GPT-2, and Llama 3.2-1B. Second, we isolate the remaining algorithm-level error arising from the first-order Taylor approximation, identify the learning rate and trajectory length as factors governing the error magnitude, and derive a closed-form error proxy that can be evaluated along the original trajectory without retraining. Third, we translate these insights into practical guidelines for data selection by unifying offline and online strategies under a K-step look-ahead framework. Under this framework, online selection with a short horizon often matches or exceeds offline, and the optimal horizon can be tuned jointly with the learning rate. Together, these results turn the framework into an actionable selection recipe for practitioners.

[LG-95] Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis ICML2026

Link: https://arxiv.org/abs/2605.18798
Authors: Taiki Miyagawa, Akinori F. Ebihara
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: Accepted to ICML 2026. GitHub: this https URL

Abstract:We propose non-parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non-parametric estimators for the ARL and ADD, termed KM-ARL and KM-ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at this https URL, offering ready-to-use implementations for practitioners.
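
The survival-analysis analogy can be illustrated with a generic Kaplan-Meier estimate of the time-to-alarm distribution, treating sequences that end without an alarm as censored. This is a textbook sketch under that assumption, not the paper's KM-ARL or KM-ADD estimators, and the function names are hypothetical.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier survival steps.

    times[i] is the alarm time, or the sequence length if no alarm fired;
    observed[i] is True when an alarm fired (event), False when censored.
    """
    event_times = sorted({t for t, o in zip(times, observed) if o})
    steps, survival = [], 1.0
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, o in zip(times, observed) if o and u == t)
        survival *= 1.0 - events / at_risk
        steps.append((t, survival))
    return steps


def restricted_mean(steps, horizon):
    """Area under the survival step function on [0, horizon]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in steps:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


times = [3, 5, 5, 8, 10]
observed = [True, True, False, True, False]
steps = kaplan_meier(times, observed)
print(restricted_mean(steps, horizon=10))  # approximately 7.0
```

Averaging only the three observed alarm times would give about 5.33, which understates the run length because it ignores the two sequences that ended before any alarm.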

[LG-96] Goal-Oriented Lower-Tail Calibration of Gaussian Processes for Bayesian Optimization

Link: https://arxiv.org/abs/2605.20145
Authors: Aurélien Pion, Emmanuel Vazquez
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Bayesian optimization (BO) selects evaluation points for expensive black-box objectives using Gaussian process (GP) predictive distributions. Kernel choice and hyperparameter selection can lead to miscalibrated predictive distributions and an inappropriate exploration-exploitation trade-off. For minimization, sampling criteria such as expected improvement (EI) depend on the predictive distribution below the current best value, so lower-tail miscalibration directly affects the sampling decision. This article studies goal-oriented calibration of GP predictive distributions below a low threshold $t$ in the noiseless setting, for standard GP models with hyperparameters selected by maximum likelihood. A framework for predictive reliability below $t$ is introduced, based on two notions of spatial calibration: occurrence calibration over the design space and thresholded $\mu$-calibration on sublevel sets of the form $\{x \in \mathbb{X},\ f(x) \le t\}$. Building on this framework, we propose tcGP, a post-hoc method that calibrates GP predictive distributions below $t$, and we show that the resulting EI-based global optimization algorithm remains dense in the design space. Experiments on standard benchmarks show improved lower-tail calibration and BO performance relative to standard GP models and globally calibrated GP models.
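
The claim that EI depends only on the predictive distribution below the current best value follows from a standard identity, written out here as a short worked equation. The notation is assumed for illustration and is not taken from the paper: $m_n$ is the current best observed value, $F_x$ the predictive CDF at $x$, and $\mu_n(x)$, $\sigma_n(x)$ the GP posterior mean and standard deviation.

```latex
\mathrm{EI}_n(x)
  = \mathbb{E}\big[(m_n - Y_x)_+\big]
  = \int_{-\infty}^{m_n} F_x(u)\,\mathrm{d}u ,
\qquad\text{since } (m_n - y)_+ = \int_{-\infty}^{m_n} \mathbf{1}\{y \le u\}\,\mathrm{d}u .

\text{For a Gaussian predictive distribution, with }
z = \frac{m_n - \mu_n(x)}{\sigma_n(x)}:
\quad
\mathrm{EI}_n(x) = \big(m_n - \mu_n(x)\big)\,\Phi(z) + \sigma_n(x)\,\varphi(z).
```

The integral form uses $F_x$ only on $(-\infty, m_n]$, so any miscalibration of the lower tail feeds directly into the criterion.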

[LG-97] FiLark: a streaming-first software framework for end-to-end exploration, annotation, and algorithm integration in distributed acoustic sensing

Link: https://arxiv.org/abs/2605.20132
Authors: Jintao Li, Weichang Li, Kai Tong, Xaingyu Guo
Categories: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Distributed acoustic sensing (DAS) systems generate continuous, ultra-high-channel-count data streams at rates that exceed the capabilities of conventional batch-oriented analysis frameworks. As a result, essential tasks such as interactive exploration of long-duration recordings, scalable event annotation, and real-time algorithm-in-the-loop monitoring remain inadequately supported by workflows built around manually selected data segments and offline processing. This paper presents FiLark (Fiber Lark), a Python framework that applies a streaming-first principle uniformly across data access, signal processing, visualization and monitoring for DAS. Instead of operating on manually selected data segments, FiLark presents any DAS source, including continuous multi-file recordings, as a unified stream and builds all system components around that abstraction. An OpenGL-based ring-buffer renderer enables interactive browsing and visualization of arbitrarily long recordings with constant memory usage. An integrated annotation interface supports event labeling directly within continuous data streams, facilitating the creation of reproducible machine-learning-ready labeled datasets without offline preprocessing. The signal processing library includes temporal, spatial, spectral, and decomposition-based operators, with both CPU implementations and GPU-accelerated variants via PyTorch, alongside stateful chunked execution that preserves processing continuity and application semantics across segment boundaries. A standardized monitor interface further integrates streaming detectors and learning-based models into the visualization workflow. By sharing a common streaming abstraction across all layers, FiLark allows processing configurations and workflows developed interactively to transfer directly to scalable production pipelines without modification.

[LG-98] Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

Link: https://arxiv.org/abs/2605.20122
Authors: Peter Matthew Jacobs, Jeff M. Phillips
Categories: Machine Learning (stat.ML); Computational Complexity (cs.CC); Machine Learning (cs.LG)

Abstract:Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower-dimensional Euclidean space problems ($d \in \{2,3\}$), algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $\epsilon$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular Cartesian grid sketch of the samples. We show that (especially under $\alpha$-Hölder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure, which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $\epsilon$ error in $\epsilon^{-\max\left(2, \frac{d+1+o(1)}{1+\alpha}\right)}$ time for $0 < \alpha < 1$ Hölder smooth distributions $P, Q$ on $(0,1)^d$; an optimal $\Theta(\epsilon^{-2})$ for $\alpha > 1/2$ when $d = 2$ and nearly optimal as $\alpha \to 1$ when $d = 3$.

[LG-99] Tail Annealing for Heavy-Tailed Flow Matching

Link: https://arxiv.org/abs/2605.20068
Authors: Jean Pachebat
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 18 pages

Abstract:Standard generative models struggle with heavy-tailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ coordinate-wise to data before training, then exponentiate samples after generation. A Hill diagnostic decides per-coordinate whether to transform, leaving light-tailed margins untouched at no added complexity. This compresses heavy tails into a range where standard flow matching succeeds, without heavy-tailed base distributions or architectural modifications. We provide theoretical intuition for why this works: the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations. On a 144-configuration multivariate benchmark (3 copulas, $d$ up to 100, 4 tail indices), Log-FM dominates specialized baselines on $W_1$, $\mathrm{CVaR}_{99}$, and extreme-quantile metrics, and is the only method with zero severe divergences across 2,880 runs.
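
The soft-log transform and its inverse are simple enough to write down directly. The sketch below follows the formula in the abstract; the function names are hypothetical, and the Hill diagnostic and the flow-matching model itself are not shown.

```python
import numpy as np


def soft_log(x):
    """phi(x) = sign(x) * log(1 + |x|), applied coordinate-wise."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))


def soft_log_inverse(y):
    """Inverse map: sign(y) * (exp(|y|) - 1), applied to generated samples."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.expm1(np.abs(y))


x = np.array([-1e6, -1.0, 0.0, 2.5, 1e9])
y = soft_log(x)                      # values spanning 15 orders of magnitude
x_back = soft_log_inverse(y)         # are compressed to roughly [-14, 21]
print(np.allclose(x_back, x))        # True
```

The effect on tails is easy to check by hand: if $P(X > x) = x^{-\alpha}$ for large $x$, then $Y = \log(1 + X)$ has $P(Y > y) = (e^{y} - 1)^{-\alpha} \approx e^{-\alpha y}$, an exponential tail.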

[LG-100] Variance-Reduced Manifold Sampling via Polynomial-Maximization Density Estimation

Link: https://arxiv.org/abs/2605.19938
Authors: Serhii Zabolotnii
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 15 pages, 5 figures, 3 tables. Code supplement: this https URL

Abstract:Uniform sampling on implicitly defined manifolds is a core primitive in motion planning, constrained simulation, and probabilistic machine learning. MASEM addresses this problem by entropy-maximizing resampling, but its resampling weights depend on a local k-nearest-neighbour density estimate whose errors can be amplified by aggressive resampling temperatures. We ask whether a polynomial-maximization moment estimator can replace the plug-in density rule without changing the surrounding MASEM architecture. The proposed PMM-MASEM module computes shell spacings from nested k-nearest-neighbour radii, estimates their standardized cumulants, and uses a gated PMM2/PMM3 estimator only when the spacing distribution departs from the flat Exp(1) regime; otherwise it falls back to the plug-in/MLE rule. This fallback is essential: on a flat homogeneous manifold the plug-in estimator is already the MLE, so PMM should not outperform it. A local Known-DGP Monte Carlo experiment confirms this gate: the selector returns MLE on flat Exp(1) spacings and reduces density MSE by 22–36% on asymmetric gamma and boundary-spacing regimes. The evidence is not uniformly positive: PMM3 worsens a platykurtic uniform spacing law, and a lightweight resampling-proxy experiment improves seven-lobes coverage but degrades the sine and swiss-roll proxies. The current evidence therefore supports an applicability-boundary result rather than a general MASEM improvement claim.

[LG-101] Probabilistic Multivariate Time Series Forecasting with Diffusion Copulas ICLR2026

Link: https://arxiv.org/abs/2605.19685
Authors: David Huk, Dongshan Wang, Miha Bresar
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: ICLR 2026 Workshop Advances in Financial AI

Abstract:Accurately assessing financial risk requires capturing both individual asset volatility and the complex, asymmetric dependence structures that emerge during extreme market events. While modern diffusion-based models have advanced multivariate forecasting, they often suffer from a “normality bias” when trained end-to-end, sacrificing marginal calibration for joint coherence and consistently underestimating tail risk. To address this, we propose a Diffusion-Copula framework that explicitly decouples the learning of marginal distributions from their dependence structure. We employ deep Mixture Density Networks to capture heavy-tailed asset dynamics, followed by a Classification-Diffusion Copula to model the joint dependence. Applied to cryptocurrency markets, our approach demonstrates superior performance over state-of-the-art baselines in forecasting systemic extremes of both marginal and joint events. Crucially, we demonstrate that while baseline models classify simultaneous market crashes as statistically impossible “Black Swans” (high surprise), our framework identifies them as “Expected Crashes” (low surprise), successfully preserving the correlation structure necessary for robust risk management during contagion events.

[LG-102] Convergence of Consensus-Based Particle Methods for Nonconvex Bi-Level Optimization

Link: https://arxiv.org/abs/2605.19667
Authors: Yutong Chao, Xudong Sun, Konstantin Riedl, Majid Khadiv, Jalal Etesami
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract:In this paper, we study a consensus-based optimization method for nonconvex bi-level optimization, where the objective is to minimize an upper-level function over the set of global minimizers of a lower-level problem. The proposed approach is derivative-free, and constructs its consensus point via smooth quantile selection combined with a Gibbs-type Laplace approximation. We establish convergence guarantees for both the associated mean-field dynamics and its finite-particle approximation. In particular, under suitable assumptions on smooth quantile localization, error bounds, and stability, we show that the mean-field law reaches any prescribed Wasserstein neighborhood of the target bi-level solution with an explicit exponential rate up to the hitting time. Numerical experiments on a two-dimensional constrained problem and neural network training further support the theoretical results.

[LG-103] Cross-View Attention Fusion Net: A Prior-Guided Dual-View Representation Learning for Cardiac Output Estimation from Short-Term PPG Signals

Link: https://arxiv.org/abs/2605.19666
Authors: Yaowen Zhang, Bo Cui, Libera Fresiello, Peter H. Veltink, Dirk W. Donker, Ying Wang
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG)

Abstract:Accurate cardiac output (CO) estimation from photoplethysmography (PPG) is promising for unobtrusive hemodynamic monitoring, but remains difficult since CO is jointly determined by cardiac function and vascular tone. Conventional feature-based models use physiologically meaningful PPG descriptors, yet depend on accurate pulse detection and may miss latent temporal relationships. In contrast, fully end-to-end deep learning models learn directly from raw PPG but often underuse established PPG-derived prior information. Here, we introduce the Cross-View Attention Fusion Network (CVAF-Net), a prior-guided dual-view deep learning model for CO estimation from short, fixed-length PPG segments. CVAF-Net processes raw PPG as a temporal view and a feature sequence map (FSM) as a structured prior-guided view, and fuses the two representations through cross-view attention. The model was independently evaluated using 5-, 15-, and 30-s segments from three datasets: simulated pulse waves (3323 subjects), vasoconstriction provocation (79 subjects), and resting/cycling activities (10 subjects), and was compared with multiple machine learning and deep learning benchmarks. CVAF-Net outperformed most benchmark methods and achieved performance comparable to a state-of-the-art Transformer-based model, with a mean absolute error (MAE) of 0.19 L/min (MAPE: 3.95%) on simulated data and high accuracy in real-world settings (minimum MAE: 1.20 L/min). Importantly, CVAF-Net reduced FLOPs twelvefold compared with the leading Transformer-based model. Plausibility analysis showed physiologically consistent CO estimates, with expected correlations with age ($\rho = -0.274$), heart rate ($\rho = 0.894$), and systemic vascular resistance ($\rho = -0.740$). These findings indicate that CVAF-Net provides an accurate, computationally efficient, and generalizable approach for continuous wearable-based CO monitoring.

[LG-104] BCI-sift: An automated feature selection toolbox for Brain Computer Interface applications

Link: https://arxiv.org/abs/2605.19646
Authors: Elena C Offenberg, Dirk Keller, Mariska J Vansteensel, Zachary V Freudenburg, Nick F Ramsey, Julia Berezutskaya
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 19 pages, 12 figures

Abstract:Advancements in clinical Brain-Computer Interfaces (BCIs) depend on precise and reliable signal interpretation. However, the high-dimensional and noisy nature of data captured from both implanted and non-implanted BCIs poses significant challenges, motivating the use of feature selection algorithms. We introduce BCI-sift (BCI Systematic and Interpretable Feature Tuning), a Python-based toolbox designed to streamline the application of diverse optimization algorithms to BCI datasets for identifying the most relevant features in machine learning tasks. Our scikit-learn-compatible toolbox (this http URL) simplifies feature selection in BCI tasks by integrating advanced optimization methods. We validated the toolbox on high-density electrocorticography (HD ECoG) data from eight able-bodied participants with 64-128 electrodes implanted over the sensorimotor cortex, who repeatedly spoke 12 words. BCI-sift identified informative neural features across electrode, temporal, and frequency dimensions. The anatomical locations of electrode selections were consistent across participants and aligned with known functional organization of the sensorimotor cortex. Relevant time points clustered around speech production, and the high-frequency band was identified as most informative, in line with prior work. Feature selection improved classification accuracy compared to using all features. BCI-sift provides an accessible and versatile platform for feature selection in BCI research, enabling improved decoding performance, automated feature analysis, and enhanced interpretability. While validated on HD ECoG data, the approach is broadly applicable to other BCI modalities. By enhancing classification accuracy and interpretability, BCI-sift addresses key challenges in developing efficient and transparent BCI systems.

[LG-105] Increasing Missingness to Reduce Bias: Richardson-SGD with Missing Data

Link: https://arxiv.org/abs/2605.19641
Authors: Ferdinand Genans (SU, LPSM), Erwan Scornet (SU, LPSM)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Stochastic gradient methods are central to modern large-scale learning, but their use with incomplete covariates remains delicate since imputation schemes generally introduce systematic gradient biases, as shown for linear models. In this work, we prove that all parametric models exhibit similar gradient bias for various imputation procedures and characterize exactly the dependence on the missingness ratio vector $p$, with $O(|p|)$ as the leading term. We exploit this analysis to propose a simple debiasing procedure for stochastic gradient descent (SGD) with missing values based on Richardson extrapolation, which leverages the exact expression of the gradient bias. The key idea is to deliberately add missingness: from an already incomplete observation, we generate a further-thinned version at a higher, controlled missingness level, and combine the two resulting stochastic gradients to cancel the leading bias term. We prove that one Richardson step reduces the gradient bias from $O(|p|)$ to $O(|p|^2)$ under several missingness scenarios. Our proposed method is computationally efficient, model-agnostic and applies to any parametric loss whose stochastic gradient can be computed after imputation. Furthermore, when missing indicators are independent, the population gradient bias is a multilinear polynomial in $p$ and depends only on population gradient errors induced by declaring a single coordinate missing. In this case, our method generalizes to a multi-step Richardson procedure which recursively cancels higher-order terms. Empirically, Richardson debiasing improves optimization and estimation across several generalized linear models and combines positively with widely used imputation procedures such as MICE. These results suggest that, somewhat counter-intuitively, adding controlled missingness on top of existing missing data can make stochastic learning from incomplete data more accurate.
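
A one-dimensional toy case shows how combining gradients at two missingness levels can cancel bias. The setup is an assumption for illustration, not the paper's procedure: the loss is $\tfrac12(\theta - x)^2$, missing values are zero-imputed, the missingness rate `p` is treated as known, and extra thinning with rate `q` gives the higher level $p' = 1 - (1-p)(1-q)$. Here the bias is exactly linear in $p$, so a single Richardson step removes it entirely; in general only the leading term is cancelled.

```python
import numpy as np


def richardson_sgd_mean(x, p, q, seed=0):
    """Run SGD with naive and Richardson-combined gradients side by side."""
    rng = np.random.default_rng(seed)
    p_thin = 1.0 - (1.0 - p) * (1.0 - q)   # missingness after extra thinning
    theta_naive, theta_rich = 0.0, 0.0
    for t, x_t in enumerate(x, start=1):
        observed = rng.random() >= p                    # original missingness
        kept = observed and (rng.random() >= q)         # deliberately thinned copy
        x_imp = x_t if observed else 0.0                # zero imputation
        x_thin = x_t if kept else 0.0
        g_naive = theta_naive - x_imp
        g1 = theta_rich - x_imp                         # gradient at level p
        g2 = theta_rich - x_thin                        # gradient at level p_thin
        # Extrapolate the two gradients to missingness level 0.
        g_rich = (p_thin * g1 - p * g2) / (p_thin - p)
        theta_naive -= g_naive / t                      # step size 1/t
        theta_rich -= g_rich / t
    return theta_naive, theta_rich


data = np.random.default_rng(1).normal(2.0, 1.0, size=50_000)
naive, debiased = richardson_sgd_mean(data, p=0.2, q=0.5)
print(naive, debiased)  # naive settles near (1 - p) * 2 = 1.6, debiased near 2.0
```

The price of the correction is variance: the combined gradient weights the two noisy gradients by `p_thin / (p_thin - p)` and `p / (p_thin - p)`, so a thinning level too close to `p` amplifies noise.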

[LG-106] Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation

Link: https://arxiv.org/abs/2605.19629
Authors: Ilya Levin, Maksim Shuklin, Eric Moulines, Paul Mangold, Sergey Samsonov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:In this paper, we establish Berry-Esseen-type bounds for federated linear stochastic approximation (LSA). Our results provide the first federated Gaussian approximations for LSA that explicitly capture communication-computation trade-offs and heterogeneity-aware error terms, quantifying the effects of local step size, number of local updates, and heterogeneity on convergence rates. We present results for both (i) a constant step size regime and (ii) a decreasing step size with an increasing number of local iterations, recovering the recent rates of Bonnerjee et al. [2025] as a special case. As a primary application of our results, we develop an online multiplier bootstrap procedure for inference on the last iterate, which avoids explicit estimation of the asymptotic covariance matrix, and obtain non-asymptotic validity guarantees for this procedure.

[LG-107] Diffusion Graph Posterior Sampling for Nonlinear Inverse Problems with Application to Electrical Impedance Tomography

Link: https://arxiv.org/abs/2605.19621
Authors: Giovanni S. Alberti, Damiana Lazzaro, Serena Morigi, Matteo Santacesaria, Shibo Wang
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Deep generative models have emerged as state-of-the-art for solving inverse problems, but applying them to inverse problems for PDEs, such as electrical impedance tomography (EIT), remains challenging. Because physical domains are naturally discretized as unstructured meshes rather than regular grids, standard convolutional architectures are often inadequate. In this paper, we propose a novel framework that extends diffusion posterior sampling (DPS) to graph-structured data. We develop an unconditional score-based diffusion model directly on a 2D triangular mesh to learn an accurate prior over the physical solution space. Furthermore, we introduce a regularized variant, RDPS, which incorporates explicit regularization terms, such as total variation and generalized Tikhonov, to complement the implicit diffusion prior and mitigate severe ill-posedness. Extensive experiments on synthetic and real 2D EIT datasets demonstrate that RDPS produces stable, physically plausible reconstructions. Our approach generalizes well to out-of-distribution inclusion geometries, is highly robust to measurement noise, and outperforms current state-of-the-art solvers (e.g., GPnP-BM3D, DP-SGS) in reconstruction accuracy and artifact reduction.

[LG-108] Posterior Contraction of Lévy Adaptive B-spline Regression in Besov Spaces

Link: https://arxiv.org/abs/2605.19610
Authors: Jeunghun Oh, Sewon Park, Jaeyong Lee
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:We investigate the asymptotic properties of the Lévy Adaptive B-spline (LABS) regression model, a Bayesian nonparametric method that incorporates B-spline kernels into the Lévy Adaptive Regression Kernel (LARK) model. LABS applies splines of varying degrees with independently defined knots, yielding a flexible model class capable of adapting to irregular and locally structured features of the true function. Within the nonparametric regression framework with univariate random design and Gaussian errors, we establish that the LABS posterior contracts around the true function in Besov classes at nearly minimax-optimal rates, up to a logarithmic factor, while adapting automatically to unknown smoothness. This study contributes to filling a gap in the literature, where theoretical results on posterior contraction of the LARK model in Besov spaces remain scarce. Simulation experiments on standard test functions in Besov spaces, including Blocks, Bumps, HeaviSine, and Doppler, complement the theoretical results and demonstrate the practical utility of LABS.

[LG-109] HiLiftAeroML: High-Fidelity Computational Fluid Dynamics Dataset for High-Lift Aircraft Aerodynamics

Link: https://arxiv.org/abs/2605.19565
Authors: Neil Ashton, Adam Clark, Liam Heidt, Christopher Ivey, Sanjeeb Bose, Rahul Agrawal, Konrad Goc, Rishi Ranade, Corey Adams, Peter Sharpe, Sheel Nidhan, Semit Akkurt, Daniel Leibovici, Jean Kossaifi
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:

Abstract:This paper describes the first-ever open-source high-fidelity CFD dataset of a high-lift aircraft for the purpose of AI surrogate model development. The dataset is composed of 1800 samples, arising from 180 geometry variants and 10 angles of attack for the high-lift NASA Common Research Model (CRM) geometry, used within the AIAA High-Lift Prediction Workshop series. One of the novelties of this dataset is the use of a GPU-accelerated, high-fidelity, explicit, wall-modeled LES approach for each simulation, using solution-adapted grids between 300M and 500M cells. This ensures the greatest possible accuracy given known challenges in steady-state RANS approaches for these portions of the flight envelope. The entire dataset (geometries, time-averaged volume and surface variables, and integral forces) is available free of charge under a permissive open-source license (CC-BY-4.0). By making this data publicly available, we aim to accelerate the research and development of AI surrogate modeling within the aerospace industry.

[LG-110] Density-Ratio Losses for Post-Hoc Learning to Defer

Link: https://arxiv.org/abs/2605.19557
Authors: Alexander Soen, Ragnar Thobaben, Joakim Jaldén, Richard Nock
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint

Abstract:We study post-hoc Learning to Defer (L2D) through the lens of ideal distributions: divergence-regularized reweightings of the data distribution under which a model attains low loss. We define deferral via the density-ratio between a model’s and an expert’s ideals. Using the reduction from density-ratio estimation to class-probability estimation, we derive the DR CPE losses for post-hoc L2D scorers. Deferral decisions are then made by thresholding the scorer, allowing deferral rates to be adjusted without retraining. For KL-based ideal distributions, our deferral rules recover Chow’s rule under the original distribution and a connection to an expert-tilted Bayes posterior, which incorporates the expert’s performance, depending on whether the ideal distributions are joint or marginal distributions. Experimentally, our approach is competitive with common baselines and more robust across dataset settings. More broadly, our results cast post-hoc L2D as density-ratio learning between ideal distributions, bridging Chow-style rules and expert comparison, and elucidating connections to related learning settings including anomaly detection.

[LG-111] Tweedie's Formulae and Diffusion Generative Models Beyond Gaussian

Link: https://arxiv.org/abs/2605.19391
Authors: Wenpin Tang, Nizar Touzi, Zikun Zhang, Xun Yu Zhou
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 27 pages, 18 figures

Abstract:Diffusion models have achieved remarkable success in generating samples from unknown data distributions. Most popular stochastic differential equation-based diffusion models perturb the target distribution by adding Gaussian noise, transforming it into a simple prior, and then use denoising score matching, a consequence of Tweedie’s formula, to learn the score function and generate clean samples from noise. However, non-Gaussian diffusion models with state-dependent diffusion coefficient have been largely underexplored, as have the corresponding Tweedie’s formulae. In this work, we extend Tweedie’s formula to important non-Gaussian processes, including geometric Brownian motion (GBM), squared Bessel (BESQ) processes, and Cox-Ingersoll-Ross (CIR) processes, thereby yielding the corresponding denoising score-matching objectives. We then apply the derived formulae to image and financial time series generation using GBM- and CIR-based diffusion models, and to empirical Bayes estimation under the BESQ setting. The reported experimental results demonstrate the potential of non-Gaussian models.
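
For reference, the classical Gaussian case that this abstract generalizes can be written out as follows. This is the standard textbook statement, not the paper's new result for GBM, BESQ, or CIR processes; the symbols x_0, x_t, \sigma_t, p_t, and s_\theta are notation chosen here for illustration.

```latex
% Gaussian perturbation: x_t = x_0 + \sigma_t \varepsilon, with \varepsilon \sim \mathcal{N}(0, I)
% Tweedie's formula: the posterior mean of the clean sample given the noisy one
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \, \nabla_{x_t} \log p_t(x_t)

% Denoising score matching fits s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t) by minimizing
\mathbb{E}_{x_0, \varepsilon} \left\| s_\theta(x_t, t) + \frac{\varepsilon}{\sigma_t} \right\|^2
```

The second line follows because the conditional score of the Gaussian kernel is \nabla_{x_t} \log p(x_t \mid x_0) = -(x_t - x_0)/\sigma_t^2 = -\varepsilon/\sigma_t.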

[LG-112] A Unified Framework for Structure-Aware Clustering and Heterogeneous Causal Graph Learning

Link: https://arxiv.org/abs/2605.19313
Authors: Honglin Du, Muxuan Liang, Xiang Zhong
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:In complex multivariate systems, interactions among variables are defined by dependency structures, often encoded as directed acyclic graphs (DAGs). However, dependency structures can vary across subjects, and ignoring this structural heterogeneity introduces bias and obscures subpopulation-specific dependencies. To address this, we propose Directed Acyclic Graph-based Dependency Clustering via Alternating Direction Method of Multipliers (DAG-DC-ADMM), a unified framework built upon Structural Equation Modeling (SEM) that jointly learns cluster assignments and cluster-specific dependency structures. We encode acyclicity via a smooth constraint and integrate a groupwise truncated Lasso fusion penalty (gTLP) to cluster subjects based on their structural similarity. This yields a nonconvex optimization problem that incorporates sparsity, acyclicity, and structural consensus constraints. We address the nonconvexity by using the augmented Lagrangian method and solve it with an adapted version of the Alternating Direction Method of Multipliers (ADMM) for difference-of-convex programs. For certain graph structures, such as upper triangular adjacency matrices, our algorithm is guaranteed to converge to a Karush-Kuhn-Tucker (KKT) point. Experiments demonstrate that our method recovers cluster-specific causal dependency structures with a high true positive rate and a low false discovery rate. This capability enables the robust discovery of heterogeneous dependencies across subjects where the subpopulation label is unknown.

[LG-113] Factor Augmented High-Dimensional SGD

Link: https://arxiv.org/abs/2605.19291
Authors: Shubo Li, Yuefeng Han, Xiufan Yu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Stochastic gradient descent (SGD) is a fundamental optimization algorithm widely used in modern machine learning. In this paper, we propose Factor-Augmented SGD (FSGD), a new optimization method that leverages latent factor representations in high-dimensional learning tasks. Unlike standard two-stage dimension reduction approaches that rely on offline representation learning and full data storage, a key novelty of FSGD is that it operates purely on streaming data, making it scalable to large-scale and high-dimensional problems. Furthermore, we establish the first theoretical framework that explicitly incorporates latent factor estimation error into the analysis of SGD, and provide moment convergence in $\ell^s$ norm under decaying step sizes and mini-batch updates. Our results provide a new foundation for employing SGD reliably and scalably in high-dimensional machine learning systems.

[LG-114] Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks

Link: https://arxiv.org/abs/2605.19278
Authors: Rylan Wade
Categories: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
Comments:

Abstract:This paper tests whether graph neural networks improve realized volatility forecasts and whether those forecasts improve portfolio performance. Using weekly realized volatility for 465 S&P 500 equities from 2015–2025, Heterogeneous Autoregressive and Long Short-Term Memory baselines are compared against GraphSAGE models built on rolling correlation, sector, and Granger-causal graphs, with and without macro regime features. The empirical finding is that the model with the lowest forecast MSE, the model with the highest cross-sectional ranking accuracy, and the model with the highest portfolio Sharpe ratio are three different models. Forecast accuracy, ranking quality, and portfolio performance are related but not interchangeable objectives. Graph volatility models add value only when the portfolio rule can exploit the cross-sectional structure they encode.

[LG-115] Precision Physical Activity Prescription via Reinforcement Learning for Functional Actions

Link: https://arxiv.org/abs/2605.19208
Authors: Gefei Lin, Rui Miao, Jennifer Sacheck, Xiaoke Zhang
Categories: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Physical activity (PA) plays an important role in maintaining and improving health. Daily steps have been a key PA measure that is easily accessible with common wearable devices. However, methods are lacking to recommend a personalized optimal distribution of daily steps over a period of time that optimizes certain health biomarkers. In this paper, we fill this void based on data from the All of Us Research Program, which includes months of step counts as well as repeated measurements of key health biomarkers. We develop a new offline reinforcement learning (RL) algorithm to learn personalized and optimal PA distributions associated with cardiometabolic risk, where the action is a function representing the daily step distribution over a period of time. Simulation studies demonstrate the advantage of the proposed approach over existing continuous-action RL methods. The learned optimal policy from the All of Us data generally suggests people take more daily steps and also follow a more consistent pattern of PA over time, while offering tailored recommendations for subgroups defined by blood glucose level, body mass index, blood pressure, age, and sex.

[LG-116] A Cloud-Based Tool for Meteorite Recovery Using Drones and Machine Learning

Link: https://arxiv.org/abs/2605.19179
Authors: Seamus L. Anderson, Hadrien A. R. Devillepoix, Lewis Lakerink, Sawitchaya Tippaya, Dale P. Giancono, Martin C. Towner, Iona Clemente, Martin Cupák, Ashley F. Rogers, John H. Fairweather, Mia Walker, Daniel Burgin, Michael A. Frazer, Sophie E. Deam, Veronika Pazderová, Eleanor K. Sansom, Benjamin A. D. Hartig, Hely C. Branco, Thomas Stevenson, Isabella Hatty, Anna Zappatini, Anthony Lagain, Tom Lovelock, Auriane Egal, Lucy Forman, David Belton, Simon Windsor, Shibli Saleheen, Asher Leslie, Gregory B. Poole, Andrew Langendam, Rachel S. Kirby, Andrew G. Tomkins
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 23 pages, 3 figures

Abstract:We present a cloud-based tool that uses drones and machine learning to help recover instrumentally observed meteorite falls. We showcase a collection of improvements made upon previous iterations of our system, as well as detail the successes and limitations of this technique when applied to observed meteorite falls in South and Western Australia. This tool is available to the meteoritics research community upon request at this https URL.

[LG-117] Activation Functions Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

Link: https://arxiv.org/abs/2605.19178
Authors: Giovanni di Sarra, Yasser Roudi
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 38 pages, 27 figures

Abstract:The great success of neural networks in recognizing hidden patterns and correlations in complex data lies in the way they take advantage of the large number of parameters and nonlinear single-unit activation, jointly. Restricted Boltzmann Machines (RBMs) provide a simple yet powerful framework for studying the impact of activation nonlinearities on performance and representation. In this work, we exploit the duality between RBMs and models of interacting binary variables to study the statistics of the interactions induced by RBM ensembles with different hidden unit activation functions. We characterize the space of representable models analytically in terms of moments of the distribution of induced interactions for four commonly used activation functions: Linear, Step, ReLU, and Exponential. Quantitative predictions of the analytical calculations on learning show a very good agreement with results of the simulations of the training process. In particular, our analysis shows that there are certain data structures, namely those generated by models of interacting variables with large interaction terms beyond pairwise, that are difficult to represent, and thus to learn, for any RBM. Yet, we find that rapidly increasing nonlinearities, such as the Exponential function, can facilitate the representation and learning of such data structures for a specific range of parameters that is determined analytically.

[LG-118] Reducing Diffusion Model Memorization with Higher Order Langevin Dynamics

Link: https://arxiv.org/abs/2605.19170
Authors: Benjamin Sterling, Mónica F. Bugallo, Tom Tirer
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Diffusion/score-based models have emerged as powerful generative models, capable of generating high-quality samples that mimic the training data distribution. However, it has been observed that they are prone to reproducing training samples, a phenomenon known as “memorization,” potentially violating copyright and privacy. In this paper, we study the effect of Higher-Order Langevin Dynamics (HOLD) on this phenomenon. HOLD diffusion processes introduce auxiliary variables; if the data variable is interpreted as “position,” then the auxiliary variables can be interpreted as “velocity” and “acceleration,” depending on the chosen order of the model. They were originally proposed based on the intuition that they regularize the trajectories of the data variable by implicitly imposing additional dynamical constraints. Our work provides, to our knowledge, the first theoretical characterization of the regularization effect of HOLD. Specifically, we show that in HOLD, the dynamics of the data variable are governed by a low-pass-filtered version of the learned score function, with smoothness increasing with the order of HOLD. We then analyze the optimal empirical score and the possibility of distribution collapse. Together, our results explain the mitigation of memorization as the model order increases. Finally, we present an empirical study on real-world data that supports our theory and highlights this distinct advantage of HOLD over standard diffusion in practice.

[LG-119] Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration

Link: https://arxiv.org/abs/2605.19152
Authors: Rahul Uma Ramachandran, Serge Massar
Categories: Machine Learning (stat.ML); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optics (physics.optics)
Comments:

Abstract:Physical computing systems provide a promising route toward hardware-native machine learning, but their computational capabilities remain difficult to characterize in a principled, task-independent, and data-efficient way. We extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and establish several fundamental results: individual capacities are bounded between zero and one, their sum over a complete basis is bounded by the number of readouts, and noise strictly reduces this bound. We address the finite-sample estimation of IPC and derive the asymptotic form of the systematic positive bias affecting naive estimators. Building on these results, we introduce data-efficient estimation methods based on Richardson extrapolation and Sobol quasi-random sampling. We validate the framework experimentally using a photonic computing system based on picosecond laser pulses propagating through a nonlinear optical fibre. By varying the laser power and fibre length, we observe systematic shifts of the IPC distribution toward higher-order nonlinear capacities induced by the Kerr effect. Finally, we demonstrate that the total IPC strongly correlates with performance on benchmark machine-learning tasks and provides a reliable estimate of the effective dimensionality of the system. These results establish IPC as a practical bridge between the intrinsic dynamics of physical computing systems and their machine-learning performance.

[LG-120] Atomistic Modeling of Chemical Disorder in Materials: Bridging Classical Methods and AI-Assisted Approaches

Link: https://arxiv.org/abs/2605.19124
Authors: Jiayu Peng, Peichen Zhong
Categories: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:

Abstract:Chemical disorder, originating from the mixed occupation of crystallographic sites by multiple elements, is widespread in alloys, ceramics, and compositionally complex materials, where short- and long-range orderings can strongly influence properties. A central obstacle is the representation gap between experiments and simulations: experiments often report disorder as partial occupancies and ensemble-averaged behaviors, whereas atomistic simulations and AI workflows usually require fully specified configurations. Tackling this gap requires computational methods that convert averaged disorder descriptions into representative configurational ensembles while balancing cost, bias, and fidelity. This challenge has become more urgent in AI-driven computational discovery, where ignoring disorder may cause AI workflows to misrank stability, misjudge novelty, and misdirect experiments with too-idealized representations. This Review highlights how classical and AI-driven methods can bridge this representation gap. We assess the strengths and limitations of approaches spanning mean-field theories, cluster expansion, quasi-random approximations, Monte Carlo, and emerging schemes powered by universal interatomic potentials and generative models. We further highlight how AI can accelerate classical computational schemes by lowering the cost of microstate evaluation, configurational exploration, and atomistic-to-thermodynamic closure. We also emphasize how AI can enable disorder-native capabilities, including workflow triage, ordering-sensitive and alchemical representations, generative models of disordered structures and distributions, and kinetics-aware disorder prediction. Together, this framework outlines a practical roadmap toward disorder-native AI, which can transform chemical disorder from a representational obstacle into a controllable variable for realistic AI-accelerated materials discovery.

[LG-121] Dual-Channel Tensor Neural Networks: Finite-Sample Theory and Conformal Structure Selection

Link: https://arxiv.org/abs/2605.19122
Authors: Elynn Chen, Jiayu Li, Zheshi Zheng, Jian Pei
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Tensor-valued data arise naturally in neuroimaging, genomics, climate science, and spatiotemporal networks, where multilinear dependencies across modes carry information that is destroyed under vectorization. Existing approaches either impose a single low-rank structure, which can miss localized signal, or treat the tensor as a long vector, which discards its multiway geometry. We propose a Dual-Channel Tensor Neural Network (DC-TNN) that decomposes each tensor input into a low-rank core and a sparse refinement, and processes the two components through coupled neural channels. The framework is structure-agnostic and accommodates CP, Tucker, and tensor-train cores within a single architecture. For estimation, we establish non-asymptotic risk bounds for the DC-TNN estimator that decompose into network approximation, core estimation, and refinement-selection terms, and show that the effective dimension is determined jointly by the core rank and refinement sparsity rather than by the ambient tensor size. For inference, we develop a structure-aware conformal ROC procedure that calibrates within the core-refinement latent space and produces ROC and AUC confidence bands with finite-sample, distribution-free coverage. Building on this, we propose a conformal structure selector that, to our knowledge, is the first distribution-free procedure for choosing among candidate tensor decompositions with finite-sample validity. Simulations and an analysis of a protein dataset demonstrate competitive predictive accuracy, reliable uncertainty quantification, and consistent recovery of the tensor structure.

[LG-122] Learning Interpretable Point-Based Clinical Risk Scores via Direct Optimization

Link: https://arxiv.org/abs/2605.19113
Authors: Ying Cui, Albert M Li, Vivek Charu, Yeon-Mi Hwang, Tina Hernandez-Boussard, Lu Tian
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 23 pages, 4 figures

Abstract:Many clinical risk scores are deployed as additive rules with nonnegative integer points assigned to relevant binary predictive features. These integer weights not only make the score easier to use in practice but also promote sparsity in the resulting prediction model. Such risk scores are often derived by first fitting a regression model and then rounding the estimated coefficients to the nearest integer after appropriate scaling. This approach is computationally fast but does not guarantee optimality of the resulting score. Alternatively, one may search over all possible integer weights to directly optimize a value function by posing the problem as an integer programming task. However, the associated computational burden can be substantial, especially when the value function is nonconcave or even discontinuous. In this paper, we develop new machine learning algorithms that employ a flexible greedy optimization strategy to learn such additive scoring directly under explicit and sensible optimality objectives. We apply the proposed method to a large electronic health record (EHR) cohort in Epic Cosmos to construct an integer-weighted comorbidity score for measuring the risk of post-discharge mortality. We also conduct a simulation study to examine the finite-sample operating characteristics.
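
The "fit, scale, then round" baseline that this abstract contrasts with direct optimization can be sketched in a few lines. This is only an illustration of that baseline, not the paper's greedy algorithm. The feature names and coefficient values in `COEFS` are hypothetical, and choosing the smallest positive coefficient as the scaling unit is one possible convention among several.

```python
# Hypothetical fitted log-odds coefficients for binary features (illustrative only).
COEFS = {"age_over_65": 0.92, "diabetes": 0.45, "heart_failure": 1.38}


def round_to_points(coefs: dict) -> dict:
    """Scale by the smallest positive coefficient, round, and clip at zero
    so every feature receives a nonnegative integer number of points."""
    unit = min(c for c in coefs.values() if c > 0)
    return {name: max(0, round(c / unit)) for name, c in coefs.items()}


def total_score(points: dict, patient: dict) -> int:
    """Additive score: sum the points of the features present for a patient."""
    return sum(points[name] for name, present in patient.items() if present)


if __name__ == "__main__":
    points = round_to_points(COEFS)
    patient = {"age_over_65": True, "diabetes": False, "heart_failure": True}
    print(points, total_score(points, patient))
```

Rounding is fast but, as the abstract notes, nothing guarantees that the rounded weights optimize the target value function, which is the motivation for optimizing over integer weights directly.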

[LG-123] Provably Data-driven Lagrangian Relaxation for Mixed Integer Linear Programming ICML2026

Link: https://arxiv.org/abs/2605.19052
Authors: Tung Quoc Le, Anh Tuan Nguyen, Viet Anh Nguyen
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to ICML 2026

Abstract:Lagrangian Relaxation (LR) is a powerful technique for solving large-scale Mixed Integer Linear Programming (MILP) problems, particularly those with decomposable structures, such as vehicle routing or unit commitment problems. By relaxing the coupling constraints, LR enables parallel subproblem solving and often yields tighter dual bounds than standard linear programming relaxations, which is crucial for efficient branch-and-bound pruning. While recent empirical work has shown promising results using machine learning to predict the Lagrangian multipliers, a theoretical understanding of such methods remains an open question. In this work, we bridge this gap by analyzing the problem of learning LR through the lens of Data-driven Algorithm Design, i.e., a statistical learning problem over a distribution of problem instances. Our contributions are as follows: first, we derive a generalization bound of $\mathcal{O}(s^{1.5}/\sqrt{N})$ for the learned multipliers, where $s$ is the number of coupling constraints and $N$ is the sample size. Second, we provide a minimax lower bound of $\Omega(s/\sqrt{N})$, proving that a linear dependency is unavoidable. Third, we constructively close this theoretical gap by proving that Stochastic Gradient Ascent (SGA) with averaging achieves the minimax optimal rate $\Theta(s/\sqrt{N})$. Finally, we extend our framework to the learning-to-warm-start setting, proving that it achieves a fast, minimax-optimal rate of $\Theta(s/N)$ and establishing a theoretical advantage over direct multiplier prediction.
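
To make the role of the multipliers concrete, here is a minimal sketch of Lagrangian relaxation on a tiny binary covering problem. The instance (`COSTS`, `WEIGHTS`, `DEMAND`) is invented for illustration and has a single coupling constraint; it does not come from the paper, and the code shows only the classical dual bound, not the learning procedure the paper analyzes.

```python
from itertools import product

COSTS = [4.0, 3.0, 5.0]    # hypothetical item costs c_i
WEIGHTS = [2.0, 1.0, 3.0]  # hypothetical coupling coefficients a_i
DEMAND = 3.0               # coupling constraint: sum_i a_i * x_i >= DEMAND


def brute_force_opt() -> float:
    """Exact optimum of min c.x over binary x with a.x >= DEMAND, by enumeration."""
    best = float("inf")
    for x in product((0, 1), repeat=len(COSTS)):
        if sum(a * xi for a, xi in zip(WEIGHTS, x)) >= DEMAND:
            best = min(best, sum(c * xi for c, xi in zip(COSTS, x)))
    return best


def lagrangian_bound(lam: float) -> float:
    """Dual function for lam >= 0. Relaxing the coupling constraint makes the
    problem separable: each x_i is set to 1 only if its reduced cost is negative."""
    return lam * DEMAND + sum(min(0.0, c - lam * a) for c, a in zip(COSTS, WEIGHTS))


if __name__ == "__main__":
    opt = brute_force_opt()
    for lam in (0.0, 1.0, 1.7, 2.5):
        print(f"lambda={lam:.1f}  dual bound={lagrangian_bound(lam):.3f}  optimum={opt:.3f}")
```

Every nonnegative multiplier gives a valid lower bound on the optimum (weak duality), but the quality of the bound depends strongly on the multiplier chosen, which is why predicting good multipliers is worthwhile.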

[LG-124] Conformal Prediction via Transported Beta Laws

Link: https://arxiv.org/abs/2605.19024
Authors: Thiago R. Ramos, Helton Graziadei, Luben M. C. Cabezas
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:Split conformal prediction provides finite-sample marginal coverage under exchangeability, but this guarantee averages over the random calibration sample. We study instead the law of the calibration-conditional coverage induced by a realized conformal threshold. In the continuous i.i.d. setting this law is exactly $\mathrm{Beta}(k, n+1-k)$, so the usual marginal guarantee corresponds to its mean. We take this beta law as a finite-sample reference object and quantify departures from it using Wasserstein distances on $[0,1]$. The framework yields direct bounds on marginal coverage gaps and on bad-calibration probabilities, and separates different sources of non-i.i.d. behavior according to how they deform the beta reference: test-side shift acts through a transport map on the coverage scale, while calibration dependence changes the order-statistic law itself. We instantiate the framework in scale-shift, clustered, and stationary mixing settings, where the induced deformations can be characterized explicitly or through Berry-Esseen approximations. Simulations on dependent processes confirm that the first-order approximation tracks the empirical Wasserstein distance even at moderate sample sizes.
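
The Beta reference law mentioned in this abstract can be checked with a short simulation. The sketch below covers only the continuous i.i.d. case and is not the paper's code. It relies on the standard fact that, after applying the score CDF, the coverage of the k-th smallest of n calibration scores is the k-th order statistic of n uniforms; the function names and the values n = 99, k = 90 are choices made here for illustration.

```python
import random


def conditional_coverage(n: int, k: int, rng: random.Random) -> float:
    """Coverage of the conformal threshold given one calibration sample.

    With continuous i.i.d. scores, the probability that a fresh score falls
    below the k-th smallest calibration score equals the k-th order statistic
    of n Uniform(0, 1) draws.
    """
    u = sorted(rng.random() for _ in range(n))
    return u[k - 1]


def beta_mean_var(k: int, n: int) -> tuple:
    """Mean and variance of Beta(k, n + 1 - k)."""
    a, b = k, n + 1 - k
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def simulate(n: int = 99, k: int = 90, reps: int = 5000, seed: int = 0) -> tuple:
    """Empirical mean and variance of the calibration-conditional coverage."""
    rng = random.Random(seed)
    draws = [conditional_coverage(n, k, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    return mean, var


if __name__ == "__main__":
    print("simulated:", simulate())
    print("Beta law: ", beta_mean_var(90, 99))
```

The mean k/(n+1) is the familiar marginal guarantee, while the variance shows how much the realized coverage can differ from one calibration sample to another.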

[LG-125] Hyrax: An Extensible Framework for Rapid ML Experimentation and Unsupervised Discovery in the Era of Rubin, Roman, and Euclid

Link: https://arxiv.org/abs/2605.18959
Authors: Aritra Ghosh, Drew Oldag, Michael Tauraso, Andrew J. Connolly, Peter Ferguson, Derek Jones, Gourav Khullar, Argyro Sasli, Samarth Venkatesh, Gracia Wang, Maxine West, Dylan Berry, Neven Caplar, Colin Orion Chandler, Tanawan Chatchadanoraset, Michael W. Coughlin, Melissa DeLucchi, Alexandra Junell, Diego Miura, Felipe Fontinele Nunes, Wilson Beebe, Doug Branton, Sandro Campos, Liam Cunningham, Mi Dai, Jeremy Kubica, Konstantin Malanchev, Rachel Mandelbaum, Sean McGuire, Imad Pasha, Dan S. Taranu, Tianqing Zhang
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: 28 pages, 20 figures, submitted to AJ

Abstract:The NSF-DOE Vera C. Rubin Observatory, Roman Space Telescope, Euclid, and other next-generation surveys will deliver imaging, spectroscopic, and time-domain data at scales that increasingly shift the bottleneck in astronomical machine learning (ML) projects from model design to infrastructure. We present Hyrax, an open-source, modular, GPU-enabled Python framework that supports the full ML lifecycle in astronomy: from data acquisition and training to inference and experiment comparison, with capabilities including multimodal dataset support, integrated vector databases for similarity search, and interactive two- and three-dimensional latent-space exploration for unsupervised discovery. We demonstrate Hyrax’s versatility through five representative applications on real survey data: (i) unsupervised representation learning on $\sim 4\times10^5$ Rubin Legacy Survey of Space and Time (LSST) Data Preview 1 (DP1) galaxies, surfacing new merger and low-surface-brightness candidates missing from reference Euclid and Dark Energy Survey catalogs, while also isolating imaging artifacts, all without labeled training data; (ii) hybrid density-based clustering for identifying cluster-scale gravitational lens candidates in DP1 data; (iii) multimodal early-time transient classification in the Zwicky Transient Facility leveraging light curves, spectra, images, and metadata; (iv) supervised false-positive filtering in shift-and-stack searches for distant solar system objects in the Dark Energy Camera Ecliptic Exploration Project survey; and (v) supervised detection of semi-resolved dwarf galaxies in Hyper Suprime-Cam and LSST-like imaging using synthetic source injection. Together, these results demonstrate that Hyrax provides astronomy-specific ML infrastructure that enables systematic discovery and rapid methodological iteration across next-generation astronomical surveys.

[LG-126] Bayesian Latent Space Models for Graphs Are Misspecified: Toward Robust Inference via Generalized Posteriors

Link: https://arxiv.org/abs/2605.18927
Author: Aldric Labarthe (CB, UNIGE)
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)

Abstract:Bayesian latent space models offer a principled approach to network representation, but rely on correct specification of both geometry and link function. Real-world networks often violate these assumptions, exhibiting geometric mismatch and structural anomalies that break standard metric properties. We show that such misspecification pushes the data-generating distribution outside the model class, causing Bayesian inference to become overconfident and poorly calibrated. To address this, we propose a generalized posterior framework for random geometric graphs. We introduce Link-Sequential R-SafeBayes, a method that exploits dyadic conditional independence to estimate prequential risk and adaptively tune posterior regularization. Experiments on synthetic and real-world networks demonstrate improved calibration, better link prediction performance, and a reliable criterion for selecting latent geometries across Euclidean, spherical, and hyperbolic spaces.

[LG-127] A Logistic Regression Model to Predict Malaria Severity in Children

Link: https://arxiv.org/abs/2605.18900
Authors: Mary Opokua Ansong, Asare Yaw Obeng, Samuel King Opoku
Categories: Other Quantitative Biology (q-bio.OT); Machine Learning (cs.LG)

Abstract:One of the main causes of death around the globe is malaria. Researchers have sought to develop predictive models for malaria outbreaks based on meteorological data, climate data and the breeding cycle of Plasmodium, the causative agent of malaria. This study predicts the severity of malaria based on environmental and biological factors. A logistic regression model was developed in this study to predict the severity of malaria based on such factors as sickle cell disease, stagnant water, garbage dump, wet lawns, and the use of treated mosquito nets, with an 83.3% accuracy rate. The study was carried out in the Bosomtwe District of Ghana with 417 respondents. It was deduced that although children in the District are highly prone to malaria infection, the severity is very low. The study recommends that, during machine learning model development, a good sample size alone is not sufficient: a good representation of the various class labels in the sample is equally important.

[LG-128] Towards Discovery of Polymers for Insulin Delivery via Physics-Grounded Agentic Workflows

Link: https://arxiv.org/abs/2605.18831
Author: Martins Otun
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)

Abstract:Cold-chain storage limits access to insulin for hundreds of millions of people; a thermally protective patch polymer could help, but the design space is too large for exhaustive experiment. Starting from that problem, we narrow to an agentic workflow: a large language model (LLM) calls physics-based tools through the Model Context Protocol (MCP), searching the discrete PSMILES space under a budget of OpenMM Packmol-matrix evaluations. The LLM acts as an implicit acquisition function conditioned on a persistent “discovery world”: hypotheses, literature claims, and simulation outcomes updated each iteration. Under matched oracle budgets, the best autonomous campaign reaches an insulin-polymer interaction energy of -2263 kJ/mol, outperforming reinforcement-learning baselines by 68% and Bayesian optimization by 19%. Three independent campaigns converge on one structural motif (dense hydrogen-bond donors and acceptors per repeat unit) while physics checks reject infeasible packings and name-structure mismatches before they steer the next step. The science stage is CPU-bound and runs on commodity hardware. More broadly, the same architecture and workflow designed here applies to other protein-stabilization tasks whenever a tractable screening oracle is available.

[LG-129] Noise scheduling and linear dynamics in diffusion models on Lie groups

Link: https://arxiv.org/abs/2605.17326
Author: Javad Komijani
Categories: High Energy Physics - Lattice (hep-lat); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:We investigate the role of the noise schedule in diffusion processes on Lie groups, with particular emphasis on applications to lattice gauge theory. We show that a specific noise schedule leads to a linear decay of the expectation value of the Wilson action as a function of diffusion time. We compare this with Euclidean diffusion models, where such behavior requires an explicitly designed drift term, while in the Lie-group setting it arises naturally.
